Exploiting Variance Information in Monte-Carlo Tree Search

Robert Lieck, Vien Ngo, Marc Toussaint

Research output: Contribution to conference › Paper › peer-review

Abstract

In bandit problems as well as in Monte-Carlo tree search (MCTS), variance-based policies such as UCB-V are reported to show better performance in practice than policies that ignore variance information, such as UCB1. For bandits, UCB-V was proved to exhibit somewhat better convergence properties than UCB1. In contrast, no convergence guarantees have so far been established for UCB-V in MCTS. Our first contribution is to show that UCB-V provides the same convergence guarantees in MCTS that are known for UCB1. Another open problem with variance-based policies in MCTS is that they can only be used in conjunction with Monte-Carlo backups but not with the recently suggested and increasingly popular dynamic programming (DP) backups. This is because standard DP backups do not propagate variance information. Our second contribution is to derive update equations for the variance in DP backups, which significantly extends the applicability of variance-based policies in MCTS. Finally, we provide an empirical analysis of UCB-V and UCB1 in two prototypical environments, showing that UCB-V significantly outperforms UCB1 with both Monte-Carlo and dynamic programming backups.
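As a rough illustration of the two policies compared in the abstract, the sketch below gives the UCB1 and UCB-V selection indices (the UCB-V form follows Audibert et al.; the constants zeta and c and the reward range b are illustrative parameters, not values from this paper), together with a generic incremental mean/variance tracker of the kind a Monte-Carlo backup could maintain. This is not the authors' implementation, and in particular it does not reproduce the paper's derived DP backup equations for the variance.

```python
import math

def ucb1_index(mean, n_child, n_parent, c=math.sqrt(2)):
    """UCB1 index: empirical mean plus an exploration bonus that ignores variance."""
    return mean + c * math.sqrt(math.log(n_parent) / n_child)

def ucbv_index(mean, variance, n_child, n_parent, b=1.0, zeta=1.2, c=1.0):
    """UCB-V index (after Audibert et al.): the exploration bonus scales with the
    empirical variance, plus a correction term involving the reward range b.
    zeta, c, and b are illustrative defaults, not parameters from this paper."""
    e = zeta * math.log(n_parent)  # exploration function
    return (mean
            + math.sqrt(2.0 * variance * e / n_child)
            + c * 3.0 * b * e / n_child)

class NodeStats:
    """Running mean/variance of returns (Welford's method), as a Monte-Carlo
    backup could maintain per node; generic sketch, not the paper's DP update."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, ret):
        self.n += 1
        delta = ret - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (ret - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 1 else 0.0
```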
Original language: English
Number of pages: 9
Publication status: Published - 20 Jun 2017
Event: ICAPS workshop: Heuristics and Search for Domain-independent Planning (HSDIP)
Duration: 20 Jun 2017 → …


