Abstract
In bandit problems as well as in Monte-Carlo tree search
(MCTS), variance-based policies such as UCB-V are reported
to show better performance in practice than policies
that ignore variance information, such as UCB1. For bandits,
UCB-V was proved to exhibit somewhat better convergence
properties than UCB1. For MCTS, in contrast, no convergence
guarantees have so far been established for UCB-V. Our
first contribution is to show that UCB-V provides the same
convergence guarantees in MCTS as are known for UCB1.
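For context, here is a sketch of the two index policies as they appear in the bandit literature (Audibert et al.'s UCB-V formulation; the reward bound b and the constants ζ and c come from that formulation, not from this paper):

```latex
% Sketch of the standard indices, assuming rewards bounded in [0, b].
% Arm k has empirical mean \bar{X}_{k,s} and empirical variance V_{k,s}
% after s pulls; t is the total number of pulls so far.
\[
  B^{\text{UCB-V}}_{k,s,t}
    = \bar{X}_{k,s}
    + \sqrt{\frac{2\,V_{k,s}\,\zeta \ln t}{s}}
    + c\,\frac{3\,b\,\zeta \ln t}{s}
  \qquad\text{vs.}\qquad
  B^{\text{UCB1}}_{k,s,t}
    = \bar{X}_{k,s} + \sqrt{\frac{2 \ln t}{s}} .
\]
% UCB-V shrinks exploration for arms whose observed payoff variance is low.
```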
Another open problem with variance-based policies in MCTS
is that they can only be used with Monte-Carlo backups,
not with the recently suggested and increasingly
popular dynamic programming (DP) backups. This is because
standard DP backups do not propagate variance information.
Our second contribution is to derive update equations for the
variance in DP backups, which significantly extends the applicability
of variance-based policies in MCTS.
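To make the idea concrete, here is a minimal sketch of a DP backup that propagates variance alongside the mean, using the law of total variance over successor value estimates (illustrative code with hypothetical names, not the paper's exact update equations):

```python
# Illustrative sketch, not the paper's derivation: a DP backup that
# propagates (mean, variance) of the value estimate via the law of
# total variance. All names and the discount factor are hypothetical.

GAMMA = 0.95  # discount factor (illustrative choice)

def dp_backup(transitions):
    """Back up (mean, variance) from successor value estimates.

    transitions: iterable of (prob, reward, succ_mean, succ_var), where
    prob is the transition probability, reward the immediate reward, and
    (succ_mean, succ_var) the successor's current value estimate.
    """
    # First moment: the ordinary Bellman backup of the mean.
    mean = sum(p * (r + GAMMA * m) for p, r, m, _ in transitions)

    # Second moment of the backed-up return X = r + GAMMA * V(s'):
    # E[X^2] = sum_i p_i * (GAMMA^2 * Var_i + (r_i + GAMMA * mean_i)^2).
    second_moment = sum(
        p * (GAMMA**2 * v + (r + GAMMA * m) ** 2)
        for p, r, m, v in transitions
    )

    # Law of total variance: Var[X] = E[X^2] - (E[X])^2.
    return mean, second_moment - mean**2

# Usage: a node with two stochastic successors.
mean, var = dp_backup([(0.5, 1.0, 2.0, 0.1),
                       (0.5, 0.0, 3.0, 0.2)])
```

Propagating the variance this way makes it available at every interior node, which is exactly what a variance-based policy such as UCB-V needs at each decision point.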
Finally, we provide an empirical analysis of UCB-V and
UCB1 in two prototypical environments, showing that UCB-V
significantly outperforms UCB1 with both Monte-Carlo
and dynamic programming backups.
| Original language | English |
| --- | --- |
| Number of pages | 9 |
| Publication status | Published - 20 Jun 2017 |
| Event | ICAPS workshop: Heuristics and Search for Domain-independent Planning (HSDIP) - Duration: 20 Jun 2017 → … |
Conference
| Conference | ICAPS workshop: Heuristics and Search for Domain-independent Planning (HSDIP) |
| --- | --- |
| Period | 20/06/2017 → … |