Case study: AlphaGo

As part of a Case Studies seminar at ETH, I recently studied and presented the paper behind AlphaGo, titled “Mastering the game of Go with deep neural networks and tree search”. This is a summary of my findings about this sophisticated AI software.

Google DeepMind made quite a splash in 2016 when their AI AlphaGo defeated the Go world champion, Lee Sedol. To quote the abstract of DeepMind’s paper:

This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

As I first started reading about AlphaGo, there were two questions that naturally came to mind: what is this game, Go, and what makes it so hard for AI?

What is Go?

I’m not an expert go player, but here’s the general idea: two players (white and black) take turns to place stones on any of the unoccupied positions of a 19×19 board. Stones are placed on the intersections of the board (as opposed to the cells, as in chess). There are also smaller versions of the board, but the real challenge is the full-sized version of the game.
The objective of the game is to control as much territory (intersections) as possible. A player controls all intersections under stones of his colour or completely surrounded by stones of his colour.
Furthermore, a stone or group of adjacent stones completely surrounded by stones of the opponent’s colour is removed from play.
Players can pass a turn if they wish, and the game ends when both players pass consecutively.

There are many variations of the rules and some details, but this is the general gist of the game. As you can see, it’s actually fairly simple: there are no real limitations of where we can play a stone. As we’ll see, the simplicity of these rules, as well as the size of the board, is key to the difficulty of the game.

Why is Go so hard for AI?

We already know that there are algorithms that can beat world champions in many other games, such as chess or Scrabble. Most famously, IBM’s DeepBlue defeated chess grandmaster Garry Kasparov (who is still considered one of the best chess players ever) in 1997. So isn’t defeating Go professionals simply a question of applying the same strategy, but using a more powerful computer?

Games of perfect information

As far as game theory is concerned, Go can be modelled as a two-player, alternating, zero-sum game of perfect information.

A game of perfect information is one where each player knows everything there is to know about the state of the game at every point in the game. Chess is a good example: just one glance at the board positions is enough to know everything about the game in progress. Poker, on the other hand, is a game of imperfect information, since you as a player don’t know what other players have in their hand.

Two-player alternating is fairly self-explanatory: there are two players, and they play in alternating turns.

Finally, a zero-sum game is one where every gain for one player has an equal and opposite loss for the opponent. Hence the sum of the rewards is zero. In Go, this means we only consider terminal rewards, i.e., whether a player loses or wins. This is a lazy approach: by how much we win is not relevant, as long as we win.

The optimal value function

Zero-sum games have a unique optimal value function $v^*(s)$ which, given a state $s$ of the game (board position), predicts who will win the game, assuming perfect play from both players. If we had some black box that told us the output of this function, we could win any game: at each turn, just make a move that brings you into a state for which $v^*$ predicts a win.

Of course, we don’t have such a convenient black box, but we could still win, in principle, by trying out every possible sequence of moves in our head (or RAM), and then picking one which ends up in a win. This would generate a search tree of about $b^d$ possible sequences of moves. $b$ is the branching factor, i.e. the typical number of moves available to us, and $d$ is the depth of the tree, i.e. the typical number of moves in a full game.

As you can already guess, this strategy is completely infeasible for anything more complex than tic tac toe. Imagine a game where $b \approx 10$ and $d \approx 100$. One terabyte is in the order of $10^{12}$ bytes, so just storing the search tree for such a game would require something in the order of $10^{88}$TB of RAM. Which is, of course, ludicrous.
In chess, $b \approx 35$ and $d \approx 80$, so exhaustive search is already absurd. In Go, $b \approx 250$ and $d \approx 150$, which makes for an unimaginably larger tree.

Search space reduction

So, how can we reduce this search space to a manageable size? We basically have two options: one is to reduce the depth $d$ of the tree, the other is to reduce the number of branches we check at each move (thus reducing the branching factor $b$).

To reduce the depth $d$, we will only evaluate a certain number of moves down each path. Then we stop simulating the game, and use some kind of value function $v$ to approximate the optimal value function, $v(s) \approx v^*(s)$. In other words, we use some criterion to tell which player will win the game by just looking at the board (or in general the state $s$). Typically, this function returns a number between $0$ and $1$, representing the probability of winning for a certain player.
This approach of course still leaves us with a tough question: how to find a good value function $v(s)$? In chess, a naïve approach would be to attribute a value to each piece, then say that the player with the highest total value on the board will win. This is of course not a very good metric – it completely ignores board positions. In the case of Deep Blue, this function was extremely sophisticated, hand-crafted by experts in the game and, to an extent, calibrated by looking at recorded professional games.

To reduce the branching factor $b$, we want to eliminate moves that are obviously bad, and focus on the good ones. We do this by introducing a policy function $p$, which, given a state $s$, outputs a probability distribution over the set of possible actions $A$. We then sample the moves randomly, according to this distribution. Once again, we are faced with the difficult problem of determining a good policy.

Complexity vs complicatedness

And this finally brings us to the core of why Go is so much harder than chess for a computer (beyond the sheer size of the board): hand-crafting good policies and evaluation functions is doable for chess (as proven by DeepBlue), but it is generally considered nigh on impossible for Go. Why? One reason is that chess is arguably more complicated, but less complex than Go.
A complicated game has many rules, and its difficulty lies more in learning all the rules – once you have wrapped your mind around them, it is relatively easy to decide what a good move is. A complex game, on the other hand, may or may not have many rules, but even when you know them all, it’s really hard to decide what makes a good move. In a way, more rules simplify your choice of good moves, because they restrict your possible moves, and because they give you many criteria to evaluate a move by.
Chess has many rules: each piece moves in a different way, pawns have two ways of moving, plus you have castling et cetera. Go, on the other hand, can be fully explained in a couple of paragraphs. Add to that the bigger board, and the fact that there are virtually no illegal moves – and you can begin to understand why some times even grand masters say things like “I don’t know, it just felt like the right move”.

Enter reinforcement learning: MCTS

So can we somehow learn good value functions and policies “on the fly”? Reinforcement learning comes to our aid, with an algorithm called Monte Carlo Tree Search.
As the name suggests, this is a Monte Carlo method, which basically means that it tries to make a good guess based on repeated trials. Imagine simulating a game, where each player makes a completely random move: when you reach the end, you keep a mental note of who won. By simulating many such games, you will gradually get a feel for which initial move has the highest chances of bringing you to a win.
MCTS is basically a refined version of this brute-force approach. It consists of repeating four steps recursively, for as long as we have time to make our move. We gradually explore the search tree, keeping track of action values Q for each node (an estimate of the expected end-game reward after making that move) and the number of times n the algorithm has visited a node.

Alphago's version of MCTS

Selection: traverse the current tree by picking the move with the maximum action value Q at each step, until we reach a state that has unexplored moves (leaves of the tree). If our tree is empty (i.e. at the start of the game), go straight to b. The action value Q has been calculated by previous iterations of the algorithm, and is stored per state (board position, i.e. tree nodes).
Expansion: pick one of the unexplored tree leaves to evaluate. Calculate what state that action brings the game to, and initialize Q = unknown and n = 1.
Evaluation: calculate an estimate for the action value Q of the expanded move. We do this by performing a rollout: we simulate a game until the end, by making moves according to some policy p (called rollout policy – which could, in principle, be completely random.
Backup: retrace our steps back up the tree, incrementing each parent’s action value by the new move’s action value, and adding one to the visit count n.

There are a couple of caveats to the steps as indicated above. First of all, in reinforcement learning we always want to balance exploration versus exploitation: if we find a good move, the current algorithm will only ever explore beneath that move – but there may be an even better alternative (or, our estimate for the action values of the other moves may be off, making a good move look like a bad one by pure bad luck). So to encourage horizontal exploration as well, in the selection step we add a bonus $u(s, a)$ for trying actions that have been tried less often:
$$a_t = \operatorname{argmax}a (Q(s_t, a) + u(s_t, a))$$
$$u(s, a) = \frac{P(s, a)}{1 + n(s, a)}$$
This bonus $u$ decreases for nodes we have visited more often (larger $n$). $P(s, a)$ could just be $1$, or, as in the case of AlphaGo, it could be a _prior probability distribution, calculated by some other means (effectively, a prior estimate of how good a move is).

The second caveat is that we still need a good rollout policy $p_{\pi}$ for the evaluation step. This could of course assign equal probability to each move, but if we could favour good moves over clearly bad ones, MCTS would converge faster towards good moves. So the question remains open: how to we pick a good policy?

MCTS has some properties that are very nice for competitive play: it can very quickly give us a move to make (though it will initially be somewhat random), and then it proceeds to improve it. It will only ever improve a move, meaning the more time you give it to “think” the better it gets, and it can keep improving indefinitely – given unlimited time, it will converge towards optimal play (it would eventually exhaust the entire search tree). This means in a tournament setting, where we typically have a fixed amount of time to make a move, we can always use all of that time to improve our move, but we also don’t risk going past the time limit without having a move at all.

AlphaGo’s algorithm

AlphaGo is build on a variant of MCTS, which the authors call “Asynchronous Policy and Value MCTS”. It is basically MCTS on steroids, combined with deep learning for the policy and value functions.

Combining tree search with deep learning

The DeepMind team used deep convolutional neural networks for the policies and value functions. These have proven themselves extremely effective in computer vision, for tasks like image classification, object recognition etc.

AlphaGo's training pipeline

First off, they trained two networks by supervised learning, based on a big database of recorded human expert games. These were the Rollout policy $p_\pi$ and the Supervised Learning policy $p_\sigma$. They both have the same objective: given as input the state of the board, they should predict the move of the expert player. The prediction comes in the form of a probability distribution over the possible moves – i.e., a policy. The difference between the two networks is in size: the rollout policy network $p_\pi$ is much smaller than the SL policy network $p_\sigma$. This makes it much less accurate (about 24% accuracy versus the larger network’s 57%), but is also much faster to evaluate (just $2\mu s$ versus $3$ms). This difference is crucial, because the rollout policy will be evaluated much more often than the SL policy. It should be noted that that 57% predictive accuracy of the SL network is by itself a staggering achievement: the previous state-of-the-art was around 44%.

Next, they made a copy of the SL policy network $p_\sigma$, and trained it by self-play: the would pit it against itself, and use the results of the matches to further improve the playing strength of the policy. This allowed it to move past simply imitating human players, and explore the possibilities of the game. The resulting network is called the Reinforcement Learning policy network, $p_\rho$.

Finally, a fourth and final network focuses on position evaluation. Ideally, this would be the optimal value function $v^*(s)$, but since that’s not available, they instead approximate the value function for their strongest policy $p_\rho$. They do this by training a neural network $v_\theta$ to predict the winner of the games of self-play originating from the training of the RL policy network $p_\rho$. The structure of $v_\theta$ is similar to that of $p_\rho$ and $p_\sigma$, but instead of outputting a probability distribution over the possible moves, it outputs a scalar, which represents the probability of the current player winning.

Asynchronous Policy and Value MCTS

AlphaGo’s algorithm, APV-MCTS, is effectively a modified version of MCTS on steroids. It is optimized to run on a supercomputer, making use of as much hardware as is available to it. Several threads explore the tree simultaneously, and they have special penalties to the action value to discourage the multiple threads from exploring the same branch at the same time (which would cause losses in performance). Evaluations of policy and value networks are added to queues which are asynchronously processed by dedicated GPUs (which are very good at evaluating neural networks, especially large ones).
The full program used 48 CPUs and 8 GPUs. They also implemented a distributed version of the program, which used a staggering 1202 CPUs and 176 GPUs!

Results

In October 2015, AlphaGo challenged and defeated Fan Hui, the European Go champion, thus cementing its place in history as the first computer program to defeat a human professional player. By march 2016, it had improved greatly, and defeated world champion Lee Sedol 4 games to 1 in a highly publicized 5-game match which made quite the splash worldwide. Below you can see the Elo ratings of AlphaGo in its various configurations, as well as a comparison to other Go AIs, at the time of publication (January 2016).

Performance evaluation of AlphaGo

Overall, AlphaGo is an important milestone in the history of AI: it represents a major step towards generalizing the scope of AI software. The policies and value functions were learned by imitation and self-play – this means that no game-specific information was imparted onto it, and it could in principle be trained for a wide spectrum of games. In fact, DeepMind went straight on and in the following couple of years published AlphaGo Zero, a version of AlphaGo trained completely through self-play, and AlphaZero, a single AI capable of tackling Go, Chess and Shogi at the same time.

You can find the original Nature paper about AlphaGo, which I studied in writing this article, online here.

Note: the images on this page are taken directly from the original Nature paper. As such, ownership remains with the publishers of the paper.

The Eclectic Coder

Sean Bone's portfolio