### The AlphaGo paper

Go is a topological counting game that has been challenging for a computer to play due to its global phase-transition like game dynamics, requiring accurate reads of position early in the game. When the AlphaGo paper (preprint) first came out in January purporting a new system that plays Go at a human-professional level, I didn’t have a chance to read it thoroughly, partly because the paper is badly written (yes, I said it), and partly because it was a lot of information. Now that the DeepMind Challenge Match against Lee Sedol is happening, with a surprise nearly every game, I thought to give it another go (no pun intended).

The main objective is to extract from the paper four key pieces of information: (1) How did computers play Go before, and what was the state-of-the-art before AlphaGo? (2) What is the novelty of AlphaGo from which it derives its improved performance? (3) How is AlphaGo actually implemented? (4) What are the future and the limitations of this approach?

**I. How did computers play Go before?**

There were two main approaches: minimax search and Monte Carlo tree search (MCTS). Suppose a value \(v(s)\) is assigned to every game state \(s\). Minimax search (to some depth) finds from all continuations of the current state the one that maximizes the terminal state value under perfect (i.e. minimax) play, and chooses the next action based on that. Suppose further that a stochastic (rather than minimax) playing policy \(p(a|s)\) is defined for every state \(s\) and action \(a\). MCTS samples an empirical value function by playing according to \(p(a|s)\) to termination and marginalizing terminal rewards over \(a\); the empirical value function is then used to update \(p(a|s)\) appropriately as playing tree weights, which becomes the actual playing policy — possibly with some modification terms. This playing policy can be considered a soft approximation of minimax. For details see this.

For the same amount of computation, MCTS has been more successful in playing Go than minimax (for obvious reasons), and some variant or another can be considered the state-of-the-art. It still suffers from inaccuracy in approximating the “true” value function due to subpar sampling or just not sampling enough of the huge continuation space. Alternatively there were other recent ideas involving modeling the actual playing policy \(p(a|s)\) directly by neural networks to mimic human play, but the results have been poor because while predictions of individual moves can be reasonably accurate (~50%), the resulting play from the mispredictions can be disastrous.

**II. What is the novelty of AlphaGo?**

The novelty of AlphaGo appears to be:

- the introduction of a state value function based on neural networks, which have more generalizing power than marginals from empirical playing samples;
- the combination of this value function with traditional MCTS and a neural-network based policy function \(p(a|s)\) that predicts human play.

In essence, AlphaGo combines two existing state-of-the-art strategies that are each poor in some way, and adds one single improvement to the approximation of the value function. This is the so-called “Asynchronous policy and value MTCS” (APV-MCTS) algorithm.

**III. How is AlphaGo implemented?**

Like certain truncated versions of MCTS, AlphaGo maintains a playing tree that begins at the current state and ends at non-terminal leaf states. Every edge transition follows a handful of statistics combining into a deterministic in-tree playing policy that chooses action \(a^*=\arg\max_a (Q(s,a)+u(s,a))\). The specifics are not important but \(Q(s,a)\) is an average reward (or “action value”) of choosing action \(a\) due to values in the subtree, and \(u(s,a)\) is some term containing an *a priori* policy \(P(s,a)\) when rewards have not been computed. In steady-state operation, \(P(s,a)=p_\sigma(a|s)\), which is not a probability but the action-classification output of a neural network taking low-level 19x19x48 board features as input. This is the so-called *supervised learning (SL) policy network*, and is relatively expensive to compute (3 ms per evaluation at the lowest setting). So before this (asynchronous GPU) computation finishes, a predictor taking 9 high-level game-pattern features, the so-called *tree policy* \(p_\tau(a|s)\), substitutes for \(P(s,a)\).

When a leaf node is reached by playing \(a^*\), two things happen. First, leaf rewards, i.e. leaf state values, are computed, firstly by traditional MCTS-style stochastic play to termination according to a *rollout policy* \(p_\pi(a|s)\) (an even smaller predictor than the tree policy, 6 features, taking 2 μs per evaluation) and secondly by direct evaluation of a *value network* \(v_\theta(s)\), which is a neural network trained by sampling continuations __using the reinforcement learning (RL) policy network__ \(p_\rho(a|s)\) to termination and marginalizing terminal rewards over \(a\). So the value network provides a more general and perhaps accurate state value, mimicking “intuition,” while the weaker rollout policy samples the specific situation at hand, mimicking “sanity checking.”

This is basically it, save for some miscellany: new computations of values and leaf visits obviously back-propagate to update the playing tree statistics, and there is some engineering tricks like caching and distributed computing, etc. The policy and value networks are convolutional networks, both having about a dozen hidden layers. The training of these networks is initialized by KGS gameplay data to generate the SL network, and updated further by de-correlated self-play samples through reinforcement learning to generate the RL network. The tree policy and the rollout policy appear to be simple linear softmax predictors, trained in the classical way, I suppose, though the authors do not say much here; these are mostly hacks to speed up computation anyway.

**IV. What is the future and what are limitations?**

The aim of the AlphaGo system is clear. MCTS was already a good starting template for building generic gameplay systems through simulation. AlphaGo adds a generalizing value network to evaluate positions it hasn’t seen, and provides a way to train it. This is obviously a step forward, but we have seen some pretty odd behavior during the matches against Lee Sedol. I’ll leave to another post some thoughts on the “strange” moves it has played, and especially its behavior in Game 4, which was a loss for AlphaGo. Suffice it to say, as far as Go goes, there is additional complexity that has not been adequately modeled by AlphaGo. It’s not even clear that reinforcement learning of a single value function will get all the way to the kind of opening- to middle-game axiomatizations that are critical to expert play.