Monte Carlo Tree Search in NetLogo

Last modified: November 18, 2020 · 504 views


This post is the continuation of a previous post (in Spanish) about the fundamentals of the Monte Carlo Tree Search method for solving adversarial games. It may be a good idea to start there before reading this one (or with any other source where this background is explained).

You can find all the code in the Adversarial Search folder of the NetLogo-IA Project.

To implement Monte Carlo Tree Search (MCTS) in NetLogo we will need two main structures: states (with rules, or actions) that reflect every configuration of the game/process, and nodes that will allow us to build the game tree.

Although nodes will have a very specific structure (because they apply the MCTS algorithm directly), states have only a few requirements, and their inner structure will depend on the features of the game. In fact, here we will only use a kind of API providing methods to access, modify and create states (and rules/actions):

• state: it will be any structure that works with this interface:

  get-content s         : gets the content of the state s (the game configuration)
  get-playerJustMoved s : player (1 or 2) that has generated the state s
  create-state c p      : returns a new state with content c made by player p

• rules: structure that allows to execute actions on states. It works with this interface:

  apply r s      : returns a new state, the result of applying the rule r to the state s
  get-rules s    : returns a list with the rules applicable to the state s
  get-result s p : returns the result of the game in state s from the point of view of player p

Note: In the implementation all the procedures are prefixed by MCTS:.
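For instance, using only this interface we can already write a complete random playout from any state (the reporter random-playout is our own, not part of the interface; a very similar loop will appear later as the simulation stage of the algorithm):

```
; Play random moves until no rule applies, and report the final state
to-report random-playout [s]
  while [not empty? get-rules s] [
    set s apply (one-of get-rules s) s
  ]
  report s
end
```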

We will show some examples of how to define this structure correctly at the end of this post, when we build real players for some games using the MCTS algorithm.

The data structure that we will use to store the nodes of the tree built by the MCTS algorithm will be:

• MCTSnode: an agent that stores the node structure of the MCTS tree during the execution of the algorithm.

It has the following inner structure:

state        : state stored in the node
wins         : accumulated value of the reward obtained by the tree from this node
visits       : number of plays in the tree that use this node
untriedRules : rules applicable to the state that have not yet been used to generate children

And works with the following interface:

create-MCTSnode s : returns a root MCTSnode with state = s, wins = 0, visits = 0, untriedRules = all applicable rules.

to-report create-MCTSnode [s]
  let rep nobody
  create-MCTSnodes 1 [
    set state s
    set wins 0
    set visits 0
    set untriedRules get-rules s
    set rep self
    set shape "circle"
    set color blue
    set label (word state " W/V:" wins "/" visits)
    ht
  ]
  report rep
end

UCTSelectChild : returns the child of the node with the best UCT value.

to-report UCTSelectChild
  let v visits
  let child max-one-of out-link-neighbors [(wins / visits) + sqrt (2 * (ln v) / visits)]
  report child
end

As we have used $\sqrt{2}$ as the constant to balance exploration and exploitation, this UCT function will work fine when the rewards (wins) from the terminal states are in $[0,1]$.
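The quantity maximized by UCTSelectChild is the classical UCB1 score: for a child with accumulated reward $w_i$ and $n_i$ visits, hanging from a parent with $N$ visits,

$$UCT_i = \frac{w_i}{n_i} + \sqrt{\frac{2 \ln N}{n_i}}$$

and the selection stage descends through the child maximizing this value.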

AddChild r : creates and returns a new child by applying the rule r to the current node.

to-report AddChild [r]
  let rep nobody
  hatch-MCTSnodes 1 [
    set state apply r state
    set wins 0
    set visits 0
    set untriedRules get-rules state
    create-MCTSrule-from myself [
      set rule r
      set label rule
      hide-link
    ]
    set rep self
    set label state
  ]
  set untriedRules remove r untriedRules
  report rep
end

Update result : updates the values of the node taking result as reward.

to Update [result]
  set visits visits + 1
  set wins wins + result
end

parent : returns the parent of the current node.

to-report parent
  report one-of in-MCTSrule-neighbors
end

Once we have defined the auxiliary data structures and the interface to manipulate them, we can program the main procedure, UCT, which will return the best action to take in the current state in order to maximize our reward.

UCT root-state N

Following MCTS, this procedure will repeat the next steps N times, starting from the current state (the root node of the tree):

1. Select a node in the current tree to be expanded.
2. Expand the selected node.
3. Simulate (rollout) one complete game from the newly expanded node.
4. Propagate the result of this game to all the involved nodes in the tree.

Finally, it returns the best child of the root according to the results obtained in the simulations.

The implementation in NetLogo can be as follows:

to-report UCT [root-state iter]
  ask MCTSnodes [die]
  ; Create the root node (with the input state)
  let rootnode create-MCTSnode root-state
  ; Main loop for iter iterations
  repeat iter [
    ; Start again from the root node
    let cnode rootnode
    ; Selection stage:
    ; Using UCT selection, we go from the root node to a current leaf that
    ; has to be expanded
    while [empty? [untriedRules] of cnode and any? [my-out-MCTSrules] of cnode]
    [
      set cnode [UCTSelectChild] of cnode
    ]
    ; Expansion stage:
    ; Expand the previous node with one child (any of them)
    ; if we can expand (i.e., the state/node is non-terminal)
    if not empty? [untriedRules] of cnode
    [
      ask cnode [
        let r one-of untriedRules
        set cnode AddChild r
      ]
    ]
    ; Simulation/Rollout stage:
    ; From this new node we play a random complete game (we use only states,
    ; not nodes, because we only need to know the final state and who is the
    ; winner)
    let s [state] of cnode
    while [not empty? get-rules s]
    [
      let r one-of (get-rules s)
      set s (apply r s)
    ]
    ; Backpropagation stage:
    ; Update the nodes from the expanded one to the root
    while [cnode != nobody]
    [
      ask cnode [
        Update get-result s (get-playerJustMoved state)
        set label (word state " W/V:" wins "/" visits)
        set cnode parent
      ]
    ]
  ]
  ; Return the best found rule
  let rep nobody
  ask rootnode [
    ; return the move that was most visited
    set rep max-one-of out-MCTSrule-neighbors [visits]
  ]
  report [rule] of MCTSrule [who] of rootnode [who] of rep
end

By default, the tree is hidden... you can show it by adding some simple instructions to see how it evolves.

Some examples

Nim

Nim is a mathematical strategy game where two players take turns removing (i.e., nimming) objects/chips from a heap or pile (there are also variants with several heaps). On every turn, a player can remove from 1 to 3 objects. The winner is the player who removes the last chip.

Although there exists a winning strategy for this game, and it is possible to provide a perfect heuristic to play it, we will ignore it and prepare the state structure in order to play a blind MCTS:

The states will be given by a list with two numbers:

[content player]

In this case, the content is the number of chips in the heap, and the player will be 1 or 2.

The rules are numbers... the number of chips to remove.

The interface to manage these states correctly is:

; Get the content of the state
to-report get-content [s]
  report first s
end

; Get the player that generated the state
to-report get-playerJustMoved [s]
  report last s
end

; Create a state from the content and player
to-report create-state [c p]
  report (list c p)
end

; Get the rules applicable to the state
to-report get-rules [s]
  report (range 1 (min (list 4 (1 + get-content s))))
end

; Apply the rule r to the state s
to-report apply [r s]
  let c get-content s
  let p get-playerJustMoved s
  report create-state (c - r) (3 - p)
end

; Result of the final state s from the point of view of player p
to-report get-result [s p]
  let pl get-playerJustMoved s
  ifelse pl = p [report 1] [report 0]
end
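With these definitions in place, a quick worked example on a heap of 5 chips (all values follow directly from the procedures above):

```
; let s create-state 5 1       ; 5 chips, player 1 has just moved
; get-rules s         -> [1 2 3]   ; we may remove 1, 2 or 3 chips
; apply 2 s           -> [3 2]     ; player 2 removed 2 chips, 3 remain
; get-result [0 2] 2  -> 1         ; player 2 took the last chip and wins
; get-result [0 2] 1  -> 0         ; ... so player 1 loses
```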

In fact, it will be very common to use a structure very similar to this one; in many cases we will only have to adapt the procedures get-rules, apply and get-result to the features of our game.

After that, preparing a main procedure to play against the computer is as easy as the following one:

; A procedure that removes n chips (for the human player). After that, it
; tests if the game is over and, otherwise, lets the computer play (with UCT).
to remove-chips [n]
  if n <= num_chips [
    set num_chips num_chips - n
  ]
  ifelse num_chips = 0
  [ user-message "You Win!!!" ]
  [ let m UCT (list num_chips 1) Max_iterations
    user-message (word "I remove " m)
    set num_chips num_chips - m
    if num_chips = 0 [user-message "I Win!!!"]
  ]
end

where num_chips is a global variable storing the current number of chips in the heap, and Max_iterations is the number of iterations that the UCT algorithm will perform to search for the optimal move.
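By the way, the winning strategy we decided to ignore is very simple: a position is lost exactly when the number of chips is a multiple of 4, so the perfect move (when one exists) removes num_chips mod 4 chips. A minimal sketch (the reporter perfect-move is our own, not part of the project code) that can be used to sanity-check the moves proposed by UCT:

```
; Perfect Nim move (illustrative only)
to-report perfect-move [n]
  ifelse n mod 4 = 0
    [ report 1 + random 3 ]   ; lost position: any move is equally bad
    [ report n mod 4 ]        ; winning move: leave a multiple of 4 chips
end
```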

Tic Tac Toe

In this case the states will keep the same structure as in the previous example, but the content will now store a list with the values of the board, [$p_0$ $p_1$ $p_2$ $p_3$ $p_4$ $p_5$ $p_6$ $p_7$ $p_8$], where $p_i$ can be $0$ (no piece), $1$ (a piece from player 1), or $2$ (a piece from player 2):

p0 p1 p2
p3 p4 p5
p6 p7 p8

The rules are positions (from 0 to 8) where the player puts their piece.

The procedures that must be adapted to these structures are:

; Get the rules applicable to the state
to-report get-rules [s]
  let c get-content s
  ; Filter the empty places on the board
  report filter [x -> item x c = 0] (range 0 9)
end

; Apply the rule r to the state s
to-report apply [r s]
  let c get-content s
  let p get-playerJustMoved s
  ; Fill the place r with the number of the current player
  report create-state (replace-item r c (3 - p)) (3 - p)
end

; Result of the state s from the point of view of player p
to-report get-result [s p]
  let pl get-playerJustMoved s
  let c get-content s
  ; L holds the 8 lines of the board
  let L [ [0 1 2] [3 4 5] [6 7 8] [0 3 6] [1 4 7] [2 5 8] [0 4 8] [2 4 6] ]
  ; For every line, we check whether it is non-empty and filled by the same player
  foreach L [
    lin ->
    let val map [x -> (item x c)] lin
    if first val != 0 and first val = last val and first val = (first bf val) [
      ifelse first val = p [report 1] [report 0]
    ]
  ]
  ; if there is no winning line and the board is full, then it is a draw
  if empty? get-rules s [report 0.5]
  report false ; the game is not over yet
end
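A few quick checks of these procedures on a concrete (hypothetical) position help to see the encoding at work:

```
; let s (list [1 2 0 0 1 0 0 0 0] 1)   ; player 1 has just played at p4
; get-rules s  ->  [2 3 5 6 7 8]       ; the empty places on the board
; apply 8 s    ->  [[1 2 0 0 1 0 0 0 2] 2]          ; player 2 answers at p8
; get-result (list [1 1 1 2 2 0 0 0 0] 1) 1  ->  1  ; top row: player 1 wins
```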

In order to make an interactive interface for the game, we will use a world of $3\times 3$ patches to store the board, and the pieces will be a breed of turtles:

breed [pieces piece]

patches-own [
  value ; to store the piece (0/1/2) of this place
]

We need an auxiliary procedure to transform between the states (lists) and patches-board:

to-report board-to-state
  report map [x -> [value] of x] (sort patches)
end

Then, a procedure to start a new game:

to start
  ca
  ask patches [set value 0]
  ask patches with [(pxcor + pycor) mod 2 = 0] [set pcolor grey]
end

and a forever procedure (that is, a procedure executed from a forever button) to catch the interaction of the user (with the mouse) and to call the actions and the UCT algorithm when needed (when the AI plays):

to go
  let played? false
  if mouse-down? [
    ask patch mouse-xcor mouse-ycor [
      if not any? pieces-here [
        set value 1
        sprout-pieces 1 [
          set shape "o"
          set color white
        ]
        set played? true
      ]
    ]
    if get-result (list (board-to-state) 1) 1 = 1 [
      user-message "You win!!!"
      stop
    ]
    if get-result (list (board-to-state) 2) 2 = 0.5 [
      user-message "Draw!!!"
      stop
    ]
    wait .1
    if played? [
      let m UCT (list (board-to-state) 1) Max_iterations
      ask (item m (sort patches)) [
        set value 2
        sprout-pieces 1 [
          set shape "x"
          set color white
        ]
      ]
    ]
    if get-result (list (board-to-state) 2) 2 = 1 [
      user-message "I win!!!"
      stop
    ]
    if get-result (list (board-to-state) 2) 2 = 0.5 [
      user-message "Draw!!!"
      stop
    ]
  ]
end