
Monte Carlo Tree Search in NetLogo

Last modified: August 8, 2019


This post is the continuation of a previous post (in Spanish) about the fundamentals of the Monte Carlo Tree Search method for solving adversarial games. I recommend starting with it (or any other source where this background is explained) before reading this one.

You can find all the code in the Adversarial Search folder of the NetLogo-IA project.

To implement Monte Carlo Tree Search (MCTS) in NetLogo we will need two main structures: states (together with rules, or actions) that reflect every configuration of the game/process, and nodes that will allow us to build the game tree.

Although nodes will have a very specific structure, because they apply the MCTS algorithm directly, states have only a few requirements, and it is the game itself that will decide their inner structure. Indeed, for this implementation we will prepare only a kind of API that provides methods to access, modify and create states (we will need one more structure to manage rules/actions):

  • state: will be any structure that works with this interface:
get-content s         : returns the content of state s (the game configuration)
get-playerJustMoved s : returns the player (1 or 2) that has generated state s
create-state c p      : returns a state with content c and player p
  • rules: structure that allows to execute actions on states. It works with this interface:
apply r s      : returns a new state, the result of applying rule r to state s
get-rules s    : returns a list with the rules applicable to state s
get-result s p : returns the result of the game in state s from the point of view of player p

We will show some examples of how to correctly define these structures at the end of this post, when building real game players from the MCTS algorithm.

The data structure that we will use to store the nodes of the tree built by MCTS algorithm will be:

  • MCTSnode: agents that store the nodes of the MCTS tree during the execution of the algorithm

It has the following inner structure:

state       : state stored in the node
wins        : accumulated reward obtained by simulations through this node
visits      : number of simulated games that pass through this node
untriedRules: rules applicable to the state that have not yet been used to generate children
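Before defining the interface, the agents themselves have to be declared. A minimal sketch of the declarations, using the breed, link breed and field names that appear in the code of this post:

```netlogo
breed [MCTSnodes MCTSnode]
directed-link-breed [MCTSrules MCTSrule]

MCTSnodes-own [
  state        ; state stored in the node
  wins         ; accumulated reward obtained through this node
  visits       ; number of simulated games passing through this node
  untriedRules ; rules not yet used to generate children
]

MCTSrules-own [
  rule         ; the rule/action labelling this edge of the tree
]
```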

And works with the following interface:

create-MCTSnode s : returns a MCTSnode acting as root, with state=s, wins=0, visits=0, untriedRules=all applicable rules

to-report create-MCTSnode [s]
  let rep nobody
  create-MCTSnodes 1 [
    set state s
    set wins 0
    set visits 0
    set untriedRules get-rules s
    set rep self
    set shape "circle"
    set color blue
    set label (word state " W/V:" wins "/" visits)
    ht
  ]
  report rep
end

UCTSelectChild : returns the child of the node with best UCT

to-report UCTSelectChild
  let v visits
  report max-one-of out-link-neighbors [(wins / visits) + sqrt (2 * (ln v) / visits)]
end

As we have used $\sqrt{2}$ as the constant balancing exploration and exploitation, this UCT function will work fine when the rewards (wins) from terminal states lie in $[0,1]$.
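In symbols, UCTSelectChild picks the child $i$ maximizing the usual UCT value:

```latex
UCT(i) = \frac{w_i}{n_i} + C \sqrt{\frac{\ln N}{n_i}}, \qquad C = \sqrt{2}
```

where $w_i$ and $n_i$ are the wins and visits of child $i$, and $N$ is the number of visits of the current node.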

AddChild r : creates and returns a new child by applying the rule r to the current node

to-report AddChild [r]
  let rep nobody
  hatch-MCTSnodes 1 [
    set state apply r state
    set wins 0
    set visits 0
    set untriedRules get-rules state
    create-MCTSrule-from myself [
      set rule r
      set label rule
      hide-link
    ]
    set rep self
    set label state
  ]
  set untriedRules remove r untriedRules
  report rep
end

Update result : updates the values of the node considering a reward of result

to Update [result]
  set visits visits + 1
  set wins wins + result
end

parent : returns the parent of the current node

to-report parent
  report one-of in-MCTSrule-neighbors
end

Once we have defined the auxiliary data structures and the API to manipulate them, we can program the main procedure, UCT, that will return the best action to take from the current state to maximize our reward.

UCT root-state N

From current state (the root node of the tree), repeat $N$ times:

  1. Select a node in the current tree to be expanded
  2. Expand the selected node
  3. Simulate(Rollout) one complete game from the current expanded node
  4. Propagate the result of this game to all the involved nodes in the tree

Finally, it returns the best child of the root according to the results obtained in the simulations.

The implementation in NetLogo can be as follows:

to-report UCT [root-state iter]
  ask MCTSnodes [die]
  ; Create the root node (with the input state)
  let rootnode create-MCTSnode root-state

  ; Main loop for iter iterations
  repeat iter [
    ; Start again from the root node
    let cnode rootnode

    ; Selection stage:
    ; Using UCT selection, we go from the root node to a current leaf that
    ; has to be expanded
    while [empty? [untriedRules] of cnode and any? [my-out-MCTSrules] of cnode]
    [
      set cnode [UCTSelectChild] of cnode
    ]

    ; Expansion stage:
    ; Expand the previous node with one child (any of them),
    ; if we can expand (i.e. the state/node is non-terminal)
    if not empty? [untriedRules] of cnode
    [
      ask cnode [
        let r one-of untriedRules
        set cnode AddChild r
      ]
    ]

    ; Simulation/Rollout stage:
    ; From this new node we play a random complete game (we use only states,
    ; not nodes, because we only need to know the final state and who is the
    ; winner)
    let s [state] of cnode
    while [not empty? get-rules s]
    [
      let r one-of (get-rules s)
      set s (apply r s)
    ]

    ; Backpropagation stage:
    ; Update the nodes from the expanded one to the root
    while [cnode != nobody]
    [
      ask cnode [
        Update get-result s (get-playerJustMoved state)
        set label (word state " W/V:" wins "/" visits)
        set cnode parent
      ]
    ]
  ]

  ; Return the best found rule
  let rep nobody
  ask rootnode [
    ; return the move that was most visited
    set rep max-one-of out-MCTSrule-neighbors [visits]
  ]
  report [rule] of MCTSrule ([who] of rootnode) ([who] of rep)
end

By default, the tree is hidden... you can show it by adding some simple instructions to see how it evolves.
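For example, a minimal sketch (the procedure name and the layout choice are just a suggestion) that reveals the tree built by the last call to UCT (its nodes survive until the next call, which starts by killing them all):

```netlogo
to show-tree
  ask MCTSnodes [ st ]          ; un-hide the nodes
  ask MCTSrules [ show-link ]   ; un-hide the edges
  ; lay the tree out radially around the root (the only node without parent)
  layout-radial MCTSnodes MCTSrules
    (one-of MCTSnodes with [not any? in-MCTSrule-neighbors])
end
```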

Some examples

Nim

Nim is a mathematical game of strategy in which two players take turns removing (i.e., nimming) objects from one heap or pile (there are variants with several heaps). On each turn, a player must remove from one to three objects. The goal of the game is to take the last object.

Although there exists a winning strategy for this game, and it is possible to provide a perfect heuristic to play it, we will prepare the state structure in order to play a blind MCTS:

The states will be given by a list with two numbers:

[content player]

In this case, the content is the number of chips in the heap, and the player will be 1 or 2.

Now, we must prepare the interface to manage correctly these states:

; Get the content of the state
to-report get-content [s]
  report first s
end
; Get the player that generated the state
to-report get-playerJustMoved [s]
  report last s
end
; Create a state from the content and player
to-report create-state [c p]
  report (list c p)
end
; Get the rules applicable to the state (remove 1 to 3 chips,
; never more than are left)
to-report get-rules [s]
  report (range 1 (min (list 4 (1 + get-content s))))
end
; Apply the rule r to the state s
to-report apply [r s]
  let c get-content s
  let p get-playerJustMoved s
  report create-state (c - r) (3 - p)
end
; Result of the (terminal) state s from the point of view of player p:
; the player who took the last chip (the one who generated s) wins
to-report get-result [s p]
  let pl get-playerJustMoved s
  ifelse pl = p [report 1] [report 0]
end

Indeed, it will be very common to use a structure very similar to this one; in many cases we will only have to adapt the procedures get-rules, apply and get-result to the features of our game.

After that, preparing a main procedure to play against the computer is as easy as the following one:

; A procedure that removes n chips (for the human player). After that, it
; tests if the game is over and, otherwise, lets the computer play (with UCT).
to remove-chips [n]
  if n <= num_chips [
    set num_chips num_chips - n
  ]
  ifelse num_chips = 0
  [ user-message "You Win!!!" ]
  [ let m UCT (list num_chips 1) Max_iterations
    user-message (word "I remove " m)
    set num_chips num_chips - m
    if num_chips = 0 [user-message "I Win!!!"]
  ]
end

where num_chips is a global variable storing the current number of chips in the heap, and Max_iterations is the number of iterations that the UCT algorithm is allowed to run while searching for the best move.
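With the interface in place, the algorithm can also be queried directly from the Command Center (a sketch; the exact move returned depends on the random simulations):

```netlogo
; With 10 chips on the heap and the human (player 1) having just moved,
; ask MCTS for the computer's move:
show UCT (list 10 1) 1000
; With enough iterations it tends to rediscover the perfect strategy:
; leave a multiple of 4 chips for the opponent (here, remove 2).
```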

Tic Tac Toe

In this case the states will keep the same structure as the previous example, but storing in the content a list with the values of the board [$p_0$ $p_1$ $p_2$ $p_3$ $p_4$ $p_5$ $p_6$ $p_7$ $p_8$], where $p_i$ can be $0$ (no piece), $1$ (piece from player 1), or $2$ (piece from player 2):

p0 p1 p2
p3 p4 p5
p6 p7 p8

The procedures that we must adapt for this structure are:

; Get the rules applicable to the state
to-report get-rules [s]
  let c get-content s
  ; Filter the empty places of the board
  report filter [x -> item x c = 0] (range 0 9)
end
; Apply the rule r to the state s
to-report apply [r s]
  let c get-content s
  let p get-playerJustMoved s
  ; Fill the r place with the number of the current player
  report create-state (replace-item r c (3 - p)) (3 - p)
end
; Result of the state s from the point of view of player p
to-report get-result [s p]
  let pl get-playerJustMoved s
  let c get-content s
  ; L will have the lines of the board (rows, columns and diagonals)
  let L [[0 1 2] [3 4 5] [6 7 8] [0 3 6] [1 4 7] [2 5 8] [0 4 8] [2 4 6]]
  ; For every line, we see if it is filled by the same player
  ; (the first cell must be non-empty, so empty lines are skipped)
  foreach L [
    lin ->
    let val map [x -> (item x c)] lin
    if (first val != 0) and (first val = last val) and (first val = item 1 val) [
      ifelse first val = p [report 1] [report 0]
    ]
  ]
  ; if there are no winning lines, and the board is full, then it is a draw
  if empty? get-rules s [report 0.5]
  ; otherwise, the game is not over yet
  report false
end

For the interface we use a world of 3×3 patches that will store the different pieces (the pieces will also be turtles):

breed [pieces piece]
patches-own [
  value ; stores the piece (0/1/2) of this place
]

We need an auxiliary procedure to obtain the board (as a list) from the world:

to-report board-to-state
  ; sort patches lists them left to right, top to bottom
  report map [x -> [value] of x] (sort patches)
end

Then, a procedure to start a new game:

to start
  ca
  ask patches [set value 0]
  ask patches with [(pxcor + pycor) mod 2 = 0] [set pcolor grey]
end

and a forever procedure (that is, a procedure executed in a forever button) to catch the move of the user (with the mouse over the world) and call the actions and the UCT algorithm when needed:

to go
  let played? false
  if mouse-down? [
    ask patch mouse-xcor mouse-ycor [
      if not any? pieces-here [
        set value 1
        sprout-pieces 1 [
          set shape "o"
          set color white ]
        set played? true
      ]
    ]
    if get-result (list (board-to-state) 1) 1 = 1 [
      user-message "You win!!!"
      stop
    ]
    if get-result (list (board-to-state) 2) 2 = 0.5 [
      user-message "Draw!!!"
      stop
    ]
    wait .1
    if played? [
      let m UCT (list (board-to-state) 1) Max_iterations
      ask (item m (sort patches)) [
        set value 2
        sprout-pieces 1 [
          set shape "x"
          set color white ]
      ]
    ]
    if get-result (list (board-to-state) 2) 2 = 1 [
      user-message "I win!!!"
      stop
    ]
    if get-result (list (board-to-state) 2) 2 = 0.5 [
      user-message "Draw!!!"
      stop
    ]
  ]
end
