Artificial Intelligence (AI) is a concept that can be sometimes difficult to grasp. What is AI? How can I show AI in action? How can I make sure that people understand how the machine actually learns? In this blog post, I will show you how to build a physical machine that can play a game against humans and learn the best moves to make as it plays.
Hexapawn: Chess simplified
Hexapawn is a simplified chess game, where two players go up against each other on a 3-by-3 chessboard with only 3 pawns each. The rules for the pawns are a subset of their moves in chess: they can step forward one space or hit a pawn of the opponent diagonally.
The win conditions for the game are simple: you win by getting a pawn to the other side of the board or by making the enemy unable to move, either by taking all their pawns or blocking their paths.
The game itself is fully deterministic, meaning that everything is determined by the initial state and the actions players choose to take: there is no randomness introduced to the system. A game with dice or shuffled face down cards, for example, will not be deterministic, as there is a random element in them. The lack of randomness makes Hexapawn easy to model and optimize. In deterministic games, there is also always a “solution”, an optimal step in each state of the game.
The game was invented by Martin Gardner, an American mathematician and an advocate for recreational mathematics, who wrote about it in his column in Scientific American.
Building the machine learning model
To construct the machine learning model, we will need 3 basic things: pen and paper (or a printer), matchboxes (24 to be exact) and coloured beads (I will use coloured M&Ms). We will create a matchbox for each possible states in the game and add a different colour for each possible action in that state. Since the game is deterministic, we can calculate or draw out exactly how many states there are where our AI will need to take an action.
We will set one thing in stone: the Human player will always start. This way, we know that our AI will need to make a move in the Second, the Fourth and the Sixth turn. If the AI does not win in turn 6, the human will win in turn 7, so there will never be a turn 8.
When drawing out the states, we have to take into account what the human player does. In turn 2, all the pawns of the machine are on the third row and the human made one of three moves: stepped forward with their pawn on column A, B or C.
Stepping forward in A or C is the same move for our algorithm, only mirrored, so we will consider them one move. Seeing that these two moves are symmetric, we can already make the conclusion that there will be multiple states which are symmetric, so we can use this knowledge to reduce the number of matchboxes (computation time) we need. This nets us with 2 different states in turn 2:
Now we need to consider all the moves our machine can take. It can move forward with either of the unblocked pawns or hit the pawn that stepped too close. Again, mirrored moves count as one, but feel free to draw them out if you want to. We will have a total of 5 options of movement in turn 2: 2 in the first state and 3 in the second state:
Following this process, we can map out turns 4 and 6 as well. A printable state-action space, or what moves are possible in which state, will look like this:
Once you have all the states (or printed the image), stick each one on a matchbox and put the corresponding coloured beads (or M&Ms) in them. Now we are ready to play!
How does the Algorithm work?
Make the first move, and then look at the board and select the matchbox that fits the current state of the board. Shake the box, close your eyes, open the box and take out a bead. Close the box, open your eyes and place the bead on top of the box. Now make the move the colour of the bead corresponds to (e.g.: if in the first state you got a red bead out of the matchbox, you need to step forward with the piece in the B column). Make the next move as a human and repeat the process for the machine.
Once a game ends, if the machine won, mark it down and put all the beads back in their corresponding boxes. If you won, mark it down, and put all but the last bead back in the boxes, and discard the last bead. Keep repeating this process and watch as the machine evolves. If you ever reach a state where the corresponding box is empty, you automatically won. Discard the bead that led to this state. You will see the machine getting better and better to the point of becoming unbeatable.
Why does this work? Well, our Hexapawn Machine uses reinforcement learning to figure out what is the best action to take in any given state. I already wrote about reinforcement learning when discussing the AWS Deepracer. In principle, during the learning process we “punish” the machine each time it makes a mistake by taking a bead away from him, making him less likely to make a mistake. Another possible solution would be to reward the machine whenever it makes a good play by giving it an extra bead of the same colour, making him more likely to choose the action that won before. How long did it take for the machine to get better than you? Can you think of any other ways to reinforce a better play in the machine?