1. Intro
In this project, an artificial intelligence agent was developed to compete in the board game Azul in a two-player scenario.
Azul is a zero-sum, turn-taking, multi-player game with perfect information that is mostly
deterministic (the exception being the random restocking of the factory displays at the start of each round).
In Azul, players take turns drafting tiles from the factory displays and placing them on their pattern lines.
At the end of each round, players move tiles from their completed pattern lines onto their walls and score points
(see the official rules for full details).
To build an agent for this game, we developed a game-theoretic
minimax agent that uses a heuristic function to estimate the reward of a given state.
We also explored several other algorithms, including
depth-first search, a shallow greedy search, and Monte Carlo Tree Search.
The following sections introduce the details of each algorithm,
how we built the tournament agent, an evaluation of its performance, and a discussion of our agent within the Azul framework.
2. Attempted Agents
Figure 1. Brief overview of all agents (Greedy, MCTS, and Minimax are introduced in the following sections)
2.1 Simple Greedy Search
As stated in the depth-first search section,
blind search was not a good approach when creating an agent to play Azul due to the enormous state space.
In order to utilise domain knowledge, a simple greedy agent was implemented.
Given a game state and a set of legal moves,
a heuristic function H(s, a) returning a real number was designed,
where s is a game state and a is a legal action. The agent would then play a move via the following policy:
select the legal action a that maximises H(s, a).
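A minimal sketch of this argmax policy is shown below; `heuristic` stands in for H(s, a), and the names used are illustrative rather than the framework's actual API.

```python
def greedy_policy(state, legal_moves, heuristic):
    """Return the legal action a that maximises H(s, a); ties break arbitrarily."""
    return max(legal_moves, key=lambda a: heuristic(state, a))
```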
2.1.1 Heuristic function
This heuristic function attempted to estimate how 'good' a move was to make.
It applied a move a to state s,
used the game state's score function to simulate an end of round from the resulting state,
and returned the player's score, thereby scoring the move a.
The heuristic also took into consideration the number of tiles a player would take from a factory,
scoring moves that take more tiles from a factory higher than moves that take fewer.
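As a rough illustration (not the exact implementation), the heuristic can be sketched as an end-of-round simulation plus a small bonus for the number of tiles taken; `apply`, `score_end_of_round`, `num_tiles_taken`, and the weight are hypothetical stand-ins for the framework's game model.

```python
import copy

def heuristic(state, action, player_id, tile_weight=0.1):
    """Sketch of H(s, a): apply the move on a copy of the state, simulate the
    end-of-round scoring, and add a small bonus for the number of tiles taken."""
    next_state = copy.deepcopy(state)                    # never mutate the live game state
    next_state.apply(action, player_id)                  # play the candidate move (hypothetical call)
    round_score = next_state.score_end_of_round(player_id)  # simulated end-of-round score
    return round_score + tile_weight * action.num_tiles_taken
```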
2.1.2 Results
Over 1000 games, this greedy player (using heuristic v1.0) achieved a win rate of 82.20% against the naive_player.
As a result, this agent was used as the baseline agent for testing the other approaches.
A downside to this approach is that it only considers its current move and does not look ahead to guide its decision.
2.2 Monte Carlo Tree Search (UCT)
2.2.1 State space reasoning
When choosing approaches,
it was important to consider the state and action space of Azul.
Consider just the factory displays: in a two-player game of Azul there are 5 factory displays with up to four tiles on each display.
At the start of each round, 20 tiles are drawn from a tile bag initially containing 100 tiles (20 of each colour) and distributed across the 5 factory displays.
Using the formula for combinations with repetition,
there are 70 different ways to fill a display with 4 tiles drawn from 5 colours.
Choosing 5 factory displays from these 70 fillings (again with repetition) gives 16,108,764 possible initial round states.
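These counts can be reproduced with the multiset coefficient C(n + k - 1, k), ignoring the finite bag as the text does; a quick check in Python:

```python
from math import comb

def multiset(n, k):
    """Combinations with repetition: C(n + k - 1, k)."""
    return comb(n + k - 1, k)

per_display = multiset(5, 4)               # 4 tiles drawn from 5 colours -> 70
initial_states = multiset(per_display, 5)  # 5 displays drawn from those 70 fillings
print(per_display, initial_states)         # 70 16108764
```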
Now consider a single start configuration of an Azul game with empty pattern lines,
and assume that each factory display contains 4 tiles of different colours.
There are then 20 different ways to choose tiles from the factory displays (5 displays x 4 colours).
Since each choice can be placed into one of the 5 pattern lines or the floor line,
there are 6 different ways to play it - therefore there are 120 actions that the player can play.
Suppose that we apply one action and the resulting game state has 80 legal actions to play;
considering just these two moves already gives 120 x 80 = 9600 possible game states.
The state space of Azul therefore explodes very quickly,
and a full tree search spanning multiple rounds would take far too long to compute.
The search technique therefore had to be smart about which moves it chose to explore.
It was decided that a simulation-based Monte Carlo Tree Search would be implemented in order to try to pick a feasible move to make.
Furthermore, the strict evaluation time limit was not a problem for MCTS, as it is an anytime algorithm.
2.2.2 MCTS Implementation
The variant of MCTS implemented was the Upper Confidence Trees (UCT) algorithm (Browne et al., 2012), using UCB1 to balance exploration and exploitation of the game tree.
At the end of a playout, the reward returned was the player's score for that round,
or, if the playout reached the end of the game, a large constant multiple of the difference in score between our player and the opponent.
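For reference, the UCB1 selection rule can be sketched as follows; `node.children`, `child.visits`, and `child.value` are hypothetical attribute names, not the exact ones used in our implementation.

```python
import math

def ucb1_select(node, c=math.sqrt(2)):
    """Select the child maximising the UCB1 score (exploitation + exploration)."""
    def ucb1(child):
        if child.visits == 0:
            return float("inf")                # expand unvisited children first
        exploit = child.value / child.visits   # mean reward observed through this child
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=ucb1)
```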
MCTS was first implemented using full-game rollouts; over 100 games against the naive_player,
a win rate of 38% was achieved.
Due to the stochasticity introduced by the randomisation of the factory displays,
each rollout would have a different factory display configuration, leading to wide variability in the games being played.
There was a very low chance that two rollouts would produce games with the same factory display configurations,
so we believe MCTS was not able to converge to a reasonable reward value.
In order to try and increase the win rate,
rollouts were terminated at the end of the current round, so MCTS would only predict moves based on the current round.
As a result, a win rate of 73% against the naive agent was observed over 100 games.
The performance of MCTS is heavily dependent on the number of simulations it is able to perform.
Our implementation was able to expand between 200-600 nodes/sec; considering the state space of Azul,
MCTS is not exploring enough of the tree.
A significant bottleneck with this implementation of MCTS and the implementation of the game model
is the deepcopying of game states to produce child nodes - a slow process that significantly limits the number of nodes we can expand.
The following table shows the evaluation of our MCTS agent against three agents under different evaluation time limits.
| Evaluation Time | Random Agent | Naive Agent | Greedy Agent |
| --- | --- | --- | --- |
| 200 ms | 100% | 42% | 12% |
| 500 ms | 100% | 63% | 19% |
| 1000 ms | 100% | 73% | 27% |
Table 1. Win rate of MCTS against different agents across various evaluation time limits
MCTS is capable of producing relatively well-informed actions and can beat agents up to the naive player,
but it breaks down when it faces the purely greedy player.
As we increase the evaluation time,
the win rate against the greedy player does increase,
but it remains far from acceptable.
Unlike minimax, MCTS does not assume that the opponent is behaving rationally,
but this alone cannot explain the result.
More likely, the number of states MCTS evaluates is simply not enough for it to make a confident decision about which move to make.
Furthermore, as simulations only run to the end of the round,
the MCTS agent cannot reason about moves beyond the round it is currently playing.
To try to increase the performance of the agent,
a lightweight rollout was then tried,
using the naive_player policy for the opponent agent.
With this rollout, the win rate of the MCTS agent increased to 30%.
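A sketch of this round-limited, lightweight rollout is given below; it assumes random moves for our own side (not specified above), and `is_round_over`, `current_player`, `get_legal_moves`, `apply`, and `round_score` are hypothetical stand-ins for the framework's API.

```python
import random

def light_rollout(state, our_id, naive_policy):
    """Round-limited playout: our moves are random, the opponent is modelled
    by the naive_player policy; the reward is our score for this round."""
    while not state.is_round_over():
        player = state.current_player()
        moves = state.get_legal_moves(player)
        if player == our_id:
            action = random.choice(moves)        # cheap, uninformed playout for our side
        else:
            action = naive_policy(state, moves)  # opponent follows naive_player's policy
        state.apply(action, player)
    return state.round_score(our_id)
```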
2.3 Final Agent - (α-β) Minimax with heavy move pruning and move ordering
2.3.1 Minimax description
Azul is a mostly deterministic (except for round initialisation) two-player adversarial game with perfect information.
To leverage this, a minimax agent was developed.
The minimax algorithm is a game-theoretic algorithm that assumes the opponent is also playing rationally.
Here, playing rationally means that the agent chooses the move which maximises its reward given that the opponent is trying to minimise that reward.
The minimax algorithm searches over the same state space as described in 2.2.1.
As a result, minimax inherently takes the opponent's moves into consideration, unlike the greedy agent we designed.
Pure minimax is an exhaustive tree search algorithm with exponential time complexity and hence is not suitable for Azul.
Instead of building an entire game tree, the state space was limited to the end of the current round,
making each search fully deterministic, since we do not need to consider randomised factory displays in new rounds.
2.3.2 Optimisations
An optimisation technique called alpha-beta pruning (Knuth & Moore, 1975) was implemented --
the main idea is that lower- and upper-bound utilities are propagated to the relevant nodes,
allowing nodes that cannot affect the result to be pruned. In the best case, the time complexity of alpha-beta is O(b^(d/2)),
and in the average case O(b^(3d/4)) (Russell & Norvig, 2002).
The best case corresponds to perfect move ordering (using an oracle to infer the minimax values of states) when exploring the tree.
Furthermore, a depth-cutoff variant was implemented,
whereby if a node was at the depth cutoff, the 'goodness' of that state was estimated using a utility function.
Even with these optimisations and heuristics, the agent was not able to evaluate moves to a depth of 2 within the given time frame.
The following changes were therefore made: first, we ordered the set of actions by scoring each move with the utility function,
and then we greedily took only the top n moves to search at each depth.
This approximates move ordering and aggressively prunes the tree, significantly decreasing the width of each level, similar to a beam search.
The final minimax agent is therefore not a purely optimal agent, as we are greedy in the choice of moves we expand,
thereby discarding a proportion of the search tree in which the true optimal minimax value could lie.
Our model therefore has two hyperparameters: the depth d we want to explore,
and the width w representing the top w moves we expand at any given state.
The width and depth of the agent were determined experimentally,
by choosing the values of w and d that gave the best win rate against the greedy agent.
Our final agent was set to explore to a depth of 4 moves using a maximum move width of 5.
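The combination of alpha-beta, depth cutoff, and greedy width limiting can be sketched as below; `get_legal_moves`, `peek` (returning a successor state), and `is_round_over` are hypothetical stand-ins for the game model, not our exact code.

```python
def alphabeta(state, depth, alpha, beta, maximising, utility, width=5):
    """Depth-limited alpha-beta that only expands the top `width` moves
    at each node, ranked greedily by the utility function (beam-like pruning)."""
    if depth == 0 or state.is_round_over():
        return utility(state)  # estimate the 'goodness' of the state at the cutoff
    # Generate successors, order them by utility, and keep only the best `width`.
    children = [(a, state.peek(a)) for a in state.get_legal_moves()]
    children.sort(key=lambda ac: utility(ac[1]), reverse=maximising)
    children = children[:width]
    if maximising:
        value = float("-inf")
        for _, child in children:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, utility, width))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the minimiser will never allow this branch
        return value
    value = float("inf")
    for _, child in children:
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True, utility, width))
        beta = min(beta, value)
        if beta <= alpha:
            break  # alpha cutoff: the maximiser already has a better option
    return value
```

Under these assumptions, the final agent would correspond to a call such as alphabeta(root, depth=4, alpha=-inf, beta=+inf, maximising=True, utility=H, width=5).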
2.3.3 Utility function
The performance of the minimax agent is largely determined by its utility function.
Initially, the minimax agent reused the heuristic function from the greedy player:
essentially, the utility function scored each state as if it were the final state of the round.
For example, given all-empty wall and pattern lines,
if a player made a move that completed an entire pattern line,
the utility function would return a value of 1, since it would result in placing one tile on the wall.
In this way, the utility function prefers moves that fill an entire pattern line,
and prefers filling lines whose wall placement lands next to existing tiles on the wall.
This utility function served as the baseline utility function for our minimax agent.
2.3.4 Utility function design
Several utility functions were tried and tested for this agent, but all failed to outperform the baseline utility function.
Because the baseline utility function deepcopied the game state, it was inherently slow;
since the search is exponential in time complexity,
this results in an exponential number of deepcopy calls, introducing a major bottleneck for minimax.
An improved version was therefore attempted that
inferred the score of a move without deepcopying the state.
Furthermore, attempts to capture end-game scoring bonuses were also included as extra features.
On average this heuristic function evaluated a game state around 20 times faster than the baseline.
Despite this, it achieved only a 40% win rate against the greedy player, failing to outperform the baseline heuristic, which had a win rate of 63%.
This could be attributed to sub-optimal weight assignments in the function (placing too much weight on the future/bonus reward features rather than the immediate reward).
| Version | Heuristic v1 | Heuristic v2 |
| --- | --- | --- |
| Time (s) | 0.00139 | 0.00006412 |
Table 2. Computation time per state evaluation for each heuristic
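As an illustration of the kind of incremental estimate heuristic v2 aimed for (not the actual code), the standard Azul wall-placement score can be computed directly from a 5x5 boolean wall grid without copying the game state:

```python
def placement_score(wall, row, col):
    """Points for placing a tile at (row, col) on a 5x5 wall, computed in place:
    1 point, or the lengths of the contiguous row and column runs the tile joins."""
    def run_length(dr, dc):
        length, r, c = 0, row + dr, col + dc
        while 0 <= r < 5 and 0 <= c < 5 and wall[r][c]:
            length += 1
            r, c = r + dr, c + dc
        return length
    horizontal = run_length(0, -1) + run_length(0, 1)
    vertical = run_length(-1, 0) + run_length(1, 0)
    score = 0
    if horizontal:
        score += horizontal + 1   # the placed tile counts in the row run
    if vertical:
        score += vertical + 1     # and in the column run
    return score if score else 1  # an isolated tile scores 1 point
```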
2.3.5 Utility function discussion
Designing the utility function for minimax was deemed the most challenging part of the project.
Not only did we have to find strong features for Azul, but the features we used were either hard to express in code,
difficult or slow to compute, or we could not optimise their weights well enough to perform well against the baseline.
Overall, fine-tuning the weights of the minimax utility function was done experimentally by letting the agent play multiple games against the greedy agent.
This was not an efficient way of adjusting utility weights; we could instead have used temporal-difference learning such as TDLeaf (Baxter et al., 1999).
3. Results
Using the baseline utility function,
the following table outlines the win rate of the minimax agent against the other agents, either developed by us or provided with the framework.
| vs. | Random Agent | Naive Agent | Greedy Agent | MCTS (UCT) |
| --- | --- | --- | --- | --- |
| Minimax | 100% | 81% | 63% | 70% |
Table 3. Win rate of the minimax agent against the other agents
Here we can see that this agent outperforms all of the other agents developed so far,
winning the majority of games against each of them; because of this,
it was chosen as the final agent to play the game.
As the table shows,
our final minimax agent was able to beat both the greedy agent and the MCTS agent,
and both the random and naive agents were also comfortably outperformed.
Section 4 outlines the evaluation of our agent against other AI agents and staff-created agents.
4. Tournament
Figure 2. Tournament result (rank 8 out of 64)
References
- Baxter, J., Tridgell, A., & Weaver, L. (1999). TDLeaf(lambda): Combining temporal difference learning with game-tree search. arXiv preprint cs/9901001.
- Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., & Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43.
- Knuth, D. E., & Moore, R. W. (1975). An analysis of alpha-beta pruning. Artificial Intelligence, 6(4), 293-326.
- Russell, S., & Norvig, P. (2002). Artificial Intelligence: A Modern Approach.