Implementation of a reinforcement learning agent that plays rock-paper-scissors against the user (or the computer) and reports results.
Main idea
Rock-paper-scissors is a game where two players simultaneously choose between three possible objects, each of which beats one of the others in a circular manner, giving three possible results: player A wins, player B wins or a tie. It is very often employed as a decision method between two people due to the fast nature of the game and the assumption that the result is random.
However, it is no secret that humans struggle to generate random sequences, and there is usually some underlying pattern in their string of decisions. This means that a sufficiently intelligent opponent should be able to outsmart a human and win slightly more than 50% of the time. So let's try!
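To make the rules concrete, the circular "beats" relation and the score of a single round can be written down directly. This is a minimal sketch in Python; the names (`ACTIONS`, `BEATS`, `reward`) are illustrative, not taken from the actual code:

```python
ACTIONS = ["rock", "paper", "scissors"]
# Each action beats exactly one other, forming a cycle.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def reward(mine, theirs):
    """Score one round from our perspective: win = 1, tie = 0.5, loss = 0."""
    if mine == theirs:
        return 0.5
    return 1.0 if BEATS[mine] == theirs else 0.0
```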
Adversarial bandits
Multi-Armed Bandits (MABs) are a classic reinforcement learning scenario where the agent (i.e. the AI) repeatedly chooses between multiple actions and, based on the consequences of those actions, tries to maximize the overall reward by making the optimal decision every time.
Adversarial Multi-Armed Bandits are a variant of the original problem where the agent cannot assume that the rewards of each choice follow a fixed distribution; instead, there is an adversary that actively changes the rewards to make the agent fail as much as possible. This makes the agent's task considerably harder, but there are still a few techniques to deal with this problem; here I implemented the Exp3 algorithm by Auer et al.
Note how the rock-paper-scissors game can be interpreted as an adversarial MAB problem: the agent chooses between three actions, while the opponent tries to pick the corresponding counter-action to make it lose.
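As a rough illustration of the idea, here is a minimal sketch of an Exp3-style agent; the structure follows the standard Exp3 algorithm, but the class name and the `gamma` value are my own illustrative choices, not necessarily those of the actual implementation:

```python
import math
import random

class Exp3:
    """Sketch of Exp3 (Auer et al.) for k arms; rewards must lie in [0, 1]."""

    def __init__(self, k=3, gamma=0.1):
        self.k = k
        self.gamma = gamma          # exploration rate
        self.weights = [1.0] * k

    def probabilities(self):
        total = sum(self.weights)
        # Mix the weight distribution with uniform exploration.
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def choose(self):
        return random.choices(range(self.k), weights=self.probabilities())[0]

    def update(self, arm, x):
        # Call right after choose(), with the reward x of the arm just played.
        # The importance-weighted estimate x / p keeps the estimator unbiased
        # even though we only observe the arm we actually pulled.
        p = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * (x / p) / self.k)
```

Each round, the agent calls `choose()` to pick rock, paper or scissors, plays the move, and feeds the observed reward back through `update()`.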
Experiments
The program allows testing different strategies against one another. These are:
- User input (i.e. a human player).
- Always picking the same object, mostly used for debugging. The agent should quickly recognize this behaviour and beat it consistently from then on.
- Random pick. This should be random enough that there is no way to gain any sort of advantage, and the long-term result should be close to 50%.
- Expecting the opponent not to change, e.g. choosing rock if the opponent played scissors in the previous round. This strategy is simple enough that a human opponent should be able to realize what is going on (see the sketch after this list).
- Adversarial MAB algorithm mentioned above.
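The non-learning strategies are simple enough to sketch in a few lines each. Again, this is only illustrative (reusing `ACTIONS`, `BEATS` and `reward` from the first sketch); the function names and the `play` harness are hypothetical, not the program's actual interface:

```python
import random  # ACTIONS, BEATS and reward() come from the first sketch

def constant_strategy(opponent_history):
    """Always play the same object; exists mainly for debugging."""
    return "rock"

def random_strategy(opponent_history):
    """Uniformly random pick; unexploitable in the long run."""
    return random.choice(ACTIONS)

def counter_last_strategy(opponent_history):
    """Assume the opponent repeats their last move and play its counter."""
    if not opponent_history:
        return random.choice(ACTIONS)
    last = opponent_history[-1]
    return next(a for a in ACTIONS if BEATS[a] == last)

def play(strategy_a, strategy_b, rounds=1000):
    """Pit two strategies against each other; return A's average reward."""
    hist_a, hist_b = [], []
    total = 0.0
    for _ in range(rounds):
        a = strategy_a(hist_b)      # each side only sees the opponent's moves
        b = strategy_b(hist_a)
        total += reward(a, b)
        hist_a.append(a)
        hist_b.append(b)
    return total / rounds
```

Under this sketch, `play(counter_last_strategy, random_strategy)` should hover around 0.5, while `play(counter_last_strategy, constant_strategy)` should approach 1 after the first round.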
Does it work?
Nope. Well, kind of.
The adversarial MAB algorithm works as expected, significantly beating the dumb strategy of picking the same object every time and breaking completely even against itself or the truly random player. However, it is not able to beat the fourth strategy (assuming the choice will not change) and ends up choosing all three possibilities evenly.
The fact that the algorithm is not able to figure out the relatively simple fourth strategy indicates that it is not even close to human-level play, let alone beating it. One would need a much more complex technique, such as very deep neural networks, to surpass human-level play and exploit any non-randomness in the sequence of choices.