As part of the NIPS 2018 Pommerman challenge, we’ll have to build bots that are able to plan and cooperate against a common enemy. The challenge docs include some links to relevant research, which I’m aiming to summarise here.
I’ve broken the papers into three sections:
- Planning – the fundamental skill of coming up with a strategy and choosing actions that maximise the probability of winning. The field of reinforcement learning has a wealth of approaches for this.
- Cooperation – planning in the presence of other agents with the same goal and possibly known architecture/behaviour.
- Opponent modelling – planning in the presence of other agents with opposing goals and unknown behaviour.
Proximal Policy Optimisation (PPO) (2017) is a reinforcement learning technique developed by OpenAI that appears to generalise to new tasks better than older techniques, and to need less hyperparameter tuning. In contrast, techniques like DQN can perform very well once adapted to a problem, but tend to perform poorly unless the right hyperparameters are chosen.
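To make the idea concrete, here is a minimal sketch of the clipped surrogate objective at the heart of PPO, for a single sample. The function name and the default `epsilon=0.2` are my own choices for illustration; a real implementation would average this over a batch and add value and entropy terms.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective (a quantity to maximise).

    ratio     -- new_policy_prob / old_policy_prob for the action taken
    advantage -- estimated advantage of that action
    epsilon   -- clip range (0.2 is an assumed, commonly used default)
    """
    # Keep the probability ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to move the policy far
    # from the old one in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clipping is what keeps each policy update small, which is one plausible reason the method is less sensitive to step-size tuning.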
Monte Carlo Tree Search (2012) gives an extensive overview of Monte Carlo Tree Search (MCTS) methods in various domains, and describes extensions for multi-player scenarios. MCTS builds a partial search tree by sampling, selectively expanding the most promising lines of play several moves ahead before deciding on an action.
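The "selective" part usually comes down to the rule for picking which child node to explore next. Below is a small sketch of UCT (UCB1 applied to trees), one common selection rule. The data layout (a dict of action to `(total_reward, visits)`) and the function names are assumptions for illustration only.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score for one child: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """children maps action -> (total_reward, visits); returns the best action."""
    return max(children, key=lambda a: uct_score(*children[a], parent_visits))
```

For example, `select_child({"left": (3.0, 5), "right": (0.0, 0)}, 5)` picks `"right"`, because an unvisited move always gets tried before revisiting known ones.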
Monte Carlo Tree Search and Reinforcement Learning (2017) reviews methods that combine MCTS with other reinforcement learning techniques. The biggest success story so far is DeepMind’s AlphaGo, which combined MCTS with deep neural networks to beat all previous Go-playing programs and, for the first time ever, the best human players.
Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016) builds on the Fictitious Self-Play approach introduced in this paper, and proposes Neural Fictitious Self-Play for learning competitive strategies in imperfect-information games such as poker, where DQN does not reliably converge.
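For intuition, here is a sketch of classical fictitious play, the much older idea these methods descend from, on matching pennies. This is not the paper's neural method: it is a tabular toy in which each player simply best-responds to the other's empirical action frequencies. The function name, the payoff matrix and the starting counts are all my own assumptions.

```python
def fictitious_play(rounds=5000):
    """Classical fictitious play on matching pennies (two actions each)."""
    # Payoff to the row player; the column player receives the negative.
    payoff = [[1, -1], [-1, 1]]
    row_counts = [1, 1]  # how often each player has played each action,
    col_counts = [1, 1]  # starting from an assumed uniform prior of one each
    for _ in range(rounds):
        # Each player scores its actions against the opponent's history.
        row_values = [sum(payoff[a][b] * col_counts[b] for b in range(2))
                      for a in range(2)]
        col_values = [sum(-payoff[a][b] * row_counts[a] for a in range(2))
                      for b in range(2)]
        # Best response (ties go to the first action).
        row_action = row_values.index(max(row_values))
        col_action = col_values.index(max(col_values))
        row_counts[row_action] += 1
        col_counts[col_action] += 1
    total = rounds + 2
    return [c / total for c in row_counts], [c / total for c in col_counts]
```

Although each individual move is deterministic, the long-run action frequencies of both players drift towards the 50/50 mix that neither can exploit.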
Multi-Agent DDPG is a technique developed by OpenAI, based on the Deep Deterministic Policy Gradient technique, where each agent learns a centralised critic that sees the observations and actions of all agents during training, while its policy still acts on its own observations only. The researchers found this technique to outperform traditional RL algorithms (DQN/DDPG/TRPO) on various multi-agent environments.
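The key structural difference is what each network gets to see. Here is a rough sketch, with plain lists standing in for tensors and hypothetical function names, of how the inputs to an agent's policy and to its centralised critic might be assembled.

```python
def actor_input(observations, agent_index):
    """The policy (actor) only sees the agent's own observation."""
    return list(observations[agent_index])

def critic_input(observations, actions):
    """The centralised critic sees every agent's observation and action."""
    joint = []
    for obs in observations:
        joint.extend(obs)
    for act in actions:
        joint.extend(act)
    return joint

# Two agents, each with a 3-number observation and a 2-number action.
observations = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
actions = [[1.0, 0.0], [0.0, 1.0]]
policy_features = actor_input(observations, 0)        # length 3
critic_features = critic_input(observations, actions)  # length 10
```

Since the critic is only needed during training, the extra information does not have to be available when the bots actually play.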
Cooperative Multi-Agent Learning (2005) is an overview of multi-agent learning approaches. At the highest level, it distinguishes between team learning (one learning process for the entire team) and concurrent learning (multiple concurrent learning processes).
Opponent Modeling in Deep Reinforcement Learning (2016) builds on DQN to model opponents through a Deep Reinforcement Opponent Network (DRON).
Machine Theory of Mind (2018) is a recent paper developing a system for learning to model other agents in gridworld environments, by predicting their behaviour through observation.
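As a much simpler baseline for "predicting behaviour through observation", here is a count-based opponent model. It is my own stand-in, not the neural approach from either paper above: it just tallies which action an opponent took in each observed state and predicts with add-one smoothing. The class name, the state labels and the action labels are hypothetical.

```python
from collections import Counter, defaultdict

class FrequencyOpponentModel:
    """Predicts an opponent's next action from how often it was seen before."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.counts = defaultdict(Counter)  # state -> Counter of actions

    def observe(self, state, action):
        """Record that the opponent took `action` in `state`."""
        self.counts[state][action] += 1

    def predict(self, state):
        """Return a probability for each action, using add-one smoothing."""
        seen = self.counts.get(state, Counter())
        total = sum(seen.values()) + len(self.actions)
        return {a: (seen[a] + 1) / total for a in self.actions}

model = FrequencyOpponentModel(["bomb", "move"])
for _ in range(3):
    model.observe("near_enemy", "bomb")
prediction = model.predict("near_enemy")
```

A model like this needs a hashable, fairly coarse description of the state to be useful, which is exactly the limitation the learned approaches try to remove.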
Coordinated Multi-Agent Imitation Learning (2018) looks at inferring the roles of other players in environments such as team sports to improve prediction of their behaviour.
Autonomous Agents Modelling Other Agents (2018) is a comprehensive survey of methods used across the machine learning literature for modelling other agents’ actions, goals, and beliefs.