Pommerman: post mortem

Last month I spent a few days, together with some friends, trying to get an entry together for the Pommerman competition at this year’s NeurIPS.

While we learnt a huge amount, we didn’t manage to get an entry together in time for the conference.

All of us were pretty new to reinforcement learning, so maybe it’s not hugely surprising that we didn’t succeed. Still, I think if we’d done things differently we may have got there in time.

Some things we managed to achieve:

  • Get the game running, and set up some basic reinforcement learning agents (specifically, DQN and PPO) that could play the game.
  • Set up a training environment on a cloud server, to which we could deploy any number of training configs and have them run 16 at a time.
  • Set up Tensorboard logging for rewards, wins, and various intermediate metrics (% frequency for each action, # of survivors in each game, etc)
  • Train hundreds of PPO and DQN agents with different hyperparameters and network architectures.
  • Set up a validation environment that outputs performance stats for trained agents acting deterministically.
  • Experiment with experience replay, different types of exploration, CNNs, dropout, and simplified features.
  • Create different reward models intended to guide the agent to various strategies.

Despite all this, we didn’t manage to train an agent which figured out how to bomb through walls and find opponents. Our most successful agents would (mostly) avoid bombs near them, but otherwise be static.

What mistakes did we make?

  • We underestimated the difficulty of the problem. We figured we could just set some stock algorithms running on the environment and they’d figure out a basic strategy which we could then iterate on, but this wasn’t the case.
  • We committed fairly early on to a library (TensorForce) that we hadn’t used before without checking how good it was, or how easy it would be to change things. So when we realised, more than halfway into the project, that we really needed to get our agents to explore more, it was really hard for us to try to debug exploration and implement new techniques.
  • We spent a lot of time setting up a cloud GPU environment, which we ended up not needing! The networks we were training were so small that it was faster to just run parallel CPU threads.
  • We didn’t try to reduce complexity or stochasticity early enough, so we didn’t really know why our agents weren’t learning
  • We (I) introduced a few very frustrating bugs! The highlight was a bug where I featurised the board for our agent, and accidentally changed the board array that all agents (and the display engine) shared. This bug manifested itself as our agent’s icon suddenly changing, and took me hours to debug.

Knowing what we know now, how would we have approached this problem?

  • Simplify the environment – start with a smaller version of the problem (e.g. 4×4 static grid, one other agent) with deterministic rules. If we can’t learn this then there’s probably no point continuing!
  • Simplify the agent to the extent where we fully understand everything that’s happening – for example, write a basic DQN agent from scratch. This would’ve made it easier to add different exploration strategies.
  • Gradually increase complexity, by increasing the grid size or stochasticity.
  • Add unit tests!

Despite our lack of success, we all learnt a lot and we’ll hopefully be back for another competition!

Pommerman: getting started

Last weekend we spent a day taking our first steps towards building a Pommerman agent.

In addition to a full game simulation environment, the team running the competition were kind enough to provide helpful documentation and some great examples to help people get started.

There are a few particularly useful things included:

  • A few example implementations of agents. One just takes random actions, another is heuristic based, and a third uses a tensorforce implementation of PPO to learn to play the game.
  • A Jupyter notebook with a few examples including a step-by-step explanation of the tensorforce PPO agent implementation. (this is probably the best place to start)
  • A visual rendering of each game simulation.

Before we get anywhere, we hit a few small stumbling blocks.

  • It took us a few attempts, installing different versions of Python, before we got TensorFlow running. Now we know that TensorFlow doesn’t support Python 3.7, or any 32-bit versions of Python.
  • The tensorforce library, which the included PPO example is based on, has been changing rapidly. Some of the calls to this library no longer worked. While the code change required was minimal, it took at least an hour of digging through tensorforce code before we knew what exactly needed to be changed. We committed a small fix to the notebook here, which now works with version 0.4.3 of tensorforce, available through pip. (I wouldn’t recommend using the latest version of tensorforce on GitHub as we encountered a few bugs when trying that)

I was hoping we’d get to an agent which could beat the heuristics-based SimpleAgent at FFA, but we didn’t manage to get there. In the end, we managed to:

  • Get the Jupyter notebook with examples running
  • Understand how the basic tensorforce PPO agent works
  • Set up a validation mechanism for running multiple episodes with different ages, and save each game so we can replay it for debugging purposes.
  • Train a tensorforce PPO agent (though it was technically training, we didn’t actually manage to get it to beat the SimpleAgent in any games yet)

To be continued…

Pommerman: relevant research

As part of the NIPS 2018 Pommerman challenge, we’ll have to build bots that are able to plan and cooperate against a common enemy. The challenge docs include some links to relevant research, which I’m aiming to summarise here.

I’ve broken the papers into three sections:

  1. Planning – the fundamental skill of coming up with a strategy and choosing actions that maximise the probability of winning. The field of reinforcement learning has a wealth of approaches for this.
  2. Cooperation – planning in the presence of other agents with the same goal and possibly known architecture/behaviour.
  3. Opponent modelling – planning in the presence of other agents with opposing goals and unknown behaviour.

Planning/reinforcement learning

Proximal Policy Optimisation (PPO) (2017) is a type of reinforcement learning technique developed by OpenAI that appears to be better at generalising to new tasks than older reinforcement learning techniques, and requires less hyperparameter tuning. (in contrast, techniques like DQN can perform very well once adapted to a problem, but will be useless unless the right hyperparameters are chosen)

Monte Carlo Tree Search (2012) gives an extensive overview of Monte Carlo Tree Search (MCTS) methods in various domains, as well as describing extensions for multi-player scenarios. MCTS is a method for building a reduced decision tree, selectively looking multiple moves ahead before deciding on an action.

Monte Carlo Tree Search and Reinforcement Learning (2017) reviews methods combining MCTS and other reinforcement learning techniques. The biggest success story so far is DeepMind’s AlphaGo, which managed to beat all previous Go playing algorithms as well as the best human players, for the first time ever, by combining MCTS with deep neural networks.

Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016) builds on Fictitious Self-Play strategies introduced in this paper, and introduces Neural Fictitious Self-Play for learning competitive strategies in imperfect-information games such as poker, where DQN does not reliably converge.

Cooperation/multi-agent learning

Multi-Agent DDPG is a technique developed by OpenAI, based on the Deep Deterministic Policy Gradient technique, where agents learn a centralised critic based on the observations and actions of all agents. The researchers found this technique to outperform traditional RL algorithms (DQN/DDPG/TRPO) on various multi-agent environments.

Cooperative Multi-Agent Learning (2005) is an overview of multi-agent learning approaches. At the highest level, it distinguishes between team learning (one learning process for the entire team) and concurrent learning (multiple concurrent learning processes).

Opponent modelling

Opponent Modeling in Deep Reinforcement Learning (2016) builds on DQN to model opponents through a Deep Reinforcement Opponent Network (DRON).

Machine Theory of Mind (2018) is a recent paper developing a system for learning to model other agents in gridworld environments, by predicting their behaviour through observation.

Coordinated Multi-Agent Imitation Learning (2018) looks at inferring the roles of other players in environments such as team sports to improve prediction of their behaviour.

Autonomous Agents Modelling Other Agents (2018) is a comprehensive survey of methods used across the machine learning literature for modelling other agents’ actions, goals, and beliefs.

Multi-agent learning with Pommerman

Together with James and Henry, I’m going to try to build two bots and enter them in the team Bomberman competition, which takes place at the beginning of December.

In a test of multi-agent learning, the two bots will face off against other bots, who they’ll try to blow up with bombs while avoiding being blown up themselves.

Our plan is:

  1. Get the basic Pommerman environment running on our laptops.
  2. Understand how the game and example agents work.
  3. Set up a way to run lots of iterations of competitions between various agents.
  4. Improve the example agents with more advanced heuristics-based play.
  5. Try out some techniques from the multi-agent learning literature, and see if we can systematically beat our heuristics-based agents.
  6. ???
  7. Submit our best team of two agents, and compete against other teams live at NIPS 2018.

Progress so far: environment installed. Example agents running. Next up: understand how they work.

Will we manage to build any agents that beat the example agents? Will our agents perform as expected on match day, or crash and freeze in live play? Will we win enough games to make it on to the leaderboard and win one of the prizes? To be continued…