Pommerman: post mortem

Last month I spent a few days, together with some friends, trying to get an entry together for the Pommerman competition at this year’s NeurIPS.

While we learnt a huge amount, we didn’t manage to get an entry together in time for the conference.

All of us were pretty new to reinforcement learning, so maybe it’s not hugely surprising that we didn’t succeed. Still, I think if we’d done things differently we may have got there in time.

Some things we managed to achieve:

Get the game running, and set up some basic reinforcement learning agents (specifically, DQN and PPO) that could play the game.
Set up a training environment on a cloud server, to which we could deploy any number of training configs and have them run 16 at a time.
Set up Tensorboard logging for rewards, wins, and various intermediate metrics (% frequency for each action, # of survivors in each game, etc)
Train hundreds of PPO and DQN agents with different hyperparameters and network architectures.
Set up a validation environment that outputs performance stats for trained agents acting deterministically.
Experiment with experience replay, different types of exploration, CNNs, dropout, and simplified features.
Create different reward models intended to guide the agent to various strategies.

Despite all this, we didn’t manage to train an agent which figured out how to bomb through walls and find opponents. Our most successful agents would (mostly) avoid bombs near them, but otherwise be static.

What mistakes did we make?

We underestimated the difficulty of the problem. We figured we could just set some stock algorithms running on the environment and they’d figure out a basic strategy which we could then iterate on, but this wasn’t the case.
We committed fairly early on to a library (TensorForce) that we hadn’t used before without checking how good it was, or how easy it would be to change things. So when we realised, more than halfway into the project, that we really needed to get our agents to explore more, it was really hard for us to try to debug exploration and implement new techniques.
We spent a lot of time setting up a cloud GPU environment, which we ended up not needing! The networks we were training were so small that it was faster to just run parallel CPU threads.
We didn’t try to reduce complexity or stochasticity early enough, so we didn’t really know why our agents weren’t learning
We (I) introduced a few very frustrating bugs! The highlight was a bug where I featurised the board for our agent, and accidentally changed the board array that all agents (and the display engine) shared. This bug manifested itself as our agent’s icon suddenly changing, and took me hours to debug.

Knowing what we know now, how would we have approached this problem?

Simplify the environment – start with a smaller version of the problem (e.g. 4×4 static grid, one other agent) with deterministic rules. If we can’t learn this then there’s probably no point continuing!
Simplify the agent to the extent where we fully understand everything that’s happening – for example, write a basic DQN agent from scratch. This would’ve made it easier to add different exploration strategies.
Gradually increase complexity, by increasing the grid size or stochasticity.
Add unit tests!

Despite our lack of success, we all learnt a lot and we’ll hopefully be back for another competition!

Harald's blog

Pommerman: post mortem

Leave a comment Cancel reply

Related

Leave a comment Cancel reply