I'm very much a kinaesthetic learner which means that I learn best by getting hands on and trying things, making mistakes and figuring out where I've gone wrong. As such, I figured the best way to learn more was to dive headfirst into a project. I'd seen this Mar/IO video a few years back and was always intrigued about how it worked. The video is fascinating and I'd thoroughly recommend you take a look. They use genetic algorithms to introduce variance into the neural network that takes agent inputs and a simplistic representation of the screen to learn how to best optimize the reward. I'm going to be using Gym-Retro and the stable baselines implementation of the PPO algorithm for my use case in contrast.
In order to learn more about the flavors of reinforcement learning we'll be using in this and subsequent posts, start with Part one of this blog post series. To get your RL environment set up to run some examples, see Part two.
I decided to start small with my project, focusing on one of the most popular games on the planet, that has been ported to basically every device known to man at this point.
I figured that Tetris was a sufficiently simple game that I should be able to implement and train an agent with a minimum of fuss. I was very wrong.
The goal of Tetris could not be more simple. Blocks drop down the screen and the goal is to make and clear lines, which give you points. The more lines you make, the quicker the blocks drop and eventually the screen fills up and you die.
My first attempt at creating an intelligent RL agent for Tetris used a combination of score and lines completed as a reward function. Unfortunately, Tetris exhibits a behavior that is undesirable for model-free RL agents, and that is a sparse reward function. What this means is that there are typically a lot of moves you need to make before a reward signal is received. This makes it difficult for the algorithm to isolate what actions lead to what rewards. As such, the agent learned a bunch of strange behaviors, like rapidly stacking blocks down the left hand side (farming score gain from quick-drop). I'd like to revisit this given everything I've learned in implementing subsequent use cases, but I abandoned this one due to the sparse reward function issue.
What I didn't know is that Tetris as a game has actually been more or less solved in a deterministic way. For every board configuration and new tile there is an optimal move that should be made, based on 4 pieces of information. This blog post highlights everything you'd need to know, but in summary, we want the move that:
Minimizes the maximum aggregate height of blocks already in play
Maximizes the number of completed lines
Minimizes the number of holes in the board
Minimizes the "bumpiness" of the board (differences in height across columns)
I believe that crafting a reward function that incorporates these metrics might yield better results.
The next post documents some of my work in switching gears to NHL94 and F-Zero, with some very mixed results.