Updated: Jun 8, 2021
This is the 5th and final instalment of my series on reinforcement learning. This post will be a lot longer than previous posts; as the 4th use case, by this time I'd finally learned enough to build a functional RL agent. Let's dive in.
In order to learn more about the flavors of reinforcement learning we'll be using in this and subsequent posts, start with Part one of this blog post series. To get your RL environment set up to run some examples, see Part two. Part three looked at a use case involving the game Tetris. Part four discussed the failures I had in training an agent to play NHL94 on Sega Genesis and F-Zero on Super Nintendo.
Super Mario Kart was the first game I owned on my Super Nintendo back in the 90's. I loved it and must have sunk thousands of hours into playing it at this point. I wanted to look at this as a use case after having minor success with F-Zero for a few reasons:
Racing games have easily defined (in theory), non-sparse reward functions based on speed and progress around the track.
Well-documented RAM maps that would make it easy for me to extract variables of interest.
Existing solutions that I could benchmark my progress against.
I love Super Mario Kart.
Having learned my lesson about difficulty in extracting variables with NHL94, I started this use case by researching whether there were documented RAM maps available. I turned up two awesome resources here and here. They helped me extract more than a dozen variables, outlined below.
Of particular relevance:
checkpoint: used to track progress around the track
flow: used to detect when the kart is driving backwards
field_position: represents rank
speed: speed of player kart
position_east/south: pixel position of player kart, relative to top left
surface_type_code: used to detect when kart was on-road vs. off-road
Once I had these variables I created a reward function that rewarded speed, checkpoint progression and staying on the track, while penalizing straying from the track and heavily penalizing falling off the track. Note that the checkpoint progression reward also worked in reverse, penalizing the kart for driving backwards.
Before even training anything I wanted to do some benchmarking. Using my reward function as defined, I manually ran a handful of races on Mario Circuit 1, 50cc using Koopa as my character, in GP mode. I was able to achieve an average normalized reward score of about 190 for the race, which I used as an upper limit on what I could realistically expect from the agent. As a lower limit I trained "The Brute", an algorithm that ships with OpenAI Baselines.
The Brute is an algorithm that works but building up a tree of button presses that do well in the game, without even looking at the state representation (ie. the screen). It follows the best-found-so-far sequence of actions, introducing random perturbations to explore and find new, better paths. It is completely deterministic and very fragile, in that if you change the context even slightly the Brute will fail. You can read more about the Brute in this paper, but the relevant section is extracted below.
I set up the Brute and left it training overnight, for a total of about 16 hours. It was able to "learn" to complete about 3.5 laps of the Mario Circuit 1 course in GP mode as Koopa. It achieved a normalized reward score of about 91. Not exactly setting the world on fire, but useful as a lower-bound baseline. If I could get to a place where my agent performed somewhere between these two bounds I would consider this a success.
With variables extracted, reward function defined and baseline performance established, I set out to train an agent that achieved the following goals
Approximate human-level performance on Mario Circuit 1, 50cc, GP mode using both Luigi and Koopa as characters
Generalizes to other courses/characters
Most of the time was spent on criteria 1, but I made some progress on criteria 2. After some experimentation, I settled on PPO2 (read more here) as the algorithm because it has some nice properties for this application.
Allows use of convolutional neural network (CNN) policy allowing screen captures from the emulator to be used as state representation
Allows for multiple threads to be used in training, speeding up training time considerably.
I defined a done condition (the criteria for which the simulation is terminated) as any of the following:
Kart got stuck, defined by no checkpoint progress for 5 seconds or more.
I realized after the fact that the first criteria limits the ability of the agent to learn to recover from setbacks, but baby steps. So with all of this in place I set up the PPO2 algorithm with my data and let it train for a day or so. Live look at the results.
Not really, but the results were basically equally bad. It didn't even come close to the Brute benchmark, and in a lot of cases simply learned to drive in circles while hopping continuously. Some of the things I looked into solving related to the poor performance are summarized below.
Complex State Representation
The default CNN policy uses a state representation that is a snapshot of the game screen from the emulator. This is a large, complex input; 224 x 256 pixels. Example below of the image itself. Note that the bottom half of the screen essentially contains redundant information. Initially I truncated this section and retrained, but it was still taking a long time to converge.
Complexity of Action Space
On a standard SNES controller there are 12 buttons, resulting in 12! combinations of inputs at any given time. To address this I implemented a Discretizer that defined only 3 valid actions as accelerate (B), Accelerate while turning left (B + Left) and accelerate while turning right (B + Right). Braking is for losers anyway. This helped A LOT on subsequent runs, as it exponentially decreased the search space for experimentation with actions.
Bugs in Reward Scaling
The PPO algorithm is very sensitive to poorly scaled rewards and there was a bug in my code that meant the scaling wasn't being done correctly. Fixing this improved things markedly as well.
Reward Function Unstable/Non-Estimable
This was a concern, but it didn't end up being a reason for poor performance.
Some Googling uncovered the fact that the hyperparameter settings I was using were poorly configured for retro video gaming use cases. Once I substituted recommended values from the literature things improved again.
Once I addressed the issues with the action space, the reward scaling and the hyperparameter optimization, I was getting increases in performance, but the agent was still tending to get stuck in local optimum and wasn't converging as quickly as I would have liked.
This led me to believe that the complex state representation was still causing convergence issues. While I would have liked to simply let it train indefinitely, I wanted to try and get some results in a timely manner.
It was at this point I embarked on a lengthy and ill-advised quest to derive a simplistic state representation that would, in theory, generalize well across different contexts and tracks. I went ahead and implemented my own RetroEnv environment for Super Mario Kart that did a few custom things. I cannot understate how much work this was despite lifting a good chunk of the logic from this project that I found.
Read the track tilemap from RAM to get a 128 x 128 overhead representation of the track
Do a physics lookup based on the value of each tile
Create a 20 x 20 tile representation of the immediate vicinity around the player kart
Convert to grayscale image
Overlay player kart in red, enemy karts in blue
Feed this image to the CNN as the state representation rather than the full screen image
Effectively, it maps what the agent "sees" from the full screen image of the game to a low-res representation of the state immediately around the kart as below, including rotational and directional transformations.
This approach could be considered "cheating", in the sense that the learning algorithm has access to a state representation that isn't consistent with what a player sees, however in an effort to train a performant agent this was a tradeoff I was comfortable making. A bonus of this implementation is that it should generalize quite well to new domains and tracks, as the logic creates a uniform greyscale representation of the system state regardless of track type (grass, lava, ghost house etc) or layout.
Using the Koopa-trained model to complete the track generates an average normalized total reward value of about 160, and is capable of regularly finishing in the top 3. The Luigi-trained model fared slightly better, averaging about 170 total normalized reward and routinely finishing first. Both of these results were quite impressive when compared to my decades of experience that generated just 190 normalized reward score. See sample runs below.
The interesting part of this to me was the nuance that the agents learned when racing using the different drivers. There is a section on Mario Circuit 1 where the racer can elect to cut the corner, driving across the dirt and taking a shortcut. See below for details.
The tradeoff here is that heavier characters (Mario, Luigi, Bowser, DK) slow down significantly while off the track, making this shortcut a bad move. Lighter characters (Koopa, Toad, Peach, Yoshi) don't lose as much speed making it a viable option. The agents learned this tradeoff and raced accordingly, with Luigi-bot electing to stay on the track and Koopa-bot taking the shortcut. Fascinating stuff.
Final Conclusions and Thoughts
This project was very cool. I learned a lot and I think there's a bunch of extensions I could work on here that would keep me occupied for months. I guess the TL;DR version of my takeaway is this:
Reinforcement Learning Is Hard
Feel free to read more here, but some highlights for me are as follows.
It is horribly sample inefficient
A study concluded that most algorithms typically required at least 83 hours of play experience, plus however long it takes to train the model in order to achieve near-human level performance on Atari benchmarks. I feel like I short-circuited some of this by cheating a little bit with the simplified state representation, but this is still crazy-inefficient.
Many problems are better solved by other means if all you care about is performance
We saw this with Tetris and then again with the Brute algorithm. Why go through the pain of training a model-free RL algorithm when sometimes deterministic/brute-force methods can achieve equivalent or greater performance?
If you don’t care about generalizability just implement infinite monkeys.
Crafting reward function to capture exactly what you want is art, not science
A poorly-formed reward function can scuttle your entire project and routinely does. The stories of reward-farming, where an agent repeats certain behaviours infinitely in order to maximize reward are equal parts frustrating and hilarious. My agent that simply drove in circles was exploiting the fact that my early reward functions rewarded speed and didn’t penalize harshly enough for straying from the track, so it’s solution was to drive in an infinite loop, farming speed rewards for eternity
Even given a good reward, local optima can be hard to escape
I saw this during training – even with a well-defined reward and a generalizable state representation, training would routinely get stuck in local optima. Beyond hyperparameter voodoo it seems difficult to avoid this issue