Hockey is a fantastic sport to watch. It's fast and physical, and the skill of the players involved is mind-blowing. Unfortunately, those same traits make it a very difficult sport to analyze properly, unlike some of its North American brethren such as baseball and football. Analytics in the NFL, NBA and MLB are far more mature, both in the data they track and in the methods they employ to model various aspects of their respective sports. It also helps that those leagues have invested significant money in the infrastructure required to capture data at a level of granularity that lends itself to advanced analytics. In the NHL, the burden of that work has typically fallen to hobbyists and enthusiasts who spend countless hours poring over video footage and manually annotating data so that even the most basic of useful stats can be computed.
It's exciting, then, that the league has announced it will roll out player tracking technology in the NHL in time for the start of next season (2019-2020). A good summary of the history of this initiative and the technology involved can be found here. It looks increasingly likely, however, that the majority of this data will remain proprietary: owned by the league, possibly made available to individual teams, and sold for profit to betting agencies. It looks very unlikely that any of this granular data will make its way into the open-source community. This is disappointing, as the public hockey analytics community has been responsible for many of the innovations in analyzing the game in recent years.
Given these developments, a few months ago I started a project to see how difficult it would be to hack together a system that generates data at this granular level. Using existing camera infrastructure coupled with computer vision and machine learning techniques, I hoped to be able to develop a system that could annotate video footage and output coordinates relative to the rink.
I put together a PowerPoint presentation and condensed it into a 5-minute lightning talk for the 2019 SeaHAC conference. You can view the slides here and my lightning talk here. I'd thoroughly encourage you to watch the excellent content from all of the other presenters, and I'd recommend that anyone in the space attend next year. It was a fantastic event.
In this post I'm going to talk about some of the high-level personal takeaways from the project. For more detail, please check out the slides.
This Stuff is Hard
There's a reason this hasn't been done yet - this stuff is hard! A video-only approach has a host of issues that make it difficult to produce accurate results. Some are constraints of the physical world: for example, a single camera cannot always capture every player on the ice at a given time, and the puck is probably only visible around 60% of the time when using the default side-on broadcast camera. Some issues are native to the sport itself, which is very fluid and dynamic in nature. The technology required to do accurate object detection at scale is still at the bleeding edge of computer vision, and is not freely available to everybody. All of these issues combine to make this a tough problem to crack.
Transfer Learning is Very Powerful
Transfer learning is a technique used to bootstrap the training process by starting from a set of weights generated for the same neural network architecture trained on a not-dissimilar problem. The idea is that those weights will be closer to the optimal weights for your problem than randomly initialized weights would be. This can result in much faster convergence and, in some cases, much smaller data requirements to reach a useful level of accuracy. This article explains it far more clearly than I can. For this project I used weights from Google’s implementation of the MobileNet architecture as trained on the COCO (Common Objects in Context) data set. See also the related white paper.
This allowed the training process to converge in ~3 hours using just 250 images, producing reasonably accurate results. Built from scratch, this model would ordinarily require hundreds of thousands of input images and potentially weeks of processing time to converge.
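To make the warm-start intuition concrete, here is a toy sketch in NumPy. It is not the MobileNet/COCO pipeline used in this project: the tiny linear model, the synthetic "source" and "target" tasks and the helper name `steps_to_converge` are all assumptions invented for illustration. It simply counts how many gradient-descent steps are needed on the target task when starting from weights fitted to a similar task versus from random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def steps_to_converge(w_init, X, y, lr=0.1, tol=1e-3, max_steps=10_000):
    """Gradient descent on mean squared error; return steps taken until loss < tol."""
    w = w_init.copy()
    for step in range(max_steps):
        residual = X @ w - y
        if np.mean(residual ** 2) < tol:
            return step
        w -= lr * (2.0 / len(y)) * (X.T @ residual)
    return max_steps

# Hypothetical "source" task weights, and a "target" task that differs only slightly.
w_source = rng.normal(size=5)
w_target = w_source + 0.05 * rng.normal(size=5)

# A small, noise-free target data set keeps the toy example simple.
X = rng.normal(size=(50, 5))
y = X @ w_target

warm_steps = steps_to_converge(w_source, X, y)             # transferred weights
cold_steps = steps_to_converge(rng.normal(size=5), X, y)   # random initialization

print(f"warm start: {warm_steps} steps, cold start: {cold_steps} steps")
```

A real object detector is vastly more complex, but the mechanism is the same: a starting point near a good solution leaves less distance to cover.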
Data Integrity is Key
Garbage in, garbage out, as they say. Input data integrity is crucial in a project such as this, especially when generating data by hand. After the first training run I noticed that the model was confusing the left and right sides of the ice when detecting reference objects (corners, hash marks, etc.). Upon further inspection, it turned out I had made annotation errors in ~15% of my input images. Because I was mixing up left and right, so was the final model. I promptly wrote tests for my input data to check for this and a host of other potential errors. Test your data, folks.
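As a rough idea of what such tests can look like, the sketch below checks a single annotation in the Pascal VOC XML layout that labelImg writes by default. The label names and the function `check_annotation` are hypothetical (the project's real class names are not listed in this post), and the left/right check assumes labels are defined from the camera's point of view, so it can only catch frames in which both the left and right variant of a reference object are visible.

```python
import xml.etree.ElementTree as ET

# Hypothetical label set, for illustration only.
KNOWN_LABELS = {"skater", "goalie", "puck", "left_blue_line", "right_blue_line"}

def check_annotation(xml_text):
    """Return a list of human-readable problems found in one Pascal VOC annotation."""
    root = ET.fromstring(xml_text)
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    problems = []
    centers = {}  # label -> x-coordinates of box centres

    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        if label not in KNOWN_LABELS:
            problems.append(f"unknown label: {label}")
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            problems.append(f"bad box for {label}: {(xmin, ymin, xmax, ymax)}")
        centers.setdefault(label, []).append((xmin + xmax) / 2)

    # If both variants appear in a frame, "left_*" must sit left of "right_*".
    for label, xs in centers.items():
        if label.startswith("left_"):
            partner = "right_" + label[len("left_"):]
            if partner in centers and max(xs) >= min(centers[partner]):
                problems.append(f"{label} is not left of {partner}")
    return problems

# A made-up annotation with the two blue-line labels swapped.
sample = """<annotation>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object><name>left_blue_line</name>
    <bndbox><xmin>900</xmin><ymin>200</ymin><xmax>960</xmax><ymax>600</ymax></bndbox>
  </object>
  <object><name>right_blue_line</name>
    <bndbox><xmin>300</xmin><ymin>200</ymin><xmax>360</xmax><ymax>600</ymax></bndbox>
  </object>
</annotation>"""

print(check_annotation(sample))  # ['left_blue_line is not left of right_blue_line']
```

Running a check like this over every annotation file before training is cheap compared with discovering the problem after a multi-hour training run.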
On Sampling Bias
I used highlights packages sourced from YouTube to create input data for the model training process. It was easy to source, sample and process. Unfortunately, using only highlights packages presents a form of sampling bias. The nature of highlights meant that the majority of frames sampled were from the offensive/defensive zone rather than rush or neutral-zone game situations. I was worried that detection accuracy might suffer for those under-represented samples, so I made sure to try and manually sample a roughly even number of images in both neutral and offensive/defensive situations when doing the annotations. This of course led to a more concerning type of bias - object class imbalance. Naturally there were more observations of skaters than goalies or the puck for example. Again, I tried to manually up-sample the numbers here, but this meant biasing my inputs further in terms of on-ice location. I think that smoothing object class imbalance would be more important here, but that's based purely on intuition at this point.
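One simple way to put numbers on object class imbalance is to count the annotated boxes per class and derive inverse-frequency weights. The sketch below uses made-up counts, and the helper `inverse_frequency_weights` is hypothetical; it is not something this project is known to have used. The weights only quantify how under-represented each class is, which could guide how many extra examples to sample.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}

# Made-up label counts, for illustration only.
labels = ["skater"] * 80 + ["goalie"] * 15 + ["puck"] * 5
counts = Counter(labels)
weights = inverse_frequency_weights(labels)

for label, n in counts.most_common():
    print(f"{label:>7}: {n:3d} boxes, weight {weights[label]:.2f}")
```

A perfectly balanced data set would give every class a weight of 1.0, so the further a weight drifts above 1.0, the more that class needs attention.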
Coordinate Translation for Under-Determined Systems
This was hard. Taking pixel-space coordinates from the model (illustrated below) and mapping them to overhead rink-space coordinates proved to be very challenging. This is because I lacked enough information about the camera being used to accurately make this translation. I experimented with a host of different approaches here, including homography, vanishing point detection and a couple of plain old-fashioned brute-force approaches. Nothing really generalized well. With more information about camera position, rotation, zoom-level etc. for a given frame, this problem becomes much easier. With access to multiple camera feeds it likely becomes easier still.
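For reference, the homography idea can be sketched in a few lines of NumPy, assuming four landmarks with known rink positions are visible in the frame. The pixel coordinates below are made up, and `fit_homography` and `to_rink` are hypothetical helper names. The sketch solves for a single frame only; reliably finding four good reference points in every frame, as the camera pans and zooms, is the part that did not generalize well.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate the 3x3 homography mapping src points to dst points
    using the direct linear transform (needs at least 4 correspondences)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def to_rink(H, pixel):
    """Map one (x, y) pixel coordinate to rink coordinates."""
    x, y, w = H @ np.array([pixel[0], pixel[1], 1.0])
    return x / w, y / w

# Made-up pixel positions of four landmarks in one frame, paired with the
# corners of the rink's bounding rectangle in feet (an NHL rink is 200 x 85).
pixel_pts = [(100, 600), (1180, 600), (900, 200), (380, 200)]
rink_pts = [(0, 0), (200, 0), (200, 85), (0, 85)]

H = fit_homography(pixel_pts, rink_pts)
x_ft, y_ft = to_rink(H, (640, 350))  # e.g. the bottom-centre of a detected skater box
print(f"rink position: ({x_ft:.1f} ft, {y_ft:.1f} ft)")
```

With exactly four correspondences the fit is exact, so any error in the detected landmarks feeds straight into the mapped positions; more correspondences would let the same least-squares step average some of that noise out.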
This project was really fun to work on. I learned a lot about deep learning, computer vision and transfer learning. I was surprised at the level of accuracy I was able to achieve given the limited size of the input data; the final model I trained used just 1,037 images. With some further development I think this solution might have value beyond an academic exercise. There's a laundry list of things I'd like to tackle next:
1. Further experiments in coordinate translation
2. Improve slow performance for frame-by-frame inference
3. Experiment with different neural network architectures
4. Apply post-processing logic to smooth choppy frame-to-frame object detection
5. Compute accuracy metrics against a test set to properly benchmark model performance
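For the last item, a common building block in object-detection benchmarks is intersection over union (IoU), which scores how well a predicted box overlaps a hand-annotated one. The sketch below is a minimal version, assuming boxes are given as (xmin, ymin, xmax, ymax) tuples in pixels; the function name and the example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so boxes that do not overlap have zero intersection.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
annotated = (5, 5, 15, 15)
print(f"IoU: {iou(predicted, annotated):.3f}")  # IoU: 0.143
```

A detection is typically counted as correct when its IoU with an annotated box of the same class exceeds a chosen threshold, and precision and recall follow from those counts.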
I've had to shelve this project for a while due to work commitments. Hopefully I can find time to tackle some of these things soon!
Again, please feel free to check out the slides here. They go into a lot more detail about specifics than I have in this post. I plan to make most of the code available on my GitHub, once I've improved the quality to something higher than its current "career-threatening" level. This project wouldn't have been possible without a host of online resources, from which I learned many, many things. Here are some of them, in no particular order.
Dat Tran's object detection blog post was what I followed basically start-to-finish
SeaHAC 2019 provided the push I needed to keep going with the project
The labelImg utility, where I spent more time than I care to admit
The Keras image classification tutorial for helping me build a model to filter my dataset
Wyshynski's article on the history of player/puck tracking for context
Google's TensorFlow Object Detection API for making the whole thing possible