I’ve been a huge San Jose Sharks fan since God knows when, and ole Patty has been with us the whole time. Incredible to see him ascend into the 500-goal club the other night. Only 44 other players in NHL history share this honor with him. As a tribute, I thought we’d crunch some numbers about goal scoring today. Sports are a finicky thing, though, and extracting actionable insights will be difficult. Regardless, let’s see what we see.
This is the part I simultaneously love and hate. Touching new data always casts something akin to writer’s block on me. Everything feels so hazy and debilitating. I’ve learned to embrace the process though. Just staring into the exploration space causes things to start resonating and soon enough I’m fueled by the thought of what secrets may be hidden in the data. To speed things up, I usually try simplifying and solving for the obvious first. That seems to help.
For us today, I’ve decided not to bite off too much and just try predicting the number of goals a player will score in a season. Through that process, perhaps we will also gain insights into the anatomy of a goal. Hey, and if we’re lucky, maybe that will be actionable knowledge. Exciting!
- Get data
- Build model
- Generate analysis
For our problem we need skater statistics over many NHL seasons. I’ve found two sources offering this:
We probably could look for CSV exports from these sources, but let’s try scraping it instead as practice. BeautifulSoup won’t be useful for the actual scrape since these are interactive websites. We’ll need Selenium.
I won’t discuss code too much in this post since I want to focus on the analysis, but you can see the code linked below. Basically I’ve created functions to scrape each source and used a loop to grab all the seasons from 2009 to 2017. After scraping, I translated the raw HTML files into CSV files to prepare for consumption in pandas.
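The HTML-to-CSV step can be sketched with just the standard library. This is a minimal stand-in, not the actual conversion code: the function name, the sample markup, and the column layout are assumptions, since the real scrape saved one raw HTML file per season with far more columns.

```python
# Minimal sketch: pull the cells out of an HTML stats table and emit CSV,
# using only the standard library (no BeautifulSoup needed for this part).
from html.parser import HTMLParser
import csv
import io

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

def html_table_to_csv(html_text):
    """Return the first table found in html_text as CSV text."""
    parser = TableParser()
    parser.feed(html_text)
    buf = io.StringIO()
    csv.writer(buf).writerows(parser.rows)
    return buf.getvalue()

# Hypothetical two-column snippet standing in for a scraped stats page.
sample = ("<table><tr><th>Player</th><th>G</th></tr>"
          "<tr><td>Marleau</td><td>27</td></tr></table>")
print(html_table_to_csv(sample))
```

The resulting CSV files then load straight into pandas with `pd.read_csv`.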
In summary, our final data set consists of:
- 7.5 NHL seasons from 2009 - 2017
- 2000+ forwards (Left Wing, Center, Right Wing)
- 45 stats each
Model 1: Linear Regression
To start things off, let’s explore if any features already have linear relationships.
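A correlation scan is one quick way to do this. The sketch below runs on randomly generated stand-in data (the column names mirror stats like shots on goal and faceoffs, but the values and the 0.5 cutoff are assumptions made just so the example runs):

```python
# Scan every feature's correlation with goals and keep the strongly
# linear ones. Data here is synthetic; only the shape of the check
# matches what was done on the real skater stats.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
sog = rng.poisson(150, n)     # shots on goal
ozfo = rng.poisson(300, n)    # offensive zone faceoffs
df = pd.DataFrame({
    "SOG": sog,
    "OZFO": ozfo,
    "TOI": rng.normal(900, 100, n),              # time on ice (no signal here)
    "G": 0.15 * sog + 0.02 * ozfo + rng.normal(0, 2, n),  # goals
})

# Absolute correlation of every feature with goals, strongest first.
corrs = df.corr()["G"].drop("G").abs().sort_values(ascending=False)
linear_features = corrs[corrs > 0.5].index.tolist()
print(linear_features)
```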
Nice, 13 features. This is a good start. They’re all features that make intuitive sense, too.
- Shooting (shots on net, misses, etc.)
- Presence (offensive/neutral/defensive zone faceoffs)
- Pressure (takeaways, giveaways)
- Fatigue (age, time-on-ice)
Since we have so many data points, let’s use a 40-60 test-train split and run ten-fold cross validation on it to generate our linear regression model.
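In scikit-learn terms, that setup looks roughly like this. The data below is synthetic (13 random features standing in for the engineered ones), so the printed metrics won’t match the real numbers:

```python
# Linear regression with a 40-60 test-train split (40% held out) and
# ten-fold cross validation on the training portion.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 13))                  # 13 engineered features
y = X @ rng.normal(size=13) + rng.normal(0, 1, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)        # 40% test, 60% train

model = LinearRegression()
cv_r2 = cross_val_score(model, X_train, y_train, cv=10, scoring="r2")
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"CV R^2: {cv_r2.mean():.3f}  "
      f"test RMSE: {rmse:.3f}  test R^2: {r2_score(y_test, pred):.3f}")
```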
- RMSE 4.10658
- R^2 0.77495
Not too bad for a first shot. Notice that our model underpredicts players who score 40-60 goals per season. Maybe we can improve on that with lasso.
Model 2: Linear Regression with Lasso
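Swapping in lasso is a one-line change. In the sketch below the data is again synthetic, and the `alpha` (regularization strength) is a hypothetical choice; in practice it would be tuned, e.g. with `LassoCV`:

```python
# Lasso regression: same linear model, but with an L1 penalty that can
# shrink unhelpful coefficients all the way to zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 13))
y = X @ rng.normal(size=13) + rng.normal(0, 1, 1000)

lasso = Lasso(alpha=0.1)   # alpha is an assumed value, not a tuned one
scores = cross_val_score(lasso, X, y, cv=10, scoring="r2")
print(f"mean CV R^2: {scores.mean():.3f}")
```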
- RMSE 4.09976
- R^2 0.77570
Hmm. No improvement. Still underpredicting high scorers. Let’s try random forests.
Model 3: Random Forest
Instead of using only the 13 features we initially engineered, perhaps the computer can pick up on signals from the other features. With the random forest technique, the model has more chances to tease out what may or may not be important for a player’s goals per season.
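Concretely, we hand the forest the full feature set and let it rank features itself. The sketch uses synthetic data with 45 columns standing in for the real stats, with signal hidden in only a couple of them:

```python
# Random forest over the full 45-stat feature set; no manual feature
# selection, since the trees pick up useful splits on their own.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 45))                       # all 45 stats
y = X[:, 0] * 2 + np.abs(X[:, 1]) + rng.normal(0, 0.5, 1000)  # nonlinear signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
print(f"test R^2: {r2_score(y_te, rf.predict(X_te)):.3f}")

# The forest also tells us which features it leaned on.
top_feature = int(np.argmax(rf.feature_importances_))
print(f"most important feature index: {top_feature}")
```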
- RMSE 2.70191
- R^2 0.90259
Wow. Significantly better. Those high scoring players though…
Model 4: Gradient Boosting
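Gradient boosting drops in the same way. Synthetic data again, and the hyperparameters below (tree count, learning rate) are assumptions rather than the tuned values behind the reported numbers:

```python
# Gradient boosting: trees fit sequentially, each one correcting the
# residual errors of the ensemble so far.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 45))
y = X[:, 0] * 2 + np.abs(X[:, 1]) + rng.normal(0, 0.5, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=42)
gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                               random_state=0)
gb.fit(X_tr, y_tr)
gb_r2 = r2_score(y_te, gb.predict(X_te))
print(f"test R^2: {gb_r2:.3f}")
```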
- RMSE 1.80720
- R^2 0.95641
Okay. This is looking rather good. It’s still underpredicting high scorers, but otherwise, it looks almost too good to be true. Maybe there is some strong connection between the shooting stats and the goals scored. That passes the sanity check that shooting more equates to more goals. Let’s try taking out the shooting stats and see what happens.
Model 5: Gradient Boosting without Shooting Stats
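Dropping the shooting stats is just a column filter before refitting. The column names in this tiny sketch are hypothetical stand-ins for the real stat headers:

```python
# Remove shooting-related columns, then refit the same model on the rest.
import pandas as pd

df = pd.DataFrame({
    "SOG": [150, 200], "Missed": [40, 60],   # shooting stats (hypothetical names)
    "OZFO": [300, 350], "TK": [30, 40], "G": [20, 30],
})
shooting_cols = ["SOG", "Missed"]
df_no_shooting = df.drop(columns=shooting_cols)
print(list(df_no_shooting.columns))   # ['OZFO', 'TK', 'G']
```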
- RMSE 4.57669
- R^2 0.72049
As expected, the model doesn’t fit as well as earlier. Still quite good, though; nearly as good as our vanilla linear regression. Hmm. Unfortunately, after all these attempts, it seems our data may be incapable of modeling the high scorers properly. Regardless, this model still offers insight into what contributes to an average player’s goals besides just shooting. That’s what we were after to begin with.
With ml-insights we can get a bigger picture of how each feature impacts our predictions, all else held equal.
Our model identifies quite a few “presence” features as important. The most important, OZFO, is “Offensive Zone Faceoffs”.
It seems the more offensive zone (OZFO) and neutral zone (NZFO) faceoffs you take, the more likely you are to score. This makes a lot of sense: the closer you are to the opposing net, the more scoring chances you have.
What’s also interesting are the “pressure” features. You’d think getting the puck stolen from you would equate to scoring less, but it seems both takeaways (TK) and giveaways (GV) result in more goals.
Don’t mind the drop-off after 90 giveaways in the graph; I just don’t have any data points for a player giving away the puck that many times. What I think is happening with giveaways is that if you’re giving away the puck, you’re most likely heavily pressuring the defense, and they just happened to knock you off your attack sequence. The more you attack, the more scoring chances you have, even if you sometimes give the puck away.
Wow. I did not expect the results to be so intuitive. I’m sure any coach could have told you these results, but this absolutely confirms the fundamentals of the game.
- Shoot more, score more
- Attack, attack, attack. If you are in the offensive zone, you have more chances.
- Pressure, pressure, pressure. If you are aggressive in stealing and pushing your plays regardless of what happens, you have more scoring chances.
This reminds me of the Todd McLellan days of the Sharks. Our whole game plan back then was to just attack hard and shoot often. It all makes sense now.