CS 4800 final research paper

NHL: An Expected Goals Model
Testing different modeling techniques to predict the
success of shots in the NHL
Richard Ramsey
12/18/2015
CS 4800 – Fall 2015
Using a sample of shots taken in the NHL from the 2008-09 season to the 2014-15 season, I tested the
significance of shot distance, relative angle to goal, and other factors in predicting whether or not the
shot was scored. I developed two different forms of models: one a linear regression model with
interactions between different independent variables, and the other a linear mixed-effects model that
treated shot type as a grouping factor to generate individual intercepts and coefficients for each shot
type. The resulting models gave an expected probability of each shot going in.

INTRODUCTION
Over the past decade, analytics and statistical analysis have grown rapidly in hockey. One
of the most important questions in the field is which shots are most likely to be converted into
goals. The fluidity of hockey play makes it difficult to capture all of the variables that
contribute to the game situation for each shot, but a better understanding of the differences
between high-percentage and low-percentage shots has far-reaching implications for analysis.
Some in the NHL community refer to this line of analysis as shot quality (Krzywicki). Work in
this field has largely been done privately for the benefit of NHL teams, which have an incentive
not to share any advances they have made.
In 2012, Brian Macdonald presented an expected goals model for the NHL at the Sloan
Sports Analytics Conference. His model used statistics accumulated over the course of a game,
such as shots, turnovers, and hits, to predict the total number of goals that a team would score
(Macdonald). It turned out to have significant predictive power, and could be used to evaluate
both teams and players based on expected goals. However, this model did not examine each
individual shot, which left the question of shot quality open for further analysis. Being able to
better determine the quality of a shot would allow more granular analysis of both players
and teams. An individual shot-based expected goals model would allow for a better
understanding of which players are better or worse than average at converting each type of shot,
and of which teams create or concede high-percentage shots. It could even have implications for
offensive and defensive strategy, if the effects of shot distance and angle turned out to differ
from the conventional wisdom. Based on this, I hypothesized that shot distance and relative
angle to goal are both significant predictors of shot conversion.

DATA AND METHODS
To obtain a large sample of data to work with, I used the nhlscrapr package in R.
This package scrapes NHL play-by-play data from the NHL’s official site, and converts the raw
play-by-play data into a usable data frame in R. The data available through the nhlscrapr
package go back to the 2002-03 season, but x-coordinate and y-coordinate data only go as far
back as the 2008-09 season. Given the hypothesis, and the hockey intuition, that the distance and
angle from which a shot is taken significantly affect its conversion rate, I restricted
my sample to shots from the 2008-09 season and later.
Using this sample of data from the 2008-09 to the 2014-15 season, I had 530,053 shots to
work with. However, not all of these shots came in the course of typical game situations, because
the nhlscrapr package just pulls every play-by-play event logged by the NHL. First, I filtered out
any penalty shots and shootout goals, because those are completely different situations from the
natural state of play. Second, I removed empty-net goals, because I felt that these would distort
the predicted conversion rates from different angles and distances, given that there was no goalie
in to stop them. After removing these data, a negligible number of the remaining shots lacked
x-coordinates or y-coordinates, and I removed those as well.
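
The filtering steps above can be sketched as follows. This is a minimal Python illustration, not the R code used for the paper; the record fields (`is_penalty_shot`, `is_shootout`, `empty_net`, `x`, `y`) are hypothetical names chosen for the example and are not nhlscrapr column names.

```python
def keep_shot(shot):
    """Return True if a shot record survives the filters described above.

    `shot` is a dict; every key used here is a hypothetical field name.
    """
    if shot["is_penalty_shot"] or shot["is_shootout"]:
        return False  # not part of the natural state of play
    if shot["empty_net"]:
        return False  # no goalie in net to stop the shot
    if shot["x"] is None or shot["y"] is None:
        return False  # distance and angle cannot be computed
    return True


def filter_shots(shots):
    """Keep only the shots that pass every filter."""
    return [s for s in shots if keep_shot(s)]
```
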
In the final sample of shots, I created two variables based on the play-by-play data.
Relative angle represented the shooter’s relative angle to the goal on a scale ranging from 0, the
widest possible angle, to 1, straight in-line with the goal. A binary variable pp represented
whether or not the shooter’s team was on a power play at the time, meaning they had at least one
more skater on the ice than the opposing team.
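
One way these derived variables could be computed is sketched below. The paper does not give its formulas, so everything here is an assumption: coordinates are taken to be in feet with center ice at (0, 0), the attacked goal is placed at x = 89, and relative angle is taken to be a linear rescaling of the shot angle. The function names are hypothetical.

```python
import math

GOAL_X = 89.0  # assumed goal-line x-coordinate in feet (center ice at 0, 0)
GOAL_Y = 0.0   # assumed center of the goal mouth


def shot_distance(x, y):
    """Straight-line distance in feet from the shot location to the goal center."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)


def relative_angle(x, y):
    """Scale the shot angle to [0, 1]: 1 is straight on, 0 is along the goal line."""
    # Angle between the shot and the line running straight out from the goal.
    theta = math.atan2(abs(y - GOAL_Y), GOAL_X - x)
    # Shots from behind the goal line give theta > 90 degrees; clamp those to 0.
    return max(0.0, 1.0 - theta / (math.pi / 2))


def power_play(shooting_team_skaters, opposing_team_skaters):
    """Binary pp flag: 1 if the shooter's team has at least one more skater."""
    return 1 if shooting_team_skaters > opposing_team_skaters else 0
```

Under these assumptions, a shot from directly in front of the net has a relative angle of 1, and a shot taken from a point on the goal line has a relative angle of 0, matching the scale described above.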

Table 1 – Summary of Sample Data
Shot Type   Total Shots   Goals
Backhand          41876    4313
Deflected          6778    1415
Slap             113582    6260
Snap              74741    6491
Tip-In            23090    4473
Wrap               6874     382
Wrist            244965   20392
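
The raw conversion rate for each shot type follows directly from Table 1. The short Python sketch below does that arithmetic; the counts are copied from the table, and the function name is introduced here only for illustration.

```python
# Shot counts and goals copied from Table 1: type -> (total shots, goals).
TABLE_1 = {
    "Backhand": (41876, 4313),
    "Deflected": (6778, 1415),
    "Slap": (113582, 6260),
    "Snap": (74741, 6491),
    "Tip-In": (23090, 4473),
    "Wrap": (6874, 382),
    "Wrist": (244965, 20392),
}


def conversion_rates(table):
    """Return goals divided by shots for each shot type."""
    return {shot_type: goals / shots for shot_type, (shots, goals) in table.items()}


total_shots = sum(shots for shots, _ in TABLE_1.values())
total_goals = sum(goals for _, goals in TABLE_1.values())

# Print shot types from the highest conversion rate to the lowest.
for shot_type, rate in sorted(conversion_rates(TABLE_1).items(), key=lambda kv: -kv[1]):
    print(f"{shot_type:<10}{rate:6.1%}")
print(f"{'Overall':<10}{total_goals / total_shots:6.1%}")
```

The table sums to 511,906 shots and 43,726 goals, an overall conversion rate of about 8.5%. Deflected shots and tip-ins convert at roughly 19 to 21%, while slap and wrap shots convert at under 6%.
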
From this sample, I used random sampling to create a build sample and a holdout sample
of shots. The build sample contained 75% of the overall sample, while the holdout sample
contained the remaining 25%. In order to avoid overfitting in the models, I only used the build
sample to train the models, and tested their predictive power on the holdout sample.
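
A random 75/25 split of this kind can be sketched in a few lines. This is a Python illustration rather than the R code used for the paper; the function name, the fixed seed, and the stand-in data are assumptions made for the example.

```python
import random


def split_build_holdout(shots, build_fraction=0.75, seed=0):
    """Randomly partition `shots` into (build, holdout) lists."""
    shuffled = list(shots)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * build_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: 1000 shot identifiers instead of real shot records.
build, holdout = split_build_holdout(range(1000))
```

Fixing the seed makes the split reproducible, so every candidate model is trained and tested on the same two samples.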
I began with two different forms of predictive model, and refined the formula of each
through holdout testing. The first was a simple linear regression model, which estimates a
coefficient for the linear relationship between the dependent variable and each of the predictor
variables. The second was a linear mixed-effects model, which allows factors in the model to be
treated as random variables rather than as fixed parameters. This means that the grouping
factors specified in the model are given random intercepts and slopes in relation to other
variables. The type of shot, as categorized in the scraped data, was an intuitive candidate for
such a grouping factor. I tested this form to examine the relationship between the type of shot
and the other predictor variables in the model, so that those variables could have different
effects for each type of shot (Bates et al.).

RESULTS
Using these techniques, I tested out different formulas and relationships between
independent variables for both the linear regression and the linear mixed-effects model.
Table 2 – First stage of the linear regression model, testing significance of distance and relative angle
The first model that I tested was a linear regression using distance and relative angle as
the independent variables, in order to test their significance in predicting shot conversion. Later
iterations of the linear regression model included the shot type.
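
A first-stage model of this form, a linear regression of the 0/1 goal outcome on distance and relative angle, can be fit by ordinary least squares. The sketch below solves the normal equations using only the Python standard library; it is an illustration rather than the R code behind Table 2, the function names are hypothetical, and the example numbers are made up.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Ordinary least squares: solve the normal equations (X'X) beta = X'y.

    Each row in `rows` is a list of predictors; an intercept column is added.
    """
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(beta, row):
    """Fitted value for one shot: intercept plus coefficient times predictor."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


# Made-up example: [distance, relative angle] per shot and whether it scored.
example_rows = [[10.0, 0.9], [15.0, 0.2], [35.0, 0.5], [50.0, 0.8], [60.0, 0.3]]
example_goal = [1, 1, 0, 0, 0]
example_beta = fit_least_squares(example_rows, example_goal)
```

Because the outcome is binary, the fitted value is read as the expected probability of a goal. A linear regression does not constrain that value to lie between 0 and 1.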
Table 3 – Final linear regression model

After testing the inclusion of different interactions between independent variables, as
well as the overall formula, I arrived at the above model. The log of (1 + shot distance) is
interacted with relative angle, and untransformed shot distance is interacted with shot type.
Adding one to shot distance keeps the log transformation defined for a shot recorded at zero
distance. A binary variable pp, indicating whether a team is on the power play, is also
included and is significant. There is no log transformation on shot distance in its interaction
with shot type because the relationship varies significantly by shot type, so the transformation
was not necessarily the best representation of that relationship.
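
The terms described above can be sketched as a feature-building step. This Python function is an illustration only: the term names are hypothetical, and because the text does not say which main effects accompany the interactions, only the terms it names are built.

```python
import math

SHOT_TYPES = ["Backhand", "Deflected", "Slap", "Snap", "Tip-In", "Wrap", "Wrist"]


def final_model_terms(distance, relative_angle, shot_type, pp):
    """Build the terms named in the text for one shot (names are hypothetical)."""
    terms = {
        # log(1 + distance) stays defined when distance is 0.
        "log1p_distance:relative_angle": math.log1p(distance) * relative_angle,
        # Power-play indicator, entered as 0.0 or 1.0.
        "pp": float(pp),
    }
    # Untransformed distance gets a separate slope for each shot type.
    for t in SHOT_TYPES:
        terms["distance:" + t] = distance if shot_type == t else 0.0
    return terms
```
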
Table 4 – Summary of final linear mixed-effects model
The development of the linear mixed-effects model began with the key assumption that
shot type was an intuitive grouping factor. Initially, I included shot type as a grouping
factor for the intercept only. The model also converged with shot type as a grouping factor for
both the intercept and the relative angle coefficient, meaning that each shot type had its own
intercept and its own relative angle coefficient.
Table 5 – Random effect coefficients of shot type
The interaction between log(1 + shot distance) and relative angle was also included in
this model, as well as the binary pp variable.
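
Written out, a model with this structure might take the form below. This is a sketch under assumptions, not the fitted model: all symbols are introduced here for illustration, a fixed relative angle coefficient and correlated random effects are assumed, and the text does not say whether a main effect for log distance accompanies the interaction, so none is shown.

```latex
y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, a_{ij}
       + \beta_2 \log(1 + d_{ij})\, a_{ij} + \beta_3\, pp_{ij} + \varepsilon_{ij},
\qquad
(u_{0j}, u_{1j})^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here y_{ij} is 1 if shot i of shot type j was scored and 0 otherwise, a_{ij} is the relative angle, d_{ij} is the shot distance, and pp_{ij} is the power-play indicator. The terms u_{0j} and u_{1j} are the shot-type-specific deviations in the intercept and in the relative angle coefficient.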

DISCUSSION
As hypothesized, the distance and relative angle of the shot are both strong predictors of
shot conversion at a high level of statistical significance. To evaluate each model, I initially
looked at the significance of the overall model and of the coefficients on each variable and
interaction. To differentiate between models, however, I tested them on the holdout sample.
Table 6 - Gains chart for simple linear regression model

Using the gains package in R, I created a gains chart for each of the prospective models
on the holdout sample, in order to evaluate the predictive power of the models on data they were
not trained on. The gains chart orders data points in the sample by their predicted probability,
in this case the chance that a shot is scored, from highest to lowest. Depth of file is the
cumulative percentage of the sample in that order (10 is the top 10 percent of shots by predicted
probability). Mean response (Mean Resp) is the mean of the binary variable indicating whether or
not the shot was scored. Cumulative percentage of total responses (Cume Pct of Total Resp) is
the percentage of all goals captured within the depth of file when ordered by the model’s
probability. For example, 26% of all goals in the holdout sample were in the top 10% of
calculated probabilities by the simple linear regression model. A higher cumulative percentage
of total responses at a low depth of file means that the model has more predictive power.
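
The quantities in a gains chart can be computed directly from the predictions and outcomes. The sketch below is a Python illustration of that calculation, not the gains package itself; the function and key names are hypothetical, and it assumes at least one shot per group and at least one goal in the sample.

```python
def gains_table(predicted, scored, groups=10):
    """Return one row per depth-of-file group, highest predictions first.

    `predicted` holds model probabilities and `scored` the matching 0/1 outcomes.
    """
    # Order outcomes from the highest predicted probability to the lowest.
    ranked = [s for _, s in sorted(zip(predicted, scored), key=lambda pair: -pair[0])]
    n, total_goals = len(ranked), sum(ranked)
    rows, cumulative_goals = [], 0
    for g in range(1, groups + 1):
        start, end = (g - 1) * n // groups, g * n // groups
        bucket = ranked[start:end]
        cumulative_goals += sum(bucket)
        rows.append({
            "depth_of_file": 100 * g // groups,
            "mean_resp": sum(bucket) / len(bucket),
            "cume_pct_of_total_resp": 100 * cumulative_goals / total_goals,
        })
    return rows
```

A model with no predictive power would capture about 10% of goals in each 10% of the file, so the gap between the cumulative percentage and the depth of file is the lift.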
Table 7 - Gains chart for final linear regression model
The final linear regression model generated a lift over the simple linear regression model
that I began with, especially beyond the top 10 percent of predicted chance of scoring. The
interaction between shot type and shot distance captured an important relationship in terms of the
probability of scoring.

Table 8 - Gains chart for the final mixed-effects model
The linear mixed-effects model showed gains similar to those of the final linear regression
model, though it did not have as much of a lift over the initial linear model. There is a slight
lift over the final linear regression in the 10th to 20th percentile of predicted probability, but
that lift is not sustained over the whole sample. The mixed-effects model may require more
detailed interactions and nested grouping factors, but the tradeoff in computational complexity
and ease of explanation may not be worth the marginal gains. Both of the final models generated
lift over the initial linear regression, so I conclude that interactions between shot type and
other independent variables have additional predictive power beyond shot distance and relative
angle alone.

WORKS CITED
A.C. Thomas and Samuel L. Ventura (2014). nhlscrapr: Compiling the NHL Real Time Scoring
System Database for easy use in R. R package version 1.8.
http://CRAN.R-project.org/package=nhlscrapr
Douglas Bates, Martin Maechler, Ben Bolker, and Steve Walker (2014). lme4: Linear mixed-effects
models using Eigen and S4. R package version 1.1-7.
http://CRAN.R-project.org/package=lme4
Douglas Bates, Martin Maechler, Ben Bolker, and Steve Walker (2015). Fitting Linear Mixed-
Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48.
doi:10.18637/jss.v067.i01.
Craig A. Rolling (2013). gains: Gains Table Package. R package version 1.1.
http://CRAN.R-project.org/package=gains
Krzywicki, Ken. "NHL Shot Quality 2009-10." Hockey Analytics. Hockey Analytics, 22 Oct.
2010. Web. 13 Oct. 2015. http://hockeyanalytics.com/2010/10/nhl-shot-quality-2010/
Macdonald, Brian. "An Expected Goals Model for Evaluating NHL Teams and Players." Sloan
Sports Analytics Conference. MIT, 3 Mar. 2012. Web. 13 Oct. 2015.
http://www.sloansportsconference.com/wp-content/uploads/2012/02/NHL-Expected-Goals-Brian-Macdonald.pdf
