Ordinary least squares and censored regression models for predicting attendance at MLB games in the 2008-2012 seasons are presented, building on the prior work of Robert J. Lemke, Matthew Leonard, and Kelebogile Tlhokwane, “Estimating Attendance at Major League Baseball Games for the 2007 Season”, Journal of Sports Economics, Vol. 11(3), 316 (2010).
Estimating Attendance at Major League Baseball Games for the 2008-2012 Seasons
1. Streips, Suen, Sullivan, Zerweck 45-752 Project (Trick) April 30, 2014
INTRODUCTION
Factors such as the starting pitcher, temperature and weather, team record, and traffic all
play a role in attendance. However, these factors cannot be known when the schedule is set and
so cannot be used for planning ahead. As consultants to Major League Baseball (MLB), our group
has the primary goal of increasing attendance through statistical analysis. Using data from over
12,000 games across five seasons, we recommend changes MLB can make to the schedule to
increase attendance.
THE SPECIFICATION (MODEL)
The choice of estimation procedure builds upon a prior study of MLB attendance in the 2007
season by Lemke et al. Both game attendance and log attendance are used as the
dependent variables in ordinary least squares (OLS) and censored regression (CR) models.
Right-censored regression is used to model the effects of capacity on “sell-out” games. All
models are fixed-effect (FE) models in which each home team receives its own fixed effect to
account for local market conditions and intercity variations. We assume that unobservable
factors that might simultaneously affect the left-hand and right-hand sides of the regression are
time-invariant. Explanatory variables include time factors (day of week, time of day, year,
month); event factors (interleague games, opening day games, and games on holidays); and
whether two games are played in the same metropolitan area on the same day (New York City,
San Francisco Bay Area, Chicago, Washington, DC, and Los Angeles). The OLS models include
an AR(1) term to account for serial correlation of errors in the time-series data. The Newey-West
estimator is used to correct the standard errors for autocorrelation and heteroskedasticity in the
error terms of the OLS models, which weakens the assumptions the models require. Nine dummy
variables control for the day of the week and the time of the game. There is a separate dummy
variable for each day, Monday through Friday, plus a variable for playing a day game during the
week. Saturday and Sunday games are each further separated by time of day, with Sunday
afternoon as the omitted baseline. Additionally, there are five dummy variables to control for the
month and four more variables to control for the year.
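As an illustration of the day-and-time coding described above, the following Python sketch builds the nine dummy variables from a game date and a day/night flag. The variable names and the function day_time_dummies are hypothetical and are not taken from the regression output. A Sunday afternoon game is the omitted baseline, so all nine indicators are zero for it.

```python
from datetime import date

# Hypothetical names for the nine day/time dummies described in the text.
DAY_TIME_DUMMIES = [
    "Mon", "Tue", "Wed", "Thu", "Fri",   # one indicator per weekday
    "WeekdayDay",                        # day game played Monday through Friday
    "SatDay", "SatNight", "SunNight",    # weekend games split by time of day
]


def day_time_dummies(game_date: date, is_day_game: bool) -> dict:
    """Return the nine 0/1 indicators for one game.

    A Sunday day game is the baseline, so every indicator stays 0 for it.
    """
    row = {name: 0 for name in DAY_TIME_DUMMIES}
    weekday = game_date.weekday()  # Monday = 0, ..., Sunday = 6
    if weekday <= 4:
        row[DAY_TIME_DUMMIES[weekday]] = 1
        if is_day_game:
            row["WeekdayDay"] = 1
    elif weekday == 5:
        row["SatDay" if is_day_game else "SatNight"] = 1
    elif not is_day_game:
        row["SunNight"] = 1
    return row
```

Under this coding a weekday day game switches on two indicators (its day plus WeekdayDay), which is why the weekday day-game effect can be read as an increment on top of the day-of-week effect.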
THE DATA
The data include the date, time of day, and attendance records of all MLB games played
over the 2008-2012 seasons (inclusive), for a total of 12,100 observations. Mean attendance at
MLB games was 30,860 people for the period in question, ranging from 8,269 (TOR vs. TMB
on April 22, 2008) to 57,099 (SFN vs. LAN on April 13, 2009) (see the full descriptive
statistics in Table 2, Appendix). The observations also include indicator variables for whether
each game was at capacity, was played on opening day or a holiday, involved interleague play,
or was held on the same day as another game in the same metropolitan area.
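The at-capacity indicator is what allows a right-censored regression to treat sell-outs differently from other games. A minimal Python sketch of the log-likelihood for such a model follows, assuming normally distributed errors (a Tobit-type specification, which this report does not state explicitly). The function names are hypothetical.

```python
import math


def normal_logpdf(z: float) -> float:
    """Log density of the standard normal distribution at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def normal_sf(z: float) -> float:
    """Upper-tail probability P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def censored_loglik(attendance, fitted, capacity, at_capacity, sigma):
    """Log-likelihood of a right-censored normal regression.

    attendance  : observed attendance per game
    fitted      : model prediction (x'beta) per game
    capacity    : stadium capacity per game (the censoring point)
    at_capacity : True where the game sold out
    sigma       : standard deviation of the error term (assumed normal)
    """
    total = 0.0
    for y, mu, cap, sold_out in zip(attendance, fitted, capacity, at_capacity):
        if sold_out:
            # Sell-out: we only know that demand was at least the capacity.
            total += math.log(normal_sf((cap - mu) / sigma))
        else:
            # Ordinary game: usual normal density of the residual.
            total += normal_logpdf((y - mu) / sigma) - math.log(sigma)
    return total
```

The only difference from an OLS likelihood is the sell-out branch, where the observed attendance is replaced by the probability that demand reached capacity.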
REGRESSION RESULTS
When attendance or log attendance is used as the dependent variable, estimated
coefficients are interpreted as changes in attendance or approximate percentage changes in
attendance, respectively. For example, under the OLS model, a Thursday night game averages
3,288 fewer attendees than a Sunday afternoon game (see the OLS regression in Table 3,
Appendix). Using log attendance, the same comparison would be interpreted as 14.41 percent
fewer in attendance. The baseline is attendance at a Sunday afternoon game held in FLO in April
2008 that is not on opening day, not on a holiday, and not an interleague game (21,007 people).
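The two interpretations can be worked through numerically. The Python sketch below uses the baseline and the Thursday level coefficient quoted above. The log coefficient of -0.1441 is an assumption, inferred from the "14.41 percent" figure rather than read from the appendix tables. It also shows the exact percentage change implied by a semi-log coefficient, exp(b) - 1, which reading the coefficient directly as a percentage only approximates.

```python
import math

BASELINE_ATTENDANCE = 21_007    # Sunday afternoon, FLO, April 2008 (from the text)
THURSDAY_LEVEL_COEF = -3_288    # OLS level model (from the text)
THURSDAY_LOG_COEF = -0.1441     # ASSUMED: implied by "14.41 percent fewer"


def predicted_level(baseline: float, coef: float) -> float:
    """Level model: the coefficient is added directly to the baseline."""
    return baseline + coef


def exact_pct_change(log_coef: float) -> float:
    """Semi-log model: exact percentage change implied by a coefficient."""
    return 100.0 * (math.exp(log_coef) - 1.0)


thursday_night = predicted_level(BASELINE_ATTENDANCE, THURSDAY_LEVEL_COEF)
thursday_pct = exact_pct_change(THURSDAY_LOG_COEF)
```

Under the assumed coefficient, the exact change is a decrease of roughly 13.4 percent, slightly smaller in magnitude than the 14.41 percent approximation.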
Based on the CR models, the semi-log functional form is judged to be the better model
according to the Akaike information criterion (0.427297 vs. 16.6486). In both CR models, the only
factors that are not statistically significant are OAK and the simultaneous game cities (except
NY2), which confirms the conclusions that may be drawn from the OLS models.
The proportions of the variance in attendance and log attendance explained by the
OLS models are 0.6919 and 0.6752, respectively, with the reported adjusted R-squared values
at or slightly below these (0.6904 and 0.6752). All OLS model coefficients are statistically
significant (at the 0.05 level) with the exception of the simultaneous game variables, Sunday
night games, home team game attendance at OAK, and (for the log attendance model) Friday
night games (see Table 4, Appendix). The simultaneous game coefficients were left in the model
to support the findings and recommendations of this report. Removing these variables from the
model did not have a significant impact on the ability of the model to explain variability in
attendance. The signs and magnitudes of the coefficients are in alignment with expectations
relative to the baseline (FLO having the lowest league attendance) and with the descriptive
statistics of the data set (see Table 2, Appendix). Leverage plots were produced for each
coefficient and did not suggest nonlinearities. The model was rejected by the Ramsey test, but
given the large time-series data set, we hold the Ramsey test to be uninformative. Choosing the
functional form to be untransformed or semi-log is supported by the academic literature.
From the model we make a few general observations. Monday through Thursday games
draw significantly fewer fans than Saturday or Sunday afternoon games. On weekdays, day
games draw slightly higher attendance than night games. Attendance is expected to be lower in
September than in July and August, and higher on major holidays.
FINDINGS AND RECOMMENDATIONS
Monday vs. Thursday Off Days
The most commonly scheduled off days in the league are Monday and Thursday, when
teams often travel home or away for a new series. In our OLS regression results (Table 3,
Appendix), Monday and Thursday both show a statistically significant negative attendance
effect when compared with the baseline of Sunday daytime games. At first glance, Monday
appears to have a larger negative effect on attendance than Thursday, but to be certain, we
conduct a Wald test (Table 7, Appendix).
For this Wald test, our null hypothesis was Monday + Daytime = Thursday + Daytime. This
resulted in a p-value of 0.1865, which means that we do not have enough evidence at the 0.05
level to reject the hypothesis that Monday and Thursday games draw the same attendance.
From a statistical standpoint, we cannot distinguish Monday from Thursday games, but from a
managerial perspective, it is worth noting that the point estimates still differ. It may be prudent to
slightly favor Monday off days when scheduling because the Monday coefficient has a larger
negative effect on attendance.
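Because Daytime appears on both sides of the null hypothesis, the restriction reduces to Monday = Thursday, a single linear restriction on two coefficients. The Python sketch below shows how the Wald statistic and its chi-square(1) p-value can be computed for such a restriction. The function name is hypothetical, and the Monday coefficient, variances, and covariance in the usage lines are placeholders rather than values from Table 3 or Table 7; in practice they would come from the estimated coefficient covariance matrix.

```python
import math


def wald_equal_coefs(b1, b2, var1, var2, cov12):
    """Wald test of H0: b1 = b2 for two estimated coefficients.

    Returns (W, p) where W = (b1 - b2)^2 / Var(b1 - b2) and p is the
    upper-tail probability of a chi-square distribution with 1 degree
    of freedom, computed as erfc(sqrt(W / 2)).
    """
    diff_var = var1 + var2 - 2.0 * cov12
    w = (b1 - b2) ** 2 / diff_var
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p


# Placeholder inputs for illustration only (not the report's estimates,
# except the Thursday coefficient of -3,288 quoted in the text).
w_stat, p_value = wald_equal_coefs(
    b1=-3600.0,      # hypothetical Monday coefficient
    b2=-3288.0,      # Thursday coefficient from the text
    var1=40000.0,    # hypothetical variance of the Monday estimate
    var2=40000.0,    # hypothetical variance of the Thursday estimate
    cov12=24000.0,   # hypothetical covariance between the two estimates
)
reject_at_5_percent = p_value < 0.05
```

The covariance term matters: day-of-week dummies share a baseline, so their estimates are typically positively correlated, which shrinks the variance of the difference.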
Annual Attendance
Using numbers from the OLS regression (Table 3, Appendix), we put together an annual
attendance graph (Figure 1, Appendix) as implied by the annual indicator variables (2008-2012).
This information will give us the means to analyze some very general attendance trends for Major
League Baseball.
We notice that our baseline year of 2008 indicates peak annual attendance, followed by
strong declines through 2010. The trend then turns upward with some weak growth in 2011 and
2012. We conclude that the trend in attendance is directly related to the Great Recession, which
officially lasted from December 2007 to June 2009 in the U.S. (source:
http://www.nber.org/cycles.html).
Looking at a chart of real GDP (source: http://www.multpl.com/us-gdp-inflation-adjusted/table;
Figure 2, Appendix), we can see that baseball attendance seems to follow these
trends, lagging by about one year. One very important concern is that baseball attendance has
not recovered as quickly as the rest of the American economy. While the league’s growth trend is
positive, it should try to identify other factors that may be causing the slower recovery. It should
also use this data to anticipate attendance in the event of a future economic downturn. If MLB can
use GDP as an indicator, it can better anticipate and prepare for losses caused by poor
attendance.
Should the MLB be concerned with multiple intra-city games on the same day?
While none of the two-game variables in our OLS model (NY2, BAY2, CHI2, DC2, and LA2)
were statistically significant at the 0.05 level, we believe some of the coefficients still have a
useful interpretation. When both New York teams play at home on the same day, the estimated
effect is an increase of 1,564 in attendance, with a p-value of 0.1264 (significant at roughly the
0.13 level). When both Bay Area teams play in the Bay Area, the estimated effect is a 1,102 drop
in attendance, significant at roughly the 0.15 level. For Chicago, the estimated effect is a
656-person increase in attendance, significant at roughly the 0.20 level. NY2 is
statistically significant under our CR model analysis, further highlighting the managerial
significance of simultaneous games in the New York metropolitan area.
These numbers are what we call managerially significant. While they are not strong enough
to support firm statistical predictions, we recommend using them to make educated decisions,
with the realization that those decisions will occasionally be incorrect. The positive NY and
Chicago effects could possibly be explained by the rivalry between the intra-city teams. Advising
the NY and Chicago teams to work together to schedule same-day home games would be a good
idea, but it should be emphasized that this should not be a priority. Because there are fewer than
25 two-game days in NY per season, other factors (e.g., special city-wide events) that are not
accounted for in the data could have affected attendance on those specific days.
The Bay Area is unique because of the negative overall effect implied. One possible
explanation is that the Giants are much more popular than the A’s, as evidenced by the HTeam
coefficients of 15,789 for the Giants and 198 for the A’s (HTeam=“OAK” is far from statistically
significant, suggesting no difference in attendance from the baseline). This suggests that when
the Giants and A’s play on the same day in the Bay Area, the Giants overpower the A’s and there
is an overall negative effect. It could also be explained by the fact that these two teams do not
have a rivalry with high levels of animosity, unlike the NY and Chicago teams.
Should the MLB care about day versus night games?
Sunday afternoon games are the baseline in the regression, and Saturday games and
Sunday night games all draw more than this baseline. Saturday and Sunday night games see
an overall increase of 4,209 and 958, respectively. The main explanation for this is that people
generally have more free time on weekends. Furthermore, weekday (including Friday) day games
on average have 757 more in attendance than weekday night games. Our intuitive explanation for
this is that weekday night games do not end until late at night and many people have to work
the following morning. Additionally, many people take advantage of the “businessperson special”
games and promotion/giveaway games that are held in the daytime.
Should the MLB move the schedule to start later in April and end in October?
Attendance increases as the season continues, peaking in July and August and dipping
in September, though remaining higher than in April (Figure 3, Appendix). While the end of the
season still has better attendance than the beginning, shifting the schedule later brings more
uncertainty from cold weather in some cities, the start of the football season, and the effect of
the playoffs on attendance. However, the combined effect of summer weekend games is even
more powerful (Table 1). For this reason, we would recommend eliminating as many April and
September games as possible and replacing them with day/night weekend doubleheaders in July
and August.
            Saturday Day    Saturday Night    Sunday Night
July           +6,390           +7,943           +4,692
August         +5,547           +7,100           +3,849
Table 1: Coefficients of Saturdays and Sundays during Peak Months
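The entries in Table 1 appear to be sums of a day/time coefficient and a month coefficient. The Python sketch below checks that reading, assuming no interaction terms between month and day/time. The Saturday night (4,209) and Sunday night (958) coefficients are quoted in the text. The July, August, and Saturday day values are not quoted anywhere in this report; they are backed out from the table by subtraction, and the function names are hypothetical.

```python
SAT_NIGHT = 4_209   # from the text
SUN_NIGHT = 958     # from the text

# Combined effects as printed in Table 1.
TABLE_1 = {
    "July":   {"Saturday Day": 6_390, "Saturday Night": 7_943, "Sunday Night": 4_692},
    "August": {"Saturday Day": 5_547, "Saturday Night": 7_100, "Sunday Night": 3_849},
}


def implied_month_effect(month: str) -> int:
    """Month coefficient implied by the Saturday night column (derived, not quoted)."""
    return TABLE_1[month]["Saturday Night"] - SAT_NIGHT


def implied_saturday_day() -> int:
    """Saturday day coefficient implied by the July row (derived, not quoted)."""
    return TABLE_1["July"]["Saturday Day"] - implied_month_effect("July")


def rebuild_table() -> dict:
    """Recompute every cell as day/time coefficient + month coefficient."""
    day_time = {
        "Saturday Day": implied_saturday_day(),
        "Saturday Night": SAT_NIGHT,
        "Sunday Night": SUN_NIGHT,
    }
    return {
        month: {slot: coef + implied_month_effect(month) for slot, coef in day_time.items()}
        for month in TABLE_1
    }
```

Rebuilding the table this way reproduces all six printed cells, which supports the additive reading: the values measure the gain over an April Sunday afternoon game, other factors held fixed.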
Because this recommendation would likely be resisted by the players’ union, we would
also recommend starting and ending the season later. Overall, the data suggest that doing so
would increase attendance; however, we remain cautious as autocorrelation could affect the
prediction.
CONCLUSION
In conclusion, our study of attendance at MLB games for the 2008-2012 seasons yields
the following observations:
The league should not be concerned with Monday versus Thursday off days, as the two effects
were not statistically different from each other. While baseball attendance had not returned to
2008 levels by the end of 2012, overall attendance seems to be correlated with the Great
Recession and disposable income. New York, Chicago, and Bay Area teams should be aware of
the effect of having multiple intra-city games on the same day. However, this should not be a
major concern, as the estimated effects are significant only at roughly the 0.13 to 0.20 levels.
Day games have higher attendance than night games on weekdays, but this effect is reversed
and magnified for Saturday and Sunday. If possible, the league should cut games from the
beginning of the season in April and make them up in the form of doubleheaders on weekends in
July and August. If this is not realistic, the league should cautiously begin to start and end the
season later in the year, but beware of playoff and temperature effects.