Mariners Playoff Drought
April 2018
Putting my frustration as a Mariners fan to good use, I accessed public data sources to explore my favorite team's league-leading 16-year playoff drought. This project used web scraping and linear regression on a dataset dating back to 2002 that included game statistics, coaching tenure, and team payroll.
Motivation
I have been a Mariners fan all my life. It's a tough gig. Since the team launched in Seattle in 1977, more than four decades ago, the M's have made it to the playoffs a whopping four times: in 1995 and 1997, during my formative childhood years when our family spent 40+ nights a year at the Kingdome (RIP), and again in 2000 and after the record-setting 116-win 2001 season. We have never made it to the World Series. The current 16-year drought is the longest active playoff drought in any major professional sport, longer even than that of the NFL's Cleveland Browns. For this project, my urgent one-word research question was simply, "WHY??"
The Dataset
I used the Beautiful Soup library and Selenium WebDriver in Python to scrape a dataset from Baseball Reference and USA Today. The data included season totals for batting, pitching, and fielding statistics; the tenure in years of each team's manager (head coach); and total payroll for each American League (AL) team from 2002 to 2017. The Houston Astros moved from the NL to the AL in 2013, so they are included from then on. Because the goal of this project was to practice linear regression rather than classification, I needed a continuous outcome, so I modeled the percentage of games won rather than playoff berths themselves.
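As a rough illustration of how the outcome variable and the AL-only filter could be set up, here is a minimal sketch. The TeamSeason record, its field names, and the sample rows are hypothetical stand-ins for illustration, not the scraped dataset or the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class TeamSeason:
    # Hypothetical record layout; the real scraped columns may differ.
    team: str
    year: int
    league: str
    wins: int
    losses: int


def win_pct(row: TeamSeason) -> float:
    """Fraction of decided games won, between 0 and 1."""
    return row.wins / (row.wins + row.losses)


def al_rows(rows, first_year=2002, last_year=2017):
    """Keep AL team-seasons inside the study window.

    Filtering on the league recorded for each season handles the Astros
    without a special case: their NL seasons before 2013 simply drop out.
    """
    return [
        r for r in rows
        if r.league == "AL" and first_year <= r.year <= last_year
    ]


# Illustrative rows only, used to exercise the two helpers.
rows = [
    TeamSeason("SEA", 2001, "AL", 116, 46),   # before the study window
    TeamSeason("HOU", 2012, "NL", 55, 107),   # still an NL season
    TeamSeason("HOU", 2013, "AL", 51, 111),
    TeamSeason("SEA", 2017, "AL", 78, 84),
]

kept = al_rows(rows)
targets = [round(win_pct(r), 3) for r in kept]
```

Using a fraction of games won keeps the target continuous, which is what a linear regression expects.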
Model
I used a linear regression model to explore this problem. It is easy to build simple but useless models with this dataset. For example, using just average runs scored and average runs given up yielded highly statistically significant coefficients and a high R-Squared value of 0.86 (the max is 1). That looks good, but it's virtually useless; it's equivalent to saying that to win games, teams should just play better!
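To make the "simple but useless" model concrete, here is a minimal sketch of an ordinary least squares fit of win percentage on runs scored and runs allowed. It runs on synthetic data generated inside the block (the relationship and noise level are made up), so the R-Squared it produces is not the 0.86 reported above, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # synthetic team-seasons, NOT the scraped dataset

runs_scored = rng.normal(4.5, 0.4, n)    # average runs scored per game
runs_allowed = rng.normal(4.5, 0.4, n)   # average runs allowed per game

# Made-up relationship plus noise, only so there is something to fit.
win_pct = 0.5 + 0.1 * (runs_scored - runs_allowed) + rng.normal(0, 0.02, n)

# Design matrix with an intercept column, then a least-squares solve.
X = np.column_stack([np.ones(n), runs_scored, runs_allowed])
beta, *_ = np.linalg.lstsq(X, win_pct, rcond=None)

# R-Squared: share of variance in win_pct explained by the fit.
pred = X @ beta
ss_res = np.sum((win_pct - pred) ** 2)
ss_tot = np.sum((win_pct - win_pct.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"intercept={beta[0]:.3f}, scored={beta[1]:.3f}, allowed={beta[2]:.3f}")
print(f"R-Squared={r_squared:.3f}")
```

A fit like this scores well precisely because runs scored and allowed are nearly a restatement of winning, which is why it offers no actionable advice.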
Instead, I built a model that used data more within the immediate control of the team, including specific aspects of offense, defense, and pitching (e.g. singles and walks instead of "hits") plus manager tenure, team payroll, and a team dummy flag to check for a "Mariners curse." Many of these features were correlated with winning, but not as directly as the runs totals in the model above (see scatterplot for example). My resulting model had an adjusted R-Squared value of 0.83, slightly less than the useless model, but still explaining a respectable 83% of variance in teams' win-loss records.
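For readers comparing the two scores, adjusted R-Squared penalizes plain R-Squared for the number of features in the model. The helper below is a small sketch of the standard formula; the function name and the example numbers are illustrative and are not the observation or feature counts from this project.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_features: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n_obs is the number of rows and n_features the number of predictors,
    not counting the intercept.
    """
    dof = n_obs - n_features - 1
    if dof <= 0:
        raise ValueError("need more observations than features plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / dof


# Illustrative numbers only: the same raw fit looks worse as features pile up.
few = adjusted_r_squared(0.90, n_obs=101, n_features=2)
many = adjusted_r_squared(0.90, n_obs=101, n_features=10)
print(round(few, 4), round(many, 4))
```

Because the penalty grows with the feature count, a model with many specific statistics has to earn its score, which makes the adjusted figure the fairer one to quote for a wide model.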
Results and Recommendations
I trained the model on 70% of the data and tested on the remaining 30%. After regularizing with cross-validated Elastic Net (weighted toward Ridge regularization, though it was the Lasso component of the penalty that reduced many feature coefficients to zero, since Ridge alone only shrinks them), a few features came back with small but statistically significant coefficients.
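A workflow like the one described might look roughly like the sketch below, assuming scikit-learn and NumPy are installed. The data are synthetic, and the candidate l1_ratio values, fold count, and random seeds are assumptions chosen for illustration rather than the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 8 features, 3 of which matter.
rng = np.random.default_rng(42)
n_rows, n_features = 300, 8
X = rng.normal(size=(n_rows, n_features))
true_coefs = np.array([0.8, -0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coefs + rng.normal(scale=0.3, size=n_rows)

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# l1_ratio below 0.5 weights the penalty toward Ridge (L2); the remaining
# L1 share is what can push coefficients to exactly zero.
l1_ratios = [0.1, 0.25, 0.4]
model = make_pipeline(
    StandardScaler(),  # put features on one scale before penalizing
    ElasticNetCV(l1_ratio=l1_ratios, cv=5, random_state=42),
)
model.fit(X_train, y_train)

enet = model.named_steps["elasticnetcv"]
test_r2 = model.score(X_test, y_test)        # R-Squared on held-out rows
n_zeroed = int(np.sum(enet.coef_ == 0))      # coefficients dropped entirely

print(f"chosen l1_ratio={enet.l1_ratio_}, alpha={enet.alpha_:.4f}")
print(f"test R-Squared={test_r2:.3f}, zeroed coefficients={n_zeroed}")
```

Scaling inside the pipeline means the scaler is fit on training rows only, so the held-out 30% stays untouched until scoring.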
Unfortunately, no single feature seems to be driving the drought; it's a little of everything. We can say that singles, home runs, walks, being hit by pitch, and intentional walks are correlated with higher win percentages, while being caught stealing and leaving runners on base are correlated with lower ones. Neither manager tenure nor team payroll was significant in the current model. In a piece of potentially good news, the "Mariners curse" variable was not significant, either.
Next Steps
This model could be extended by adding in richer data features to help round out the picture of what has happened since 2002. These data are available but were not included in the original analysis because they are either behind a paywall or are difficult to scrape.
Injury Data: Who was injured and for how long? How many million dollars of talent sat on the bench?
Top Players: How many award-winning players did each team have, and in which positions?
Leadership: When did ownership and/or General Managers change?
Rising Talent: How strong was each team's farm system? How did its AAA team perform?