538's MLB Elo system attempts to do two distinct things:
1) Long term predictions for post-season probabilities (make playoffs, win division, win world series)
and
2) Short term predictions for outcomes given some Elo adjustments like opening pitcher, travel time and rest.
This post will deal with the latter.
In an attempt to improve my programming chops, I used a Python script I developed to scrape and reorganize the data from FiveThirtyEight's website for the 2016 season thus far. And, because I am most familiar with Excel, I used it to build the graphs and perform a few extra calculations here and there.
Methods
Only games played thus far in the 2016 season have been analyzed (n = 1364).
This analysis looks only at how well the unadjusted (raw) Elo rating predicted the outcome of each game compared to the adjusted rating. There are a couple of ways to tackle this problem:
1) Convert the difference in win probabilities of the two opposing teams into a point spread, and then compare that point spread to the outcome.
or
2) Analyze the % chance of win compared to the outcome.
Because I was unsure about option 1 (mainly the constant used in the formula to convert a difference in Elo ratings to a point spread), I restricted this analysis to option 2.
To do this I took the win expectancy, which is continuous, and converted it to discrete values using bounds 5 percentage points wide, with each bound represented by its midpoint.
Ex:
0-4.9% chance of win = 2.5%
5-9.9% chance of win = 7.5%
10-14.9% chance of win = 12.5%
and so on
All games played to date in the 2016 season were then sorted into these bounds. Then, a binary win or loss value was assigned to each game (W = 1, L = 0). A pivot table in Excel was used to compute the percentage of games won in each bound.
If the Elo system were perfectly calibrated, the games in a given bound would have an actual win percentage equal to that bound's value. Granted, the smaller the sample size, the less likely this is to hold.
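The binning and pivot-table steps described above can be sketched in Python. This is a minimal illustration and not the script used for this post: the function names and the sample games are made up, and treating the lower edge of each bound as inclusive (so exactly 35% lands in 35-40) is an assumption based on the "5-9.9%" style of the examples.

```python
from collections import defaultdict

BIN_WIDTH = 5.0  # bound width in percentage points


def bin_midpoint(win_pct, width=BIN_WIDTH):
    """Map a win expectancy in percent (0-100) to the midpoint of its bound."""
    # Lower edges are inclusive (assumption); 100% is folded into the top bound.
    index = min(int(win_pct // width), int(100 // width) - 1)
    return index * width + width / 2


def win_rate_by_bin(games, width=BIN_WIDTH):
    """games: iterable of (win_pct, won) pairs, with won = 1 for W and 0 for L.

    Returns {midpoint: (actual win percentage, number of games)}.
    """
    totals = defaultdict(lambda: [0, 0])  # midpoint -> [wins, games]
    for win_pct, won in games:
        mid = bin_midpoint(win_pct, width)
        totals[mid][0] += won
        totals[mid][1] += 1
    return {mid: (100.0 * wins / n, n) for mid, (wins, n) in sorted(totals.items())}


# Made-up games for illustration only, not the 2016 data.
sample = [(42.0, 0), (44.9, 1), (41.3, 0), (57.2, 1), (55.0, 1), (58.8, 0)]
table = win_rate_by_bin(sample)
for mid, (actual_pct, n) in table.items():
    print(f"{mid:5.1f}% bound: actual {actual_pct:5.1f}% over n = {n}, "
          f"difference {actual_pct - mid:+.1f}")
```

The last column is the actual-minus-predicted difference, which is the quantity plotted in the chart.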
Results
After subtracting the predicted percentage from the actual percentage, this chart is the result for bounds of 5% (note that excess bounds with no data were deleted). If a bar goes below 0%, fewer games were won than predicted, which suggests that Elo overestimated games won; if a bar is above 0%, Elo underestimated games won. An ideal distribution would have very small bar heights.
Example:
1. Of the 132 games in the 40-45 raw Elo bound, on average 42.5% of them should have been won. According to the chart, only 36.1% of those games were actually won (42.5 + (-6.4) = 36.1).
2. Of the 78 games in the 65-70 adjusted Elo bound, on average 67.5% of them should have been won. According to the chart 67.5325% of those games were actually won.
etc
-------
Labels for each bar cluster indicate the number of total (n) games analyzed for each bound in the format [adjusted n] / [raw n]. Generally speaking, the adjusted Elo seems to be slightly better at prediction, but not by a wide margin for any of the bounds. Furthermore, Elo in general seems to be poor at predicting teams with a very low win probability (25-30%), but given the low sample size (n = 5), this conclusion may be too hasty.
Overall, at least for the data analyzed, the Elo win expectancy lands within about +/- 10 percentage points of the actual win percentage in every bound except 25-30, and it seems to get closer as the sample size increases.
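To put the sample-size caveat in numbers, here is a rough sketch of the sampling noise to expect in a bound even if Elo were perfectly calibrated. It assumes each game in a bound is an independent coin flip at the bound's midpoint probability, which is a simplification, and the function name is my own.

```python
import math


def binomial_standard_error(expected_pct, n):
    """One standard error, in percentage points, of the observed win
    percentage over n games that each have an expected_pct chance of a win."""
    p = expected_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Bound sizes taken from the text: n = 5 for 25-30 and n = 132 for 40-45 (raw).
for label, expected_pct, n in [("25-30", 27.5, 5), ("40-45", 42.5, 132)]:
    se = binomial_standard_error(expected_pct, n)
    print(f"{label} bound (n = {n}): +/- {se:.1f} points (one standard error)")
```

Under these assumptions a five-game bound can easily miss by about 20 points through chance alone, while the 132-game bound has a standard error of a little over 4 points.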
Limitations
One of the glaring limitations of this method is that the adjustment applied to the raw Elo rating does not change the win probability very much. As a result, the number of times an adjusted rating actually changes the win percentage category (e.g. a 30-35% chance of a win changing to a 35-40% chance) is low. In fact, of the 1364 games scored at the time of writing, only 10 games managed to change categories.
"Well, why don't you just decrease the bound size then, guy?" I hear you asking.
The astute observer would realize that this would lead to a smaller sample size per category, and thus less confident answers. The question becomes: can this be optimized?
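The trade-off can be explored by counting how many games change category at different bound widths. The sketch below uses a handful of invented (raw, adjusted) win percentages rather than the real 2016 data, and the helper names are hypothetical.

```python
def bound_index(win_pct, width):
    """Index of the bound containing win_pct; lower edges inclusive (assumption)."""
    return min(int(win_pct // width), int(100 // width) - 1)


def count_category_changes(pairs, width):
    """pairs: iterable of (raw_pct, adjusted_pct) win expectancies for one team."""
    return sum(
        1 for raw, adjusted in pairs
        if bound_index(raw, width) != bound_index(adjusted, width)
    )


# Invented (raw, adjusted) win percentages for illustration only.
pairs = [(52.1, 53.0), (47.8, 50.4), (61.0, 59.6), (44.2, 44.9), (56.3, 57.6)]

changes_at_5 = count_category_changes(pairs, 5.0)
changes_at_2_5 = count_category_changes(pairs, 2.5)
print(f"5% bounds: {changes_at_5} of {len(pairs)} games change category")
print(f"2.5% bounds: {changes_at_2_5} of {len(pairs)} games change category")
```

Narrower bounds let more adjustments register as a category change, but each bound then holds fewer games, so the per-bound win percentages get noisier.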
Further Exploration - Thoughts on K
The author realizes that the season is just a little over half complete, and intends to follow up this post with the same methodology at the end of the regular season, perhaps with a bound width of ~2.5% as the sample size increases.
This starts to build on another ripe topic: optimization of the constant k in the Elo equation as the season progresses. k provides the "oomph" behind how much ratings change after games are played; the higher the k value, the more a win or loss affects a team's rating. It is common practice to change k for the playoffs, but what about mid-season?
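As a rough illustration of what k does, here is the textbook Elo update on the usual 400-point logistic scale. This is a generic sketch and not necessarily the exact form FiveThirtyEight uses for MLB (which layers on adjustments such as the starting pitcher); the ratings and k values below are arbitrary.

```python
def expected_score(rating, opponent_rating):
    """Textbook Elo win expectancy for the team rated `rating`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def updated_rating(rating, opponent_rating, won, k):
    """New rating after one game; won is 1 for a win and 0 for a loss."""
    return rating + k * (won - expected_score(rating, opponent_rating))


# Two evenly matched teams: the winner gains k / 2 points.
for k in (2, 4, 8):  # arbitrary k values for illustration
    new_rating = updated_rating(1500.0, 1500.0, 1, k)
    print(f"k = {k}: 1500.0 -> {new_rating:.1f}")
```

Doubling k doubles the rating swing from any single game, which is why the choice of k matters more early in a season, when each team has played few games.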
Is it correct to assume that k should be constant for a whole season? What happens if k is changed, and are there ways to achieve more accurate predictions by doing so?
Questions for another day.
First, what an insightful post.
A quick question though:
Your range goes from 30-35, 35-40, by fives.
When the percentage to win was exactly on the cutoff point, which 5% increment did you include that win as?
Say, if the percentage to win was 35%, did you include it in the 30-35, or 35-40 range?