MLB Season Props – Analytics.Bet

“To the extreme, I rock a mic like a vandal
Light up a stage and wax a chump like a candle” – Vanilla Ice

“Any player scores 49+ runs?”

“Any player has 77+ hits?”

“Any player hits 20+ home runs?”

How do we price these “any player” props, adjusting for the fact that this year’s MLB season is 60 games instead of the standard 162?

The natural inclination is to take the expectation for a typical season’s league leading total, and prorate it by a factor of 60/162. That would be wrong, for reasons that were well articulated by one of my Twitter followers:

Pretty much every year, there’s someone early in the season who is on pace to hit .400 / 80 home runs / 180 RBI / etc. They never keep it up through the entire season, because random variance decreases as sample size increases. These guys aren’t 80 home run hitters, they’re 50 home run hitters with a run of favourable variance that put them, temporarily, on an 80 home run pace. There are an equal number of 50 home run hitters with a run of unfavourable variance that put them, temporarily, on a 20 home run pace – but they’re not at the top of the leaderboard so we don’t care so much about them. The bets in question are all about extreme values, and shorter seasons mean more extreme values.
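To put rough numbers on that argument, here is a minimal Python sketch. It assumes a hypothetical hitter whose true talent is 50 home runs per 600 at-bats and treats every at-bat as an independent trial. The 600 and 222 at-bat figures and the function names are illustrative assumptions, not something taken from the tweet.

```python
import math

def binom_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p), computed in log space for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

p_hr = 50 / 600  # assumed true talent: 50 HR per 600 at-bats

# Home runs needed after 60 games to be "on pace" for 80 over 162 games.
pace_target_60 = math.ceil(80 * 60 / 162)

# Assumed workloads: 222 at-bats through 60 games, 600 over a full season.
short = prob_at_least(pace_target_60, 222, p_hr)
full = prob_at_least(80, 600, p_hr)

print(f"P(on an 80 HR pace after 60 games): {short:.4%}")
print(f"P(actually hitting 80 over 162 games): {full:.6%}")
```

Under these assumptions the same hitter is far more likely to be on an 80 home run pace after 60 games than to finish a full season with 80, which is the tweet's point in numeric form.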

So here’s my modeling approach to this problem. Yours may be different, possibly even better…but we’re talking about props that are mostly only available at 5Dimes and with a $250 limit, so I’m going “quick & dirty”…meaning that I’m taking some shortcuts to sacrifice a little bit of predictive accuracy to save a lot of time. I built this model in approximately 2 hours.

Let’s pause here to go over some theory, to lay the foundation for what we’re about to do.

Suppose you have 50 shooters each taking 20 shots at a target, and that each shot hits the bullseye with 10% probability.

For any given shooter, the probability of under 5.5 bullseyes can be calculated using the Binomial distribution. In Excel this is =BINOM.DIST(5.5,20,0.1,TRUE) = 98.9%.

Assuming the shooters are independent of each other, the probability that ALL 50 shooters go under 5.5 bullseyes is =BINOM.DIST(5.5,20,0.1,TRUE)^50 = 56.8%.

The opposite of that, 100% – 56.8% = 43.2%, is the probability that not all shooters go under 5.5 bullseyes; that is, the probability that at least one shooter goes over 5.5.

The probability that the top score is exactly 6 bullseyes is the probability that at least one shooter goes over 5.5 minus the probability that at least one shooter goes over 6.5. In Excel, =BINOM.DIST(6.5,20,0.1,TRUE)^50-BINOM.DIST(5.5,20,0.1,TRUE)^50 = 31.95%.

We can use this method to get the complete distribution of the top score:
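For readers who would rather not use a spreadsheet, the shooter example can be sketched in Python. This is only an illustration of the same arithmetic; the function names are made up for this sketch, and binom_cdf is meant to play the role of Excel's BINOM.DIST(..., TRUE).

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    k = min(int(k), n)
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def top_score_pmf(x, n, p, z):
    """P(the best of z independent shooters scores exactly x)."""
    return binom_cdf(x, n, p)**z - binom_cdf(x - 1, n, p)**z

n_shots, p_hit, shooters = 20, 0.10, 50

# Complete distribution of the top score, from 0 to 20 bullseyes.
dist = {x: top_score_pmf(x, n_shots, p_hit, shooters) for x in range(n_shots + 1)}

for x, prob in dist.items():
    if prob >= 0.0005:  # skip outcomes that round to 0.0%
        print(f"top score {x:2d}: {prob:6.2%}")
```

The entry for a top score of 6 matches the 31.95% computed above, and the probabilities sum to 1 because the terms telescope.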

So this is the functional form of the model we’re going to use in our study of MLB props:

P(top score of x) = BINOM.DIST(x+0.5,n,p,TRUE)^z – BINOM.DIST(x-0.5,n,p,TRUE)^z

Where n is the number of at-bats, p is the probability of getting a run/hit/home run/whatever in each at-bat, and z is the number of “contenders” who each play a full season of n at-bats and each have success probability p. Think of n as the 20 shots per shooter, p as the 10% bullseyes per shot, and z as the 50 shooters.

Of all of these components, z is the toughest one to wrap my head around. Just like in the shooter example, the model assumes that everyone has the same number of attempts and that everyone has the same probability of success. There are hundreds of MLB players, but only a small number of those will play the full season AND are skilled enough to contend for the league lead in whatever category is in question. Assuming that each contender is equally skilled, rather than ranked in some kind of hierarchy, is one of those “quick & dirty” shortcuts.

Because z is unknown and can vary from season to season, we’re going to treat it as a random variable with its own distribution. I’m going to start with a pool of 50 potential players and pick a subset of those 50 using another binomial distribution – but truncated because z cannot be 0. The subset will depend on two parameters, a “survival” parameter to indicate the probability of playing a full season and a “skill” parameter to indicate the size of the subset that is skilled enough to contend for the league lead in the stat in question.

P(z) = BINOM.DIST(z,50,survival*skill,FALSE) / (1-BINOM.DIST(0,50,survival*skill,FALSE))

For the survival parameter – in 2019, 101 players played 140 or more games. If we assume that the average AL team has 6 full time starting hitters and the average NL team has 5, we can estimate the survival parameter as 101/(30*5.5) = 0.612, so around 61% of full time starters will go the entire year without a significant injury.
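A small Python sketch of the truncated binomial for z, using the survival estimate above. The skill value here is only a placeholder (it happens to be the hits figure reported later in the article), and contender_pmf is a name invented for this sketch.

```python
from math import comb

POOL = 50  # size of the pool of potential players

def contender_pmf(z, q, pool=POOL):
    """P(z contenders) for a Binomial(pool, q) truncated so that z >= 1."""
    if z < 1 or z > pool:
        return 0.0
    untruncated = comb(pool, z) * q**z * (1 - q)**(pool - z)
    p_zero = (1 - q)**pool  # probability mass removed by the truncation
    return untruncated / (1 - p_zero)

survival = 101 / (30 * 5.5)  # 101 full-season players over 30 teams * 5.5 starters
skill = 0.118                # placeholder value, assumed for illustration
q = survival * skill

dist = [contender_pmf(z, q) for z in range(1, POOL + 1)]
mean_z = sum(z * pz for z, pz in zip(range(1, POOL + 1), dist))

print(f"survival = {survival:.3f}")
print(f"mean number of contenders = {mean_z:.2f}")
for z in range(1, 9):
    print(f"P(z = {z}) = {dist[z - 1]:.3f}")
```

Because z = 0 is excluded, the mean of the truncated distribution sits slightly above the untruncated 50 * survival * skill.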

We’re also going to assume that a full season consists of 600 at-bats.

So our complete model is:

P(top score of x, conditional on z) = BINOM.DIST(x+0.5,600,p,TRUE)^z – BINOM.DIST(x-0.5,600,p,TRUE)^z

P(z) = BINOM.DIST(z,50,0.612*skill,FALSE) / (1-BINOM.DIST(0,50,0.612*skill,FALSE))

This can be set up easily in a spreadsheet by having each column represent a different value of z and each row represent a different value of x.
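The same rows-by-columns layout can be sketched in Python: compute the single-contender CDF for every x, raise it to each z, weight by P(z), and difference the result. This is a rough equivalent of the spreadsheet rather than a copy of it; the p and skill values are the hits figures reported later in the article, used here purely as example inputs, and all function names are assumptions.

```python
import math

def binom_cdf_table(n, p):
    """cdf[x] = P(X <= x) for X ~ Binomial(n, p), for x = 0..n."""
    total, table = 0.0, []
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_pmf)
        table.append(min(total, 1.0))
    return table

def contender_weights(q, pool=50):
    """P(z) for z = 1..pool under a zero-truncated Binomial(pool, q)."""
    raw = [math.comb(pool, z) * q**z * (1 - q)**(pool - z) for z in range(1, pool + 1)]
    norm = sum(raw)  # equals 1 - P(z = 0)
    return [w / norm for w in raw]

def top_score_distribution(n, p, q, pool=50):
    """pmf[x] = P(top score == x), averaged over the number of contenders z."""
    cdf = binom_cdf_table(n, p)
    weights = contender_weights(q, pool)
    # mixed[x] = sum over z of P(z) * F(x)^z, i.e. P(top score <= x)
    mixed = [sum(w * f**z for z, w in enumerate(weights, start=1)) for f in cdf]
    return [mixed[0]] + [mixed[x] - mixed[x - 1] for x in range(1, n + 1)]

# Example inputs (assumed): 600 at-bats, hits parameters from later in the article.
pmf = top_score_distribution(n=600, p=0.336, q=0.612 * 0.118)

running, median = 0.0, None
for x, prob in enumerate(pmf):
    running += prob
    if running >= 0.5:
        median = x
        break

print(f"median top score: {median}")
```

With these inputs the median should land close to the full-season figure quoted in the hits section, though rounding in the published parameters can shift it slightly.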

This leaves us with two unknown parameters: p and skill. How do we fit them? I used a method called “maximum likelihood estimation” where I use Excel Solver to find the set of parameters that provides the best fit to a set of historical data. For my data set I’m using the last 12 years of league leaders, 2008-2019, roughly corresponding to the post-steroid era in MLB.
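For anyone who wants to try the fit outside Excel, here is a hedged Python sketch of the log-likelihood being maximized, using the hits history from later in the article as the data set. The coarse grid search is a stand-in for Solver, not the author's procedure, and the grid bounds are arbitrary. The likelihood surface can be fairly flat along a ridge (more contenders with a lower p behaves much like fewer contenders with a higher p), so the grid optimum need not match the published parameters exactly.

```python
import math

# League-leading hit totals, 2019 back to 2008, copied from the article's table.
HITS_LEADERS = [206, 192, 213, 216, 205, 225, 199, 216, 213, 214, 225, 213]

def log_likelihood(p, skill, leaders, n=600, survival=0.612, pool=50):
    """Log-likelihood of observed league-leading totals under the top-score model."""
    q = survival * skill
    # Unnormalised zero-truncated binomial weights for z = 1..pool contenders.
    weights = [math.comb(pool, z) * q**z * (1 - q)**(pool - z) for z in range(1, pool + 1)]
    norm = sum(weights)
    # cdf[x] = P(X <= x) for a single contender, X ~ Binomial(n, p).
    cdf, running = [], 0.0
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        running += math.exp(log_pmf)
        cdf.append(min(running, 1.0))

    def top_cdf(x):
        """P(top score <= x), averaged over z."""
        if x < 0:
            return 0.0
        return sum(w * cdf[x]**z for z, w in enumerate(weights, start=1)) / norm

    # The 1e-300 floor only guards against log(0) for absurd parameter values.
    return sum(math.log(max(top_cdf(x) - top_cdf(x - 1), 1e-300)) for x in leaders)

# Coarse grid search (assumed bounds): p from 0.326 to 0.346, skill from 0.04 to 0.30.
grid = [(p / 1000, s / 100) for p in range(326, 347, 2) for s in range(4, 31, 2)]
best_p, best_skill = max(grid, key=lambda ps: log_likelihood(ps[0], ps[1], HITS_LEADERS))

print(f"grid optimum: p = {best_p:.3f}, skill = {best_skill:.2f}")
```

A finer grid or a proper optimizer would tighten the estimate; the point of the sketch is only to show what "best fit" means here.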

Once we have estimated p and skill, the last thing we have to do is convert the model from a 162 game season to a 60 game season:

  • The number of at-bats changes from 600 to 600*60/162 = 222.
  • Because it’s easier to survive a short season than to survive a long season, the injury risk is scaled down by a factor of 60/162, which changes the survival parameter from 0.612 to 1 – (1 – 0.612)*60/162 = 0.856.
  • p and skill are unchanged.

P(top score of x, conditional on z, 60 game season) = BINOM.DIST(x+0.5,222,p,TRUE)^z – BINOM.DIST(x-0.5,222,p,TRUE)^z

P(z, 60 game season) = BINOM.DIST(z,50,0.856*skill,FALSE) / (1-BINOM.DIST(0,50,0.856*skill,FALSE))
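The conversion arithmetic, as a short Python sketch. It treats the injury risk (one minus survival) as scaling linearly with the number of games, which is an interpretation on my part that reproduces the 0.856 figure; the constant names are made up for this sketch.

```python
FULL_GAMES, SHORT_GAMES = 162, 60
FULL_AT_BATS, FULL_SURVIVAL = 600, 0.612

scale = SHORT_GAMES / FULL_GAMES

# 600 * 60/162 = 222.2; truncated to a whole number of at-bats.
short_at_bats = int(FULL_AT_BATS * scale)

# Assumed: injury risk shrinks in proportion to games played.
short_survival = 1 - (1 - FULL_SURVIVAL) * scale

print(f"at-bats: {FULL_AT_BATS} -> {short_at_bats}")
print(f"survival: {FULL_SURVIVAL:.3f} -> {short_survival:.3f}")
```

A compounding assumption (0.612 raised to the power 60/162) would give a somewhat lower survival figure, so the linear reading is the one consistent with the numbers used here.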

Ready? Let’s go.

All lines are from 5Dimes as of July 21. Note: If you bet these, it’s at your own risk. I’m confident in this model but it’s not perfect, it’s quick & dirty. Feel free to agree or disagree with it, feel free to bet it or not bet it. I’m not a tout, I’m just a math guy.

Hits

Line: Any player has 77+ hits -145 / No player has 77+ hits +115

12 year history:

2019 206
2018 192
2017 213
2016 216
2015 205
2014 225
2013 199
2012 216
2011 213
2010 214
2009 225
2008 213
Average 211.4
Average prorated to 60 games 78.3
2019 leader after 60 games 80

See how the 2019 leader after 60 games had more than the prorated average? That’s evidence of the “fewer games = more variance” phenomenon described in the tweet at the top of this article. It’s going to be a recurring theme as we go. The fact that this prop is lined at 76.5, even below the prorated average (which we know is already too low) is a good sign.
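The naive proration is easy to check from the table. A quick Python sketch, with the data copied from the 12 year history above and variable names chosen for illustration:

```python
# League-leading hit totals by season, from the table above.
hits_leaders = {
    2019: 206, 2018: 192, 2017: 213, 2016: 216, 2015: 205, 2014: 225,
    2013: 199, 2012: 216, 2011: 213, 2010: 214, 2009: 225, 2008: 213,
}

average = sum(hits_leaders.values()) / len(hits_leaders)
prorated = average * 60 / 162  # naive scaling to a 60 game season
leader_after_60_in_2019 = 80   # from the table above

print(f"12 year average: {average:.1f}")
print(f"prorated to 60 games: {prorated:.1f}")
print(f"2019 leader after 60 games beat the prorated average by "
      f"{leader_after_60_in_2019 - prorated:.1f} hits")
```

One season is a single data point, so this is suggestive rather than conclusive, but the gap is in the direction the variance argument predicts.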

Fitting our model gives:

  • Skill = 0.118. This is a pretty low number, it means that on average there are 50 * 0.612 * 0.118 = 3.6 contenders for the league lead in hits each year. This makes sense because hits are not very random over the course of a full season – it’s very unlikely for a bad hitter or even an average hitter to have a stretch of good luck that’s enough to lead the league in hits for a full year.
  • p = 0.336. This means that each contender will average 0.336 hits per at-bat.

For the full year, we get a distribution like this:

The median is 212, seems in line with the historical numbers, so far so good.

Converting to a 60 game schedule by making the adjustments to the parameters for total at-bats and survival as described above:

The median is 82.5 – that’s where I would put the number if I were the bookie.

Over 76.5 has a projected probability of 87.2%, which at -145 odds yields a tidy +47% EV.
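Both numbers can be sanity-checked with a short Python sketch. It plugs the rounded hits parameters (p = 0.336, skill = 0.118) into the 60 game version of the model and converts American odds to expected value per unit staked. Because the published parameters are rounded, the model probability may differ from 87.2% in the last digit or so; the function names are assumptions made for this sketch.

```python
import math

def prob_any_player_reaches(target, n, p, q, pool=50):
    """P(at least one contender records >= target successes in n at-bats)."""
    # f = P(X <= target - 1) for a single contender, X ~ Binomial(n, p).
    f = 0.0
    for k in range(target):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        f += math.exp(log_pmf)
    f = min(f, 1.0)
    # Zero-truncated Binomial(pool, q) weights for z = 1..pool contenders.
    raw = [math.comb(pool, z) * q**z * (1 - q)**(pool - z) for z in range(1, pool + 1)]
    all_under = sum(w * f**z for z, w in enumerate(raw, start=1)) / sum(raw)
    return 1 - all_under

def expected_value(prob_win, american_odds):
    """Expected profit per 1 unit staked at the given American odds."""
    if american_odds < 0:
        profit = 100 / -american_odds  # e.g. -145 pays 100 for every 145 risked
    else:
        profit = american_odds / 100
    return prob_win * profit - (1 - prob_win)

model_prob = prob_any_player_reaches(target=77, n=222, p=0.336, q=0.856 * 0.118)
ev = expected_value(0.872, -145)

print(f"model P(any player has 77+ hits): {model_prob:.1%}")
print(f"EV at -145 with an 87.2% win probability: {ev:+.0%}")
```

The same expected_value helper applies to every prop below: swap in the model probability and the posted odds.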

Runs

Line: Any player scores 49+ runs -150 / No player scores 49+ runs +120

12 year history:

2019 135
2018 129
2017 137
2016 123
2015 122
2014 115
2013 126
2012 129
2011 136
2010 115
2009 124
2008 125
Average 126.3
Average prorated to 60 games 46.8
2019 leader after 60 games 53

Skill = 0.185. Runs are a little more random than hits, as is to be expected because scoring runs has a dependency on your teammates to drive you in.

p = 0.191 runs per at-bat.

Median, full season: 126.5 runs.

Median, 60 game season: 50.5 runs.

Probability of over 48.5 runs: 71.2%. At -150 odds, that’s a +19% EV.

Home Runs

Line: Any player hits 20+ home runs -125 / No player hits 20+ home runs -105.

This is a tougher one because there’s been a substantial variation in the home run rate over the past few years due to the composition of the ball and some physics stuff that is way beyond my comprehension (google it). I’m going to run my model as normal, ignoring the ball stuff, but interpret the results with a big grain of salt.

12 year history:

2019 53
2018 48
2017 59
2016 47
2015 47
2014 40
2013 53
2012 44
2011 43
2010 54
2009 47
2008 48
Average 48.6
Average prorated to 60 games 18.0
2019 leader after 60 games 22

Skill = 0.163

p = 0.069 HR per at-bat

Median, full season: 48.5 HR

Median, 60 game season: 20.5 HR

Probability of over 19.5 HR: 62.7%. At -125 that’s a +13% EV. If you think we’re getting the 2019 juiced ball, it’s better than that.

RBI

Line: Any player records 48+ RBI -150 / No player records 48+ RBI +120

Technically, you can’t use our binomial-based model for RBI because you can get multiple of them at one time. Practically, I don’t think it makes a significant difference – so we press on.

12 year history:

2019 126
2018 130
2017 132
2016 133
2015 130
2014 116
2013 138
2012 139
2011 126
2010 126
2009 141
2008 146
Average 131.9
Average prorated to 60 games 48.9
2019 leader after 60 games 53

Skill = 0.143

p = 0.203 RBI per at-bat.

Median, full season: 132 RBI

Median, 60 game season: 52.5 RBI.

Probability of over 47.5 RBI: 88.8%. At -150 that’s a whopping +48% EV.

Stolen Bases

Line: Any player steals 19+ bases -145 / No player steals 19+ bases +115

Again, the binomial model isn’t totally correct because stolen bases aren’t a subset of at-bats. Again, we don’t care. Again, we press on. Quick & dirty, my friends.

12 year history:

2019 46
2018 45
2017 60
2016 62
2015 58
2014 64
2013 52
2012 49
2011 61
2010 68
2009 70
2008 68
Average 58.6
Average prorated to 60 games 21.7
2019 leader after 60 games 21

What I’m more concerned about in applying this model is the general decreasing trend in stolen bases from the Rickey Henderson era through the Moneyball era to the present. So, proceed with extreme caution.
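Even within the 12 year window, the table hints at that drift. A rough Python check that splits the history into two halves (the split point is an arbitrary choice for illustration, and six seasons per half is a very small sample):

```python
# League-leading stolen base totals by season, from the table above.
sb_leaders = {
    2019: 46, 2018: 45, 2017: 60, 2016: 62, 2015: 58, 2014: 64,
    2013: 52, 2012: 49, 2011: 61, 2010: 68, 2009: 70, 2008: 68,
}

recent = [total for year, total in sb_leaders.items() if year >= 2014]
earlier = [total for year, total in sb_leaders.items() if year < 2014]

recent_avg = sum(recent) / len(recent)
earlier_avg = sum(earlier) / len(earlier)

print(f"2008-2013 average: {earlier_avg:.1f}")
print(f"2014-2019 average: {recent_avg:.1f}")
```

A model fitted to all 12 seasons at once treats them as interchangeable, so any genuine decline in the stolen base rate would push its projection too high.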

Skill = 0.022. You don’t win the stolen base title by accident. You basically know before the start of the season which one or two guys will definitely win it if they stay healthy. The number of skilled “contenders” is quite low.

p = 0.096 SB per at-bat.

Median, full season: 59 SB.

Median, 60 game season: 22.5 SB.

Probability of over 18.5: 81.6%. At -145 this would be a +38% EV. But I’m passing on this one. With a skill parameter this low, the “randomness” impact that gives us our edge is diminished. The model is giving an edge purely because it’s assuming no change in the stolen base rate over the past 12 years, where the book is assuming there is a change. I agree with the book on this one.

Doubles

Line: Any player hits 21+ doubles -150 / No player hits 21+ doubles +120

12 year history:

2019 58
2018 51
2017 56
2016 48
2015 45
2014 53
2013 55
2012 51
2011 48
2010 49
2009 56
2008 54
Average 52.8
Average prorated to 60 games 19.6
2019 leader after 60 games 21

Skill = 0.358. This is where the randomness really starts to get cranked up. There are players who specialize in home runs, stolen bases, even singles…but not really players who specialize in doubles. Any good hitter has a chance to lead the league in doubles in any given year.

p = 0.070 doubles per at-bat.

Median, full season: 52 doubles.

Median, 60 game season: 22.5 doubles.

Probability of over 20.5 doubles: 79.2%. At -150 this is a +32% EV.

Triples

Line: Any player hits 5+ triples -215 / No player hits 5+ triples +170.

12 year history:

2019 10
2018 12
2017 14
2016 11
2015 15
2014 12
2013 11
2012 15
2011 16
2010 14
2009 13
2008 19
Average 13.5
Average prorated to 60 games 5.0
2019 leader after 60 games 8

Skill = 0.258. More random than home runs, less random than doubles. If you want to lead the league in triples, you’re going to need some wheels.

p = 0.015 triples per at-bat.

Median, full season: 13.5 triples.

Median, 60 game season: 6.5 triples.

Probability of over 4.5 triples: 94.2%. At -215 that’s a +38% EV.

Now, before you rush to bet these, we should discuss some of the limitations of this model and why it may not be a perfect representation of the real world.

Things not considered in the model, that could HURT the overs:

  • A return to the 2018 “dead” ball, especially for home runs / runs / RBI.
  • The strain of a compressed schedule with very little training camp may lead to more injuries.
  • COVID may lead to more games lost due to illness, whether from COVID itself or cold/flu.
  • If a team is out of playoff contention or has clinched a playoff spot, it may sit the regulars for the last few games. In a 60 game schedule that would be a meaningful percentage of the season that would be lost.

Things not considered in the model, that could HELP the overs:

  • A return to the 2019 “juiced” ball, especially for home runs / runs / RBI.
  • DH in the National League.
  • A compressed schedule means that each game is more meaningful, so players may get fewer rest days.
  • The number of “contenders” was assumed to be fixed, but it may be larger in a shorter season due to random variance. Maybe an average hitter CAN lead the league in hits, etc.
  • Because we’re dealing in extreme values, there is an element of antifragility that you get with betting overs. Meaning, any kind of unexpected situation or event will tend to cause the extreme values to be more extreme, rather than less extreme. You can never tell what these surprises are going to be ahead of time, but it’s a good spot to be in, knowing that they are more likely to help you than hurt you!

All in all, I think the “things that could help” at least balance, if not outweigh, the “things that could hurt”.

We’ll check back during and after the season to see how we do on these. If you bet them, good luck!

 

Copyright in the contents of this blog is owned by Plus EV Sports Analytics Inc. and all related rights are reserved thereto.