Tuesday, April 17, 2007

It’s True Because it’s Science!

I was very interested to read about this "mathematician who applies math to real-life situations." Imagine that, applying math to the real world. Such a novel idea almost makes me forget that the field of applied mathematics already exists. Anyway, Bruce Bukiet's results in this study are silly enough to confirm that there are flaws inherent in the projection. I can just use some simple baseball truths to dispute his conclusion that the Yankees should be projected to win 110 games this year and “dominate baseball.”

Truth 1: There haven’t been any truly dominant baseball teams in the last 5 years, and since the Yankees didn’t acquire too many former MVPs last offseason to stash on the bench, there is no reason to think their flawed roster will suddenly be dominant. From 2002 until now, only one team has won 105 games, and that was the 2004 Cardinals. Parity is up (comprehensive statistical analysis not shown), which means there are fewer easy teams to play and there are no flawless teams that you would expect to dominate.

Truth 2: It is indeed silly to try to predict exactly who will be injured when, but projecting everyone to stay healthy is silly as well. At the very least one should average how many games a player has played per year when trying to project a future season from that player. Red Sox management didn’t say “who knows if Pedro will get hurt, let’s sign him up for 5 years.” No, they did the best projection they could based on medical history. I could come up with a statistical study saying the Cubs will win 90 games this year, but how relevant would a projection be that assumes Mark Prior starts 35 games and posts a 2.00 ERA? Just because an entirely healthy Yankees roster producing at 100% could win 110 games (no, probably not) doesn’t mean they are likely to do so this year.

Truth 3: Hitting is 5-10% more important than pitching, but you’re not going to win 110 games without strong starting pitching. The Yankees have a dynamic enough lineup that you could say they have enough hitting to win 110, but let’s look at the rotations of the last few 105-game winners to see what Pavano et al. have to live up to.

2004 Cardinals:
Jason Marquis: 32 starts, 201 1/3 innings
Matt Morris: 32 starts, 202 innings
Jeff Suppan: 31 starts, 188 innings
Woody Williams: 31 starts, 189 2/3 innings
Chris Carpenter: 28 starts, 182 innings

2001 Mariners:
Freddy Garcia: 34 starts, 238 2/3 innings
Jamie Moyer: 33 starts, 209 2/3 innings
Aaron Sele: 33 starts, 215 innings
Paul Abbott: 27 starts, 163 innings
John Halama: 17 starts, 110 1/3 innings (including relief)
Joel Pineiro: 11 starts, 75 1/3 innings (including relief)

1998 Braves:
Greg Maddux: 34 starts, 251 innings
Tom Glavine: 33 starts, 229 1/3 innings
Denny Neagle: 31 starts, 210 1/3 innings
Kevin Millwood: 31 starts, 174 1/3 innings
John Smoltz: 26 starts, 167 2/3 innings

1998 Yankees:
Andy Pettitte: 32 starts, 216 1/3 innings
David Cone: 31 starts, 207 2/3 innings
David Wells: 30 starts, 214 1/3 innings
Hideki Irabu: 28 starts, 173 innings
Orlando Hernandez: 21 starts, 141 innings

Just for kicks I’ll also include data from the first 114 games played by the star-crossed 1994 Expos who were on pace to win 105 games before the strike doomed them:
Ken Hill: 23 starts, 154 2/3 innings (on pace for 220 ip)
Pedro Martinez: 23 starts, 144 2/3 innings (on pace for 205 ip)
Jeff Fassero: 21 starts, 138 2/3 innings (on pace for 197 ip)
Kirk Rueter: 20 starts, 92 1/3 innings
Butch Henry: 15 starts, 107 1/3 innings (including relief)
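The "on pace for" figures above are a straight scaling of innings through 114 team games up to a full schedule. Here is a minimal sketch of that arithmetic, assuming a 162-game season; the helper names `innings` and `full_season_pace` are mine, made up for illustration.

```python
from fractions import Fraction

FULL_SEASON_GAMES = 162  # assumption: a normal full schedule


def innings(whole, thirds=0):
    """Convert baseball notation like '154 2/3' into an exact number."""
    return Fraction(whole) + Fraction(thirds, 3)


def full_season_pace(ip_so_far, team_games_played, season_games=FULL_SEASON_GAMES):
    """Scale innings pitched so far up to a full season at the same rate."""
    return ip_so_far * season_games / team_games_played


# Innings through the Expos' first 114 games, taken from the list above.
expos_1994 = {
    "Ken Hill": innings(154, 2),
    "Pedro Martinez": innings(144, 2),
    "Jeff Fassero": innings(138, 2),
}

paces = {name: float(full_season_pace(ip, 114)) for name, ip in expos_1994.items()}
for name, pace in paces.items():
    print(f"{name}: {pace:.1f} ip over a full season")
```

Hill and Fassero come out at about 220 and 197. Pedro lands between 205 and 206, so the 205 above is that figure rounded down.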

Regardless of raw performance (which actually does vary somewhat among all these pitchers), what is consistent here for the juggernaut teams is a rotation that pitches lots and lots of innings. The simplest way to have a rotation shoulder a lot of innings is to keep the original 5 starters healthy, because it’s not easy to find replacement starters who will also give you 7+ innings every time out. In other words, you need almost all your pitchers healthy to have a dominant year. This is generally accomplished by getting really lucky, because no one has yet figured out how to keep starters healthy. The trickle-down effect is known, however: when you have 3 or more starters throwing 200+ innings, it greatly lessens the workload on the bullpen, which helps the quality relievers stay healthy.

To quantify, here are the percentages of these teams’ games started by the core 5 starters (or in the case of the Mariners, the core 6, as both Halama and Pineiro supplied quality starts as a tag team).

2004 Cardinals: 95%
2001 Mariners: 96%
1998 Braves: 96%
1998 Yankees: 88%
1994 Expos: 89%
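For anyone who wants to check those percentages, they are just the core starters' starts added up and divided by team games. A quick sketch, assuming 162 team games for the four full seasons and the 114 games mentioned above for the Expos; the function name `core_share` is made up for illustration.

```python
# Starts by each core starter, copied from the rotation lists above,
# paired with the number of team games (162 is my assumption for full seasons).
core_starts = {
    "2004 Cardinals": ([32, 32, 31, 31, 28], 162),
    "2001 Mariners": ([34, 33, 33, 27, 17, 11], 162),
    "1998 Braves": ([34, 33, 31, 31, 26], 162),
    "1998 Yankees": ([32, 31, 30, 28, 21], 162),
    "1994 Expos": ([23, 23, 21, 20, 15], 114),
}


def core_share(starts, team_games):
    """Percentage of a team's games started by its core rotation."""
    return 100 * sum(starts) / team_games


shares = {team: round(core_share(starts, games))
          for team, (starts, games) in core_starts.items()}

for team, pct in shares.items():
    print(f"{team}: {pct}%")
```

Under those assumptions the rounded results match the list: 95, 96, 96, 88 and 89.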

Let’s consider the Yankees’ theoretical best case rotation for this year:
Mike Mussina: 2 starts, 6 innings, currently on DL
Chien-Ming Wang: 0 starts, 0 innings, currently on DL
Carl Pavano: 2 starts, 11 1/3 innings, currently on DL
Andy Pettitte: 3 starts, 17 innings
Kei Igawa: 2 starts, 10 1/3 innings

So far they’ve started 81% of the Yankees’ games, but that number will go down even further soon. I’ll go ahead and ruin the ending for you: Pettitte might throw 200 innings, but no one else on this list will be close. Also, checking the roster depth charts quickly, it’s evident that the Yankees don’t have a boatload of quality starters and relievers in waiting as replacements.

Conclusion: The Yankees are as flawed as they were last year when the majority of baseball analysts predicted them to dominate baseball and win the World Series, while I wrote this instead. At this point, to win 110 games they’d have to go 105-46 the rest of the season (nearly a 0.700 win percentage), which absolutely won’t happen considering the state of their pitching and that there are at least 8 other possible 90-win teams in the AL. The Yankees will probably slug their way to 95 wins, but 110 is completely out of the question. Tell your statistics to shut up, math guy!
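The 105-46 figure can be worked backwards to see what it assumes. This is a small sketch, assuming a 162-game season; the current record it prints is inferred from the 105-46 number and is not stated anywhere above.

```python
SEASON_GAMES = 162   # assumption: a normal full schedule
TARGET_WINS = 110    # the projection being disputed
FALLBACK_WINS = 95   # the more realistic total suggested in the conclusion

# The rest-of-season record quoted in the conclusion.
wins_needed, losses_allowed = 105, 46

games_left = wins_needed + losses_allowed      # games still to play
games_played = SEASON_GAMES - games_left       # games already in the books
current_wins = TARGET_WINS - wins_needed       # implied wins so far
current_losses = games_played - current_wins   # implied losses so far

required_pct = wins_needed / games_left
fallback_pct = (FALLBACK_WINS - current_wins) / games_left

print(f"Implied record so far: {current_wins}-{current_losses}")
print(f"Pace needed for {TARGET_WINS} wins: {required_pct:.3f}")
print(f"Pace needed for {FALLBACK_WINS} wins: {fallback_pct:.3f}")
```

So the 110-win pace works out to about .695 over the remaining 151 games, while 95 wins only takes about .596.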

Update: I did some more research on Bruce Bukiet, and it turns out he uses the Markov Chain method of statistical prediction, which interestingly enough is one of the very same methods implemented by my advisor here at MIT to model hurricane tracks. This works for hurricanes because they don’t “know” what has happened to them in the past, whereas a baseball player’s body and mind do.

For instance, I once saw a study (reference missing) which simulated a baseball season in which every hitter was given a 27% chance of getting a hit every time at bat. At the end of the year there were many hitters in this simulation who had 0.320+ batting averages and even some well above that. So why don’t we in practice see many (if any) career .270 hitters suddenly win batting titles due to the fluctuations in the final results of a series of somewhat random events? One effect is that hitters inevitably slump eventually, get frustrated, and over that time don’t produce as well as they should. It’s the same reason that many very skilled poker players don’t maximize their win rate: a bad week can make anyone play a little worse. Another effect is that a very hot hitter will be pitched to differently (or not pitched to at all), making it more difficult to maintain his average. Both of these effects, and probably many more, make regression to the mean happen in baseball faster than it statistically should.

This is why I don’t think Markov Chains should be used when trying to predict the activities of individual humans. I like a PECOTA method much better, since it uses raw data produced by humans, not simulated data produced by logic boxes. I also can’t seem to find what kind of injury projections he includes, if any, so I’m not changing the wording of my “Truth 2” paragraph.
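The constant-27% season is easy to try at home. Below is a rough sketch of that kind of simulation, not a reconstruction of the study I saw: the number of hitters, the at-bats per hitter, and the random seed are all my own assumptions, and every at-bat is treated as an independent coin flip, which is exactly the "no memory" simplification I'm complaining about.

```python
import random


def simulate_season(n_hitters=300, at_bats=500, hit_prob=0.27, seed=2007):
    """Give every hitter the same chance of a hit in every at-bat.

    n_hitters, at_bats and seed are illustrative assumptions; only the
    27% hit probability comes from the study described above.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    averages = []
    for _ in range(n_hitters):
        hits = sum(rng.random() < hit_prob for _ in range(at_bats))
        averages.append(hits / at_bats)
    return averages


averages = simulate_season()
league_avg = sum(averages) / len(averages)
over_320 = sum(avg >= 0.320 for avg in averages)

print(f"League average: {league_avg:.3f}")
print(f"Best average:   {max(averages):.3f}")
print(f"Hitters at .320 or better: {over_320}")
```

With 500 at-bats the spread from luck alone is about 20 points of batting average (one standard deviation), so .320 sits roughly two and a half standard deviations above the true .270. Give the hitters fewer at-bats and the lucky outliers get more extreme.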

3 Comments:

Blogger Ben said...

Roger Clemens?

3:19 AM  
Blogger John said...

yeah i was thinking i should include clemens, but eh, i can't project him as more than a 5-6 inning pitcher in the AL East anyway.

12:09 PM  
Blogger John said...

yup, clemens returns.

mussina, pettite, wang, hughes, clemens is alllllright

1:28 AM  
