Tuesday, April 17, 2007

It’s True Because it’s Science!

I was very interested to read about this "mathematician who applies math to real-life situations." Imagine that, applying math to the real world. Such a novel idea almost makes me forget that the field of applied mathematics already exists. Anyway, Bruce Bukiet's results in this study are silly enough to confirm that there are flaws inherent to the projection. I can just use some simple baseball truths to dispute his conclusion that the Yankees should be projected to win 110 games this year and “dominate baseball.”

Truth 1: There haven’t been any truly dominant baseball teams in the last 5 years, and since the Yankees didn’t acquire too many former MVPs last offseason to stash on the bench, there is no reason to think their flawed roster will suddenly be dominant. From 2002 until now, only one team has won 105 games, and that was the 2004 Cardinals. Parity is up (comprehensive statistical analysis not shown), which means there are fewer easy teams to play and there are no flawless teams that you would expect to dominate.

Truth 2: It is indeed silly to try to predict exactly who will be injured when, but projecting everyone to stay healthy is silly as well. At the very least one should average how many games a player has played per year when trying to project a future season from that player. Red Sox management didn’t say “who knows if Pedro will get hurt, let’s sign him up for 5 years.” No, they did the best projection they could based on medical history. I could come up with a statistical study saying the Cubs will win 90 games this year, but how relevant would a projection be that assumes Mark Prior starts 35 games and posts a 2.00 ERA? Just because an entirely healthy Yankees roster producing at 100% could win 110 games (no, probably not) doesn’t mean they are likely to this year.

Truth 3: Hitting is 5-10% more important than pitching but you’re not going to win 110 games without strong starting pitching. The Yankees have a dynamic enough lineup where you could say they have enough hitting to win 110, but let’s look at the rotations of the last few 105 game winners to see what Pavano et al. have to live up to.

2004 Cardinals:
Jason Marquis: 32 starts, 201 1/3 innings
Matt Morris: 32 starts, 202 innings
Jeff Suppan: 31 starts, 188 innings
Woody Williams: 31 starts, 189 2/3 innings
Chris Carpenter: 28 starts, 182 innings

2001 Mariners:
Freddy Garcia: 34 starts, 238 2/3 innings
Jamie Moyer: 33 starts, 209 2/3 innings
Aaron Sele: 33 starts, 215 innings
Paul Abbott: 27 starts, 163 innings
John Halama: 17 starts, 110 1/3 innings (including relief)
Joel Pineiro: 11 starts, 75 1/3 innings (including relief)

1998 Braves:
Greg Maddux: 34 starts, 251 innings
Tom Glavine: 33 starts, 229 1/3 innings
Denny Neagle: 31 starts, 210 1/3 innings
Kevin Millwood: 31 starts, 174 1/3 innings
John Smoltz: 26 starts, 167 2/3 innings

1998 Yankees:
Andy Pettitte: 32 starts, 216 1/3 innings
David Cone: 31 starts, 207 2/3 innings
David Wells: 30 starts, 214 1/3 innings
Hideki Irabu: 28 starts, 173 innings
Orlando Hernandez: 21 starts, 141 innings

Just for kicks I’ll also include data from the first 114 games played by the star-crossed 1994 Expos who were on pace to win 105 games before the strike doomed them:
Ken Hill: 23 starts, 154 2/3 innings (on pace for 220 ip)
Pedro Martinez: 23 starts, 144 2/3 innings (on pace for 205 ip)
Jeff Fassero: 21 starts, 138 2/3 innings (on pace for 197 ip)
Kirk Rueter: 20 starts, 92 1/3 innings
Butch Henry: 15 starts, 107 1/3 innings (including relief)

Regardless of raw performance (which actually does vary somewhat among all these pitchers), what is consistent here for the juggernaut teams are rotations that pitch lots and lots of innings. The simplest way to have a rotation shoulder a lot of innings is to keep the original 5 starters healthy, because it’s not easy to find replacement starters who will also give you 7+ innings every time out. In other words, you need most all your pitchers healthy to have a dominant year. This is accomplished generally by getting really lucky because no one has yet figured out how to keep starters healthy. The trickle down effect is known however; when you have 3 or more starters throwing 200+ innings then it greatly lessens the workload on the bullpen, which helps the quality relievers stay healthy.

To quantify, here are the percentages of these teams’ games started by the core 5 starters (or in the case of the Mariners, the core 6, as both Halama and Pineiro supplied quality starts as a tag team).

2004 Cardinals: 95%
2001 Mariners: 96%
1998 Braves: 96%
1998 Yankees: 88%
1994 Expos: 89%
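
As a rough check on those percentages, here is a minimal Python sketch that recomputes them from the start totals listed above. It assumes a 162-game schedule for the four full seasons and uses the 114 games the Expos played before the strike; the names `rotations` and `core_start_share` are just illustrative.

```python
# Starts by each core starter, copied from the rotation lists above.
# Second value = team games played (assumed 162 for a full season,
# 114 for the strike-shortened 1994 Expos).
rotations = {
    "2004 Cardinals": ([32, 32, 31, 31, 28], 162),
    "2001 Mariners": ([34, 33, 33, 27, 17, 11], 162),
    "1998 Braves": ([34, 33, 31, 31, 26], 162),
    "1998 Yankees": ([32, 31, 30, 28, 21], 162),
    "1994 Expos": ([23, 23, 21, 20, 15], 114),
}


def core_start_share(starts, games):
    """Percent of team games started by the listed pitchers."""
    return 100.0 * sum(starts) / games


shares = {team: core_start_share(s, g) for team, (s, g) in rotations.items()}

for team, pct in shares.items():
    print(f"{team}: {pct:.0f}%")
```

Rounded to the nearest percent, the results match the figures listed above.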

Let’s consider the Yankees theoretical best case rotation for this year:
Mike Mussina: 2 starts, 6 innings, currently on DL
Chien-Ming Wang: 0 starts, 0 innings, currently on DL
Carl Pavano: 2 starts, 11 1/3 innings, currently on DL
Andy Pettitte: 3 starts, 17 innings
Kei Igawa: 2 starts, 10 1/3 innings

So far they’ve started 81% of the Yankees’ games, but that number will go down even further soon. I’ll go ahead and ruin the ending for you: Pettitte might throw 200 innings, but no one else on this list will be close. Also, checking the roster depth charts quickly, it’s evident that the Yankees don’t have a boatload of quality starters and relievers in waiting as replacements.

Conclusion: The Yankees are as flawed as they were last year when the majority of baseball analysts predicted them to dominate baseball and win the World Series, while I wrote this instead. At this point, to win 110 games they’d have to go 105-46 the rest of the season (nearly a 0.700 win percentage), which absolutely won’t happen considering the state of their pitching and that there are at least 8 other possible 90 win teams in the AL. The Yankees will probably slug their way to 95 wins, but 110 is completely out of the question. Tell your statistics to shut up, math guy!
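
The arithmetic behind that 105-46 figure can be sketched in a few lines of Python. The 162-game schedule is an assumption, and the games played and wins so far are only what the quoted remaining record implies, not numbers looked up separately.

```python
SEASON_GAMES = 162  # assumption: standard full schedule
TARGET_WINS = 110   # the projection being disputed

# Remaining record quoted in the conclusion above.
wins_needed, losses_allowed = 105, 46

games_left = wins_needed + losses_allowed
games_played = SEASON_GAMES - games_left  # implied by the quoted record
wins_so_far = TARGET_WINS - wins_needed   # implied by the quoted record

required_pct = wins_needed / games_left

print(f"games left: {games_left}, games played: {games_played}")
print(f"wins so far: {wins_so_far}")
print(f"required pace: {required_pct:.3f}")
```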

Update: I did some more research on Bruce Bukiet and it turns out he uses the Markov Chain method of statistical prediction, interestingly enough one of the very same methods implemented by my advisor here at MIT to model hurricane tracks. This works for hurricanes because they don’t “know” what has happened to them in the past, whereas a baseball player’s body and mind do. For instance, I once saw a study (reference missing) which simulated a baseball season in which every hitter was given a 27% chance of getting a hit every time at bat. At the end of the year there were many hitters in this simulation who had 0.320+ batting averages and even some well above that. So why don’t we in practice see many (if any?) career .270 hitters suddenly win batting titles due to the fluctuations in the final results of a series of somewhat random events? One effect is that hitters inevitably slump at some point, get frustrated, and over that time don’t produce as well as they should. It’s the same reason that many very skilled poker players don’t maximize their win rate. A bad week can make anyone play a little worse. Another effect is that a very hot hitter will be pitched to differently (or not pitched to altogether), making it more difficult to maintain his average. Both of these effects, and probably many more, make regression to the mean happen in baseball faster than it statistically should. This is why I don’t think Markov Chains should be used when trying to predict the activities of individual humans. I like a PECOTA method much better, since it uses raw data produced by humans, not simulated data produced by logic boxes. I also can’t seem to find what kind of injury projections he includes, if any, so I’m not changing the wording of my “Truth 2” paragraph.
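
For anyone who wants to try the kind of coin-flip season described above, here is a minimal Python sketch. It is not the study I saw (that reference is still missing); the number of hitters, the at-bats per hitter, and the random seed are all made-up assumptions, and every at-bat is treated as an independent 27% chance of a hit.

```python
import random


def simulate_season(n_hitters=300, at_bats=500, hit_prob=0.27, seed=2007):
    """Batting averages for hitters whose every at-bat is an independent
    hit_prob chance of a hit (all parameter values are hypothetical)."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_hitters):
        hits = sum(1 for _ in range(at_bats) if rng.random() < hit_prob)
        averages.append(hits / at_bats)
    return averages


averages = simulate_season()
hot = [a for a in averages if a >= 0.320]

print(f"league average: {sum(averages) / len(averages):.3f}")
print(f"best average:   {max(averages):.3f}")
print(f"hitters at .320 or better: {len(hot)} of {len(averages)}")
```

How many hitters clear .320 depends heavily on the at-bats you give each one: fewer at-bats means wider swings around .270, so it is worth rerunning with different values of `at_bats` and `n_hitters`.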


Sunday, April 08, 2007

Last Year's One Run Anomalies

Statistics suggest that as a rule, over the course of the season teams will go .500 in games decided by one run. Every year however there are some teams that (either due to luck or perhaps an absurdly bad/good bullpen) tend to dramatically overperform or underperform in that category. For example, the Washington Nationals in their first year led their division for almost half a season, and this was largely due to a ridiculously fortunate record in one-run games. As the season progressed, their numbers regressed to the mean (ending at 30-31 in 1-R games) and they returned to the bottom of the division.

So, here are the teams that in 2006 had a record well above or below .500 in one-run games. While it's not a guarantee, keep an eye on them, because they may very well win more or fewer games if their 2007 1-R outcomes come back to earth.

Overperformers:

Toronto 20-10 (0.667)
Boston 29-20 (0.592)
Minnesota 20-11 (0.645)
Oakland 32-22 (0.593)
NY Mets 31-16 (0.660)

Underperformers:

Cleveland 18-26 (0.409)
Kansas City 14-24 (0.368)
Texas 17-26 (0.395)
Atlanta 19-33 (0.365)

Where these results might be felt the most this season is in the NL East. The Braves and the Mets had two of the biggest 1-R disparities last season. If both come back to the mean this season, and considering the Phillies were right around 0.500 last year, we may have quite the divisional race on our hands. The 1-R factor may also come into play in the AL wild card race this season, where many potential playoff teams had aberrational 1-R records.
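
To put a rough number on those disparities, here is a small Python sketch that recomputes each team's one-run winning percentage from the records listed above and shows how many wins the team finished above or below a .500 split of the same games. The helper name `one_run_summary` is just illustrative, and the .500 baseline is the rule of thumb from the top of this post, not a forecast.

```python
# 2006 one-run records (wins, losses), copied from the lists above.
records = {
    "Toronto": (20, 10),
    "Boston": (29, 20),
    "Minnesota": (20, 11),
    "Oakland": (32, 22),
    "NY Mets": (31, 16),
    "Cleveland": (18, 26),
    "Kansas City": (14, 24),
    "Texas": (17, 26),
    "Atlanta": (19, 33),
}


def one_run_summary(wins, losses):
    """Return (winning percentage, wins above a .500 split of those games)."""
    games = wins + losses
    return wins / games, wins - games / 2


summary = {team: one_run_summary(w, l) for team, (w, l) in records.items()}

for team, (pct, surplus) in summary.items():
    print(f"{team:12s} {pct:.3f}  {surplus:+.1f} wins vs .500")
```

By this measure the Mets finished 7.5 wins above an even split of their one-run games and the Braves 7 below, which is the gap that could close if both drift back toward .500.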


Thursday, April 05, 2007

Returning to the Fold (Part I)

A funny thing about the last couple years in MLB was that many of the very best pitchers were either inactive or too ineffective to be considered the ace of their team. All of these guys fit that description and I hope very much that their strong first starts are a sign that they'll be able to return to their peak performance level over the full 2007 season. All of them are very young (except Hudson who is 31) so they still have time to rise to the top and stay there for the foreseeable future. Hopefully I'll be writing a similar blurb for Francisco Liriano after his first start of 2008.

Rich Harden:
Coming Back From: Missing most of last year after an injury-plagued 2005.
Aspires to Be: John Smoltz.
First Start: 7 ip, 3 h, 0 er, 7 k
Best Sign: Looks and feels healthy.

Zack Greinke:
Coming Back From: Uncertain circumstances, but he only pitched a couple of times out of the pen last year after getting hammered throughout 2005. Only the inept team behind him hid how amazing he was as a rookie in 2004.
Aspires to Be: Bret Saberhagen.
First Start: 7 ip, 8 h, 1 er, 7 k
Best Sign: Didn’t give up 10 runs. And he’s still only 23.

Josh Beckett:
Coming Back From: Disappointing first 200 ip year. He's already been at the top in the NL. Now he just needs to figure out how to do it in the AL.
Aspires to Be: Curt Schilling.
First Start: 5 ip, 2 h, 1 er, 5 k
Best Sign: Has finally admitted some willingness to adapt.

Ben Sheets:
Coming Back From: Two straight injury-filled years.
Aspires to Be: Darryl Kile with a better fastball.
First Start: 9 ip, 2 h, 1 er, 3 k
Best Sign: His extremely efficient, Roy Halladay-esque first outing.

Jake Peavy:
Coming Back From: Less effective 2006 season (see earlier post).
Aspires to Be: Kevin Brown circa 1998.
First Start: 6 ip, 3 h, 0 er, 6 k
Best Sign: Getting ahead of hitters and seems to be completely healthy.

Tim Hudson:
Coming Back From: A bad 2006 which followed an unimpressive 2005.
Aspires to Be: Tim Hudson circa 2000.
First Start: 7 ip, 2 h, 1 er, 5 k
Best Sign: Movement on pitches seems to be back.

Felix Hernandez:
Coming Back From: A disappointing first full season (disappointing in that he didn’t win the AL Cy Young at age 20).
Aspires to Be: Dwight Gooden or Pedro Martinez with more stamina.
First Start: 8 ip, 3 h, 0 er, 12 k
Best Sign: Has come back with stronger determination and physique. His stuff belongs in some league a level above MLB. Despite being brought along slowly given his minor league domination, he's still only 20 years old entering his third MLB year.
