Here’s an email exchange with Ryan Marston, who shares none of my sports allegiances, but does share my passion for objective sports analysis, and even some of my DNA (we’re first cousins). The initial topic (Madison Bumgarner’s strange outing on June 21) is a bit outdated, but the theories discussed are timeless.
Last night, Madison Bumgarner gave up eight runs on nine hits in 1/3 of an inning. He struck out one (Carl Pavano, trying desperately to record Minnesota’s ninth straight hit to open the game), walked none, and gave up no home runs. That gives him a 0.00 FIP, as the only outcome entirely within his control was a positive one, the strikeout. His ERA, on the other hand, was 216.00, as all eight runs were earned.
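For anyone who wants to check the arithmetic, here is a rough Python sketch of both numbers. The ERA follows directly from the line above. The FIP part uses the commonly published form of the formula. The constant of 3.10 is an assumed placeholder (the real value shifts a little each season), hit batters are assumed to be zero because none are mentioned, and the function names are purely illustrative.

```python
from fractions import Fraction

# Assumed league constant; the real value varies by season.
FIP_CONSTANT = 3.10


def era(earned_runs, innings):
    """Earned runs allowed per nine innings."""
    return 9 * earned_runs / innings


def fip_core(home_runs, walks, hit_batters, strikeouts, innings):
    """The part of FIP built only from HR, BB, HBP and K (before the constant)."""
    return (13 * home_runs + 3 * (walks + hit_batters) - 2 * strikeouts) / innings


# Bumgarner's line: 1/3 inning, 8 earned runs, 1 strikeout, no walks, no homers.
innings = Fraction(1, 3)
bumgarner_era = era(8, innings)                  # 216
bumgarner_core = fip_core(0, 0, 0, 1, innings)   # -6
raw_fip = float(bumgarner_core) + FIP_CONSTANT   # negative with this constant

# Assumption: a displayed FIP cannot go below zero, hence the floor.
displayed_fip = max(0.0, raw_fip)

print(f"ERA: {float(bumgarner_era):.2f}")
print(f"Raw FIP: {raw_fip:.2f}, floored: {displayed_fip:.2f}")
```

With these inputs the raw formula comes out negative, so a reported 0.00 presumably reflects a floor at zero. That floor is an assumption here, not something the formula itself specifies.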
Fangraphs’ WAR calculation is based on FIP, so last night’s outing will actually increase Fangraphs’ perception of his value. Baseball-reference’s calculation uses runs scored, so his bWAR will take a big hit after last night. Obviously, the true measure of Bumgarner’s outing is somewhere between FIP’s estimate, which says he was perfect, and ERA’s, which says he was worse than Alfredo Aceves, who walked five batters in a row with two outs in the second.
How do we bridge this gap? Can we take anything from last night’s events in our effort to establish a single statistic that best estimates a pitcher’s value?
This is an interesting question, and a good example of why there really isn’t (and probably can’t be) a perfect pitching stat: either you put too much weight on true outcomes, or not enough.
If you used xFIP in Bumgarner’s calculation instead of FIP, it would probably be a good start, if I understand it correctly (that it counts fly balls as “possible” home runs, instead of actual homers) since four of the nine balls in play were fly balls. And since three of those four were line drives, there was a good chance of him getting knocked around a little bit regardless of defense or luck.
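To put a rough number on that idea, here is a small Python sketch of the xFIP adjustment: the same formula as FIP, with actual home runs swapped for fly balls multiplied by a league-average HR/FB rate. The 10% rate is an assumed round figure, not a number taken from that season. The fly ball count of four comes from the paragraph above, hit batters are assumed to be zero, and the league constant is left out so that only the adjustment itself is shown.

```python
def xfip_core(fly_balls, walks, hit_batters, strikeouts, innings, league_hr_per_fb):
    """xFIP before the league constant: expected homers replace actual homers."""
    expected_home_runs = fly_balls * league_hr_per_fb
    return (13 * expected_home_runs
            + 3 * (walks + hit_batters)
            - 2 * strikeouts) / innings


# Assumed league-average home-run-per-fly-ball rate (illustrative only).
LEAGUE_HR_PER_FB = 0.10

# Bumgarner's outing: 4 fly balls, 1 strikeout, no walks, 1/3 of an inning.
core = xfip_core(
    fly_balls=4,
    walks=0,
    hit_batters=0,
    strikeouts=1,
    innings=1 / 3,
    league_hr_per_fb=LEAGUE_HR_PER_FB,
)
print(f"xFIP before the constant: {core:.2f}")
```

Under those assumptions the fly balls alone push the figure well above zero, which is the direction you would expect xFIP to move relative to FIP for this outing.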
My best guess as to what would improve the formula would be to temper the true outcomes with potential ones. If we can say how good a pitcher is when everything is under his control (FIP), and we know that when it’s out of his hands, balls in play will land for hits about 30% of the time, why not combine the two? Regress to the mean, so we don’t put too much stock into defense or luck?
In this case, that would mean assuming that only two or three of the balls the Twins put in play against Bumgarner would become hits, instead of all nine. Which may or may not be true; there’s only so much data we can extract from a box score. On a luckier day, it’s possible all nine balls would land in someone’s glove, and we’d be hearing Jon Miller say that “Bumgarner’s perfect through three.” But maybe if we can take a closer look at FB/LD/GB data, instead of just assuming .300 BABIP, we can parse out what kind of day the guy was actually having, and what kind of pitcher he is.
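As a back-of-the-envelope check on that "two or three" figure, here is a minimal Python sketch. It treats each of the nine balls in play as an independent event with a .300 chance of becoming a hit, which is a simplifying assumption (it ignores batted-ball type, defense and everything else a box score can't show).

```python
from math import comb

BABIP = 0.300        # assumed chance that any ball in play falls for a hit
BALLS_IN_PLAY = 9    # the nine balls the Twins put in play


def hit_probability(balls_in_play, hits, babip=BABIP):
    """Binomial probability of exactly `hits` hits, assuming independence."""
    return (comb(balls_in_play, hits)
            * babip ** hits
            * (1 - babip) ** (balls_in_play - hits))


expected_hits = BALLS_IN_PLAY * BABIP
p_all_hits = hit_probability(BALLS_IN_PLAY, 9)   # what actually happened
p_no_hits = hit_probability(BALLS_IN_PLAY, 0)    # the "perfect through three" day

print(f"Expected hits: {expected_hits:.1f}")
print(f"Chance all nine fall in: {p_all_hits:.6f}")
print(f"Chance none fall in:     {p_no_hits:.4f}")
```

Under this model both extremes are unlikely, but nine outs on nine balls in play is far more plausible than nine hits. That is the gap a regressed stat would be trying to account for.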
Great call on xFIP, which adjusts FIP to account for expected home runs based on fly ball rates. Baseball Prospectus’s SIERA takes the concept even further, crediting pitchers for high ground ball rates only if they have high walk rates and rewarding high fly ball rates only if they have high strikeout rates. It’s a much more involved formula, and demands more knowledge than I have of certain types of pitchers and the consequences of their most common outcomes and tendencies, but it inches closer to the potential outcomes you’re looking for.
I agree that no single stat can be perfect because there is no perfect balance between true outcomes and actual outcomes, but SIERA (or Skill-Interactive ERA) seems to be as close as anyone has come. Unfortunately, I can’t find a SIERA leaderboard to tell me what effect the game in question had on Bumgarner’s SIERA (I’m not a BP subscriber).
Your analysis was so spot-on that I think you missed the easy answer, which is not to trust any metric in a one-game sample size. If every pitcher pitched 10,000 innings, ballpark and era factors notwithstanding, I think we would start to see all these percentage metrics lining up. Jon Lester would have a 3.10 ERA with a 3.10 FIP and ERA+ and xFIP and SIERA and whatever else measures run prevention and is scaled to ERA. Joe Blanton would be at 4.25 in every metric and Aneury Rodriguez would be a 4.80 pitcher, as everyone would have the same percentage of ground balls finding gloves and everyone’s HR/FB rate would normalize. Halfway through this year, Bumgarner’s fWAR (which is 2.1, by the way) is far better than his bWAR (0.6). By the end of this season, there will still be a difference, but only because of the severity of his luck (or hittability) in one game. When Bumgarner retires, this game will have had no material effect on his numbers and his ERA and his FIP will be within a few hundredths of each other.
One reason I bring this up is because I’ve tried to sell Season Score as a middle ground between the two WARs. I wish I could point to Bumgarner’s Game Score (a grisly 2) and say that it accurately placed the outing between some pitcher’s slightly worse one and another pitcher’s slightly better one, but Game Score isn’t perfect (and I’m sure Bill James never intended for it to be). It’s not the worst Game Score of the season (Jaime Garcia dropped a -6 in late May, but that took a few more bad innings), but I believe it’s the worst score since that game.
Anyway, if there’s one major shortcoming of Season Score, it’s that it leans too heavily on actual outcomes, and differentiates between various types of success/failures rather arbitrarily. A hit deducts two points while a walk only deducts one, which may make some sense in that a hit could be a double or a homer, but the walk is more likely the pitcher’s fault, and should probably be penalized at least as harshly. I love the extra point for each strikeout, but it pales in comparison to the two-point bonus for finishing any inning after the fourth. An NL pitcher might get an early hook because his spot is due up in the batting order, costing him five points for an inning he could have otherwise completed, and he can’t even make up those five points by adding three Ks to his ledger.
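The weights mentioned above are easy to see in a short Python sketch of Bill James's Game Score. The hit, walk, strikeout and late-inning values come from the paragraph above. The starting value of 50, the point per out, and the run penalties (four per earned run, two per unearned run) are the rest of the standard formula. The function and argument names are just illustrative, and this covers a single start, not how Season Score combines starts.

```python
def game_score(outs, strikeouts, hits, earned_runs, unearned_runs, walks):
    """Bill James's Game Score for a single start."""
    completed_innings = outs // 3
    # Two bonus points for each inning completed after the fourth.
    late_innings = max(0, completed_innings - 4)
    return (50
            + outs
            + 2 * late_innings
            + strikeouts
            - 2 * hits
            - 4 * earned_runs
            - 2 * unearned_runs
            - walks)


# Bumgarner on June 21: one out (a strikeout), nine hits, eight earned runs.
bumgarner = game_score(outs=1, strikeouts=1, hits=9,
                       earned_runs=8, unearned_runs=0, walks=0)
print(bumgarner)  # 2

# The early-hook problem: a clean sixth inning is worth five points,
# while three extra strikeouts on their own add only three.
through_five = game_score(15, 0, 0, 0, 0, 0)
through_six = game_score(18, 0, 0, 0, 0, 0)
print(through_six - through_five)  # 5
```

Run against Bumgarner's line, this reproduces the 2 quoted earlier, and the second comparison shows why a pitcher lifted for a pinch hitter can't recover the lost inning with strikeouts alone.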
I may be reaching here, but are there any lessons Bumgarner’s outing can teach us about the value of Season Score?
I do think Season Score is a valuable middle ground. It obviously can’t tell us everything about a pitcher, but it does a much better job than, say, Quality Starts, at telling us how valuable a pitcher’s starts have been over the course of a long year or career. (Which makes me wonder: have you figured out who the leaders in “Career Score” are?)
Of course, Vin Mazzaro is usually a starter, with a Season Score of -3 over four starts this year. Against the Indians in May he recorded only seven outs on his way to giving up 14 runs, all of them earned. Surely that would have done a number on his Season Score had Kyle Davies not started that day. (Davies’s two runs allowed in the first third of an inning only brought his own score down five points.)
I’m sure there are plenty of examples of the opposite happening: some guy like Livan Hernandez throwing a complete game shutout despite not having his best stuff. Lots of fly outs and line drives into the shortstop’s glove might not be the ideal way to get through a game, but it does the job. Much like a quarterback being penalized for an interception after hitting his stone-handed receiver right on the fingertips, or credited with a touchdown for a terrible throw into traffic, it will always be difficult to assign credit or blame to an individual in any sport. The fact is, over a long season the best and most durable pitchers will find their way to the top of that Season Score list, and I think it’s as good a way as any to measure someone’s successes by combining outcomes that are controllable and predictable with ones that aren’t.
I’m glad you’re on board with Season Score. I’d love to take a look at Career Score, but I think I’d need your help with the technology. Park adjustments and a more accurate replacement level for Season Score will come first.
Mazzaro’s case is another one in which Season Score falls short of telling the whole story, but as Bumgarner’s case teaches us (and as you mention), no stat will ever perfectly measure all of a pitcher’s contributions.
I’ll end this debate with another question, this one more rhetorical than answerable. What’s more valuable in a pitching statistic: accuracy or accessibility? I think we both agree that SIERA is a better measure of a pitcher’s effectiveness than ERA, FIP, Season Score, and probably the two versions of WAR, but that’s based largely on the assumption that the formula works. FIP and FIP+ are conveniently scaled to ERA, but will there ever be a time when the average fan can look at a few box scores and figure out a pitcher’s FIP in his head, the way many fans can with ERA?
I’m more likely to get behind a stat that considers more variables than “earned” runs, gives more credit to pitchers who pitch more innings, and values strikeouts more than other outs, but the more elaborate the formula becomes, the more the general public will dismiss the results. Tell someone that Zack Greinke has been among the best pitchers in the NL this year because he has a great SIERA and you’re bound to hear “yeah, but his ERA is almost 6, idiot”. I’m not saying we need to cater to the unwashed masses, but there’s got to be some value to transparency.
Thanks for taking the time to argue with me. I’ll let the readers take it from here.