Introducing Season Scores

When evaluating hitters, there are a few great metrics we can use to determine which hitters had the best season. OPS is simple and effective. Runs Created addresses OPS’s shortcomings, and Weighted On Base Average tells us as much as Runs Created and is scaled to on-base percentage, so it’s easy to understand. If we want to include defense, we can look to VORP, Win Shares, or WAR and get a pretty good idea of who the best players were without looking much further.

What’s the best single metric to measure the effectiveness of a pitcher over the course of a season? Thirty years ago, we might have said wins, which were context-dependent even then, but in today’s world of specialized relievers and obsession over pitch counts, it’s downright absurd to think a win is a direct product of a starting pitcher’s effectiveness.

Five years ago, we would have said ERA, as the pitcher’s primary objective is run prevention, and ERA tries to isolate the runs that were the pitcher’s responsibility by ignoring runs scored as a result of a fielder’s error(s). I see two major problems with this. First, any rate stat where a lower score is better ignores volume. If one pitcher throws 160 innings with a 3.25 ERA and another throws 220 innings with a 3.28 ERA, it’s hard to make a case that the pitcher with the lower ERA was more valuable or more effective.

The second problem with ERA, as has become clearer in recent years, is that a pitcher has little control over the result of balls in play. An effective pitcher strikes out more hitters, walks fewer, and gives up fewer home runs. Everything else is largely a function of opposing hitters’ abilities, the defense behind the pitcher, and luck. This is the premise behind FIP, or fielding independent pitching, which isolates the three outcomes within a pitcher’s control and scales them to ERA, effectively predicting a pitcher’s ERA in a defense-neutral environment and with average luck. The problem with FIP, though, is just that: it’s more predictive than reflective. If a pitcher induces more groundouts, gives up fewer singles, and wriggles out of jams when he gets in trouble, he was successful at preventing runs, but FIP understates this success.

Today’s best measure of a player’s contribution to his team’s wins is wins above replacement, or WAR. However, the two keepers of WAR, FanGraphs (fWAR) and Sean Smith, whose bWAR is available at Baseball-Reference, disagree as to what components of a pitcher’s record should be reflected in WAR. FanGraphs bases its pitcher WAR on FIP, essentially evaluating a pitcher’s talent and attempting to predict his future success by ignoring balls in play. Sean Smith uses runs allowed, more accurately reflecting true outcomes, but ignoring fielders’ contributions to those runs. According to FanGraphs, Cliff Lee was the best pitcher in baseball in 2010, worth 7.1 wins above replacement. Baseball-Reference has Lee worth 4.3 WAR, outside the top ten in the American League. Both are valuable tools, but it’s hard to consider one the preeminent pitching evaluation metric when they attempt to reach the same conclusion by measuring the same performance so differently.

That brings us to my choice for the best metric to evaluate a pitcher’s season: Season Score. In an attempt to evaluate a pitcher’s success in a single game, Bill James invented Game Score, which gives a starting pitcher 50 points, then adds:

-One point for each out recorded,
-Two additional points for each inning completed after the 4th, and
-One point for each strikeout.

Then subtracts:

-Two points for each hit allowed,
-Four points for each earned run allowed,
-Two points for each unearned run allowed, and
-One point for each walk.
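
For readers who want to compute it themselves, here is a minimal sketch of the formula as listed above. The function name, the argument names, and the sample pitching line are my own illustrative choices, not anything Bill James published; the only inputs are the counting stats the formula names.

```python
def game_score(outs, strikeouts, hits, earned_runs, unearned_runs, walks):
    """Game Score for one start, following the additions and subtractions listed above."""
    # Innings completed after the 4th: full innings pitched, minus the first four.
    innings_completed = outs // 3
    innings_after_fourth = max(0, innings_completed - 4)

    score = 50                          # every start begins at 50 points
    score += outs                       # one point per out recorded
    score += 2 * innings_after_fourth   # two points per inning completed after the 4th
    score += strikeouts                 # one point per strikeout
    score -= 2 * hits                   # two points per hit allowed
    score -= 4 * earned_runs            # four points per earned run
    score -= 2 * unearned_runs          # two points per unearned run
    score -= walks                      # one point per walk
    return score


# Hypothetical line (not a real game): 7 IP, 5 H, 2 ER, 0 unearned runs, 1 BB, 8 K
example = game_score(outs=21, strikeouts=8, hits=5,
                     earned_runs=2, unearned_runs=0, walks=1)
print(example)  # 66
```

Taking outs rather than innings as the input avoids the ambiguity of baseball's "6.2 innings" notation.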

It’s a simple, fairly unsophisticated statistic that reflects essentially all the important things a starting pitcher sets out to accomplish, namely:

-Pitching as deep into the game as possible,
-Striking out hitters, which puts nearly 100% of the run prevention burden on the pitcher, rather than the fielders,
-Not walking anyone, and
-Preventing runs of any kind.

So why wouldn’t the sum of a pitcher’s Game Scores, or his Season Score, be the most effective way to measure his success over the course of a season? It’s better than wins, because it’s not dependent on the pitcher’s team scoring runs for him. It’s more telling than ERA because it gives extra credit for pitching more games and deeper into games. It’s a better measure of effectiveness than FIP because it measures actual results, rather than punishing Matt Cain-type pitchers who get outs without a lot of strikeouts. By counting all outs and giving extra credit to strikeouts and outs later in games, it’s actually a happy medium between bWAR’s reflective quality (more outs equals more success) and fWAR’s predictive quality (more strikeouts suggest more future success).

Does Season Score have its shortcomings? Sure. It’s biased toward pitchers’ parks, but that can be adjusted at year-end, the same way OPS+ and ERA+ are park-adjusted. I don’t love that it considers hits twice as damaging to a pitcher’s record as walks, but that does reflect that a walk always gives a runner one base, while a hit can be a double or triple, or a single that moves a runner from first to third or scores him from second. I’m also not a huge fan of the distinction between earned runs and unearned runs, as it requires a subjective decision and essentially assumes that runs not the direct result of an error are all created equal. But again, we’re looking for a balance between predictive statistics like FIP and more context-dependent measures of effectiveness like ERA, and weighting earned runs and unearned runs differently accomplishes just that.

One more problem with adding Game Scores to arrive at Season Score is that the 50-point baseline seems arbitrary. I assume Mr. James’s intention in starting at 50 points was to keep most scores positive (it would take something like eight earned runs on 11 hits and two walks with no strikeouts in two innings to get down to zero) and to make a 100 something like a perfect game (the highest Game Score ever recorded, Kerry Wood’s 20-K, one-hit shutout in 1998, was a 105). If two pitchers are in contention for a Cy Young Award and one threw 33 starts and another threw 34, Season Score as we’ve defined it would give a 50-point bonus to the guy with one more start. We want to encourage volume, but not arbitrarily.
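
The arithmetic behind those benchmarks, and behind the extra-start problem, can be checked in a few lines. This is only a sketch: the two pitchers averaging a Game Score of 60 are hypothetical, and the variable names are mine.

```python
# Rough floor described above: 2 IP (6 outs), 11 H, 8 ER, 2 BB, 0 K.
# No inning was completed after the 4th, so there is no inning bonus.
floor_game = 50 + 6 + 0 - 2 * 11 - 4 * 8 - 1 * 2

# Kerry Wood's 1998 start as described above: 9 IP (27 outs, five innings
# completed after the 4th), 20 K, 1 H, no runs and no walks.
wood_game = 50 + 27 + 2 * 5 + 20 - 2 * 1

# Hypothetical: two equally effective pitchers, each averaging a 60 Game Score.
average_game_score = 60
raw_total_33_starts = 33 * average_game_score
raw_total_34_starts = 34 * average_game_score
gap = raw_total_34_starts - raw_total_33_starts

print(floor_game, wood_game)  # 0 105
print(gap)                    # 60, of which 50 is just the baseline
```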

This is where replacement level comes into play. We should assume the pitcher who made 33 starts left his team with one more start from a minor league call-up or a reliever capable of stretching out a few innings. To calculate replacement level, I reviewed a sampling of pitchers who started the sixth or seventh most games for average teams in 2010 and determined that they averaged a Game Score of 45. This makes sense, as five innings of three-run, six-hit ball with three strikeouts and a walk seems like a reasonable expectation of a top prospect or a long reliever making a spot start. Therefore, to calculate Marginal Game Score, I simply subtracted 45 points from every Game Score. Season Score is the sum of a pitcher’s Marginal Game Scores.
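
As a sketch of that calculation, assuming each start's Game Score has already been computed: the function names are my own, the 45-point replacement level is the estimate from the sample described above, and the two pitchers are hypothetical.

```python
REPLACEMENT_LEVEL = 45  # estimated above from 2010 sixth and seventh starters


def marginal_game_score(game_score, replacement=REPLACEMENT_LEVEL):
    """Points a single start was worth above a replacement-level start."""
    return game_score - replacement


def season_score(game_scores, replacement=REPLACEMENT_LEVEL):
    """Sum of Marginal Game Scores over all of a pitcher's starts."""
    return sum(marginal_game_score(gs, replacement) for gs in game_scores)


# The replacement-level line described above: 5 IP (15 outs, one inning completed
# after the 4th), 3 ER, 6 H, 3 K, 1 BB.
replacement_line = 50 + 15 + 2 * 1 + 3 - 2 * 6 - 4 * 3 - 1 * 1

# Hypothetical: two pitchers who each post a 60 in every start.
pitcher_a = [60] * 33
pitcher_b = [60] * 34

print(replacement_line)         # 45
print(season_score(pitcher_a))  # 495
print(season_score(pitcher_b))  # 510
```

The extra start is now worth 15 points, the margin over what a replacement starter would have been expected to provide, rather than the 60 points it adds to a raw sum of Game Scores.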

Let’s put this new number to use. Last year’s AL Cy Young race was highly contentious, as Felix Hernandez was clearly the most effective pitcher in the league, but wasn’t necessarily the most successful, as he went just 13-12 despite a league-leading 2.27 ERA and 232 strikeouts in 249 2/3 innings. Season Score agrees that Hernandez was the best pitcher in the league, scoring him at 627 points (his average Game Score was 63.4, 18.4 points above replacement level, times 34 starts). Jered Weaver, who led the league in strikeouts, finished second with a Season Score of 502, followed by David Price at 440, Cliff Lee at 438, and CC Sabathia at 430. I haven’t adjusted these amounts for park effects, and doing so would clearly benefit Sabathia, who pitched in an environment that inflated scoring by almost 18%, while the others pitched most of their games in pitchers’ parks, but King Felix clearly stands head and shoulders above the field.

In the National League, Season Score again agrees with the writers’ choice, as Cy Young winner Roy Halladay scores a 594, better than Adam Wainwright’s 577. Third-place Ubaldo Jimenez’s 516 was affected heavily by Coors Field’s 36% scoring hike, so a park adjustment would close the gap on the leaders.

Now that we’re two-for-two in identifying each league’s best pitcher (assuming the Cy Young voters were right, which I think they were last season), let’s take a look at 2011’s top performers through Monday’s games. In the National League, Tim Lincecum’s ten-strikeout, three-hit gem last night vaulted him past Roy Halladay and Josh Johnson with a Season Score of 90. Halladay (74) and Johnson (70) each pitch tonight, and either can take the lead with a strong start. This seems to match the naked eye, as Lincecum is 2-1 with a 1.67 ERA, a WHIP below 1, and a league-leading 32 strikeouts in 27 innings. Halladay and Johnson have even better rate stats in one fewer start.

In the American League, it should come as no surprise that the Angels’ two aces, Dan Haren (108) and Jered Weaver (101), top the Season Score leaderboard by a fair margin. Each has an ERA under 1.30 and a WHIP under .80, and while Weaver leads the league with 31 strikeouts, Haren’s 27/2 K/BB ratio is even more impressive. Josh Beckett is a distant third with a 71 Season Score, averaging a lower Marginal Game Score per start (23.7) than Haren (27.0) or Weaver (25.3).

In future posts, I hope to revisit a few other controversial Cy Young Award races using Season Scores, explore the mechanics behind adjusting Season Scores to account for park effects, and continue to update the Season Score leaderboards for each league. In the meantime, I welcome comments as to whether you agree that Season Score is the most effective single metric to evaluate a pitcher’s effectiveness, what shortcomings I’ve missed, and how else we can improve on the Season Score formula or its uses.

5 Responses to Introducing Season Scores

bob says:

April 22, 2011 at 1:30 am

Glad to see someone else who appreciates Game Scores. Last year I ranked pitchers by average game score, and it is amazing how well it highlights good pitchers. I was figuring out the top 30 by game score. What I like is how you can take a game score and use Bill James’s calculation of expected wins to give an average expectation of wins due to starting pitching.

I have the ’87 Abstract where he first proposed it, so I’ve used it a long time. It is nice to see ESPN posts it in their box scores. Although that isn’t crucial, it is so easy to calculate. I’ve got each inning total memorized: 5=67, 6=72, 7=77, etc. Add one point per K & extra out; multiply hits, runs & earned runs by 2, add one for each BB, subtract. What could be easier? A lot easier than calculating ERA in your head. One simple number, very useful.

One problem with numbers of pitching performances is that if you divide 162 games by 5 it comes out as 32.4 so 34 starts seems high.

Pingback: Ranking Rotations | Replacement Level Baseball Blog
Pingback: Start the Weaver Watch | Replacement Level Baseball Blog
Chad says:

May 24, 2011 at 7:00 pm

Ah, yes, but then what about when the Cy Young goes to a reliever? Or when one of the ones who was “wronged” was a reliever? (i.e. the 2005 AL Cy Young race, when Bartolo Colon and his 21 wins beat out Mariano Rivera). What do you do then?

- Bryan says:
  
  May 24, 2011 at 8:15 pm
  
  As relievers are currently used, I don’t think one is ever a reasonable Cy Young candidate. Back when Mike Marshall-types pitched 175 relief innings in a season, there was a reason to consider them, but an 80-inning guy, even if he saves 50 games with a 1 ERA and 100 strikeouts, is not as valuable as a 200-inning starter with a 3 ERA. It was Johan Santana who should have won the 2005 Cy Young.