Industry Spotlight: Baseball (Sports) Statistics - Statistics.com: Data Science, Analytics & Statistics Courses

The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Society for American Baseball Research (SABR) in 1971 (the same year that Satchel Paige entered the Hall of Fame). Analytics has come a long way in sports, and now has its own conference, the MIT Sloan Sports Analytics Conference. This is an outlier in the world of statistics conferences, with its sky-high registration fees, its sponsorship by ESPN, and speakers like Malcolm Gladwell.

The attention-grabber at this year’s conference, held several weeks ago, was curling. Yes, curling! At a casual first glance, this game, in which participants slide stones along an ice lane, would seem to be an unlikely candidate for sophisticated analytics. And a casual read of the research paper on curling analytics presented at the conference would suggest that this is not a sport open to quick study. Consider the following excerpt, for example:

“Teams with hammer who attempted to place their stones fully behind centre guards early in ends gave up more steals…”

Baseball can be equally impenetrable, but is more quantitative – I once heard this summation of a game situation on the radio:

“2 to 2. 2 on, 2 out, 2 in, 2 and 2.”

Translation: The score is tied at 2, there are two runners on base, two outs, and two runs have scored so far in the inning. The count on the batter is 2 balls and 2 strikes. (The announcer, Jon Miller, is a master of the English language and the spoken story; he took advantage of the situation to tell a story using only the number 2 and prepositions.)

The application of statistical and quantitative analysis to baseball has made steady advances over the years; the publication of Michael Lewis’s “Moneyball” in 2003 was an important milestone. Moneyball described the Oakland Athletics’ systematic statistical approach to evaluating talent, and the cost-benefit approach it used in making player acquisition decisions.

The role of analytics in sports, and baseball in particular, has received growing attention, but analytics really comes in several flavors.

1. Mainly descriptive. Does this batter tend to hit fly balls or ground balls? In which direction?

2. Analytics for optimization (often involving simulation). Which will enhance the chances of winning more – upgrading the shortstop position with a $20 million superstar, or spending that money on two decent relief pitchers?

3. Distinguishing noise from signal. As more and more numbers are crunched and statistics are produced, the risk grows that teams and players will be trying to act on detailed data whose real value may be questionable.

Flavor #2 was the first to gain real traction in the baseball world. Player evaluation and acquisition involve huge sums of money, and management welcomed the light that analytics could shed. Moreover, the application of analytics in this sphere did not require that the players and coaches understand anything about it.

A Statistical “Anti-Analytics” Strategy?

The central matchup in baseball is that between the pitcher and batter, and the batter’s success is significantly enhanced if he can successfully guess what type of pitch is coming (pitches differ by location, speed, and spin, which determines trajectory). Studying a pitcher’s choice of pitch, both as a function of the batter and as a function of the situation, may give the batter an idea of what pitch is coming up in a given setting. But suppose the pitcher, after ruling out the pitches that are clearly unacceptable in a given situation, were to choose his next pitch randomly? Wouldn’t this take away the batter’s analytics advantage?

I asked Ben Baumer, former statistician for the NY Mets, now a professor at Smith College and instructor for our SQL course, about this. He agreed that random pitch selection would be an optimal strategy, but asked “how would you execute it?” After some thought, I realized that this is indeed a significant problem. How does the catcher (who calls the signals for the pitches) obtain a random number or other signal? Roll a die on home plate? And how would he map that signal to the set of feasible pitches, which may change from one setting to the next?
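
To make the mapping problem concrete, here is a minimal Python sketch of the selection step alone. The repertoire, the rule for ruling out pitches, and the function names are all made up for illustration; only the idea of drawing at random from whatever pitches remain feasible comes from the discussion above. The draw is uniform here, though nothing in the argument requires equal weights, and the sketch does nothing to solve the practical problem of getting a random signal to the catcher.

```python
import random

# Hypothetical repertoire; the pitch names are illustrative assumptions.
REPERTOIRE = ["fastball", "curveball", "slider", "changeup"]


def feasible_pitches(balls, strikes):
    """Return the pitches not ruled out in this count.

    The rule below is invented for illustration: with three balls,
    assume the pitcher drops the pitches that are hardest to control.
    """
    if balls == 3:
        return ["fastball", "changeup"]
    return list(REPERTOIRE)


def choose_pitch(balls, strikes, rng=random):
    """Pick uniformly at random among the currently feasible pitches."""
    return rng.choice(feasible_pitches(balls, strikes))


if __name__ == "__main__":
    # The feasible set changes with the situation, so the mapping
    # from random draw to pitch has to be rebuilt for every pitch.
    print(choose_pitch(balls=2, strikes=2))
    print(choose_pitch(balls=3, strikes=2))
```

Passing a seeded `random.Random` as `rng` makes a run reproducible, which is convenient for testing but is exactly the kind of predictability a pitcher would want to avoid.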

Back to curling…

Curling’s appeal for analytics lies in its sequential series of choices that are both quantifiable and uncertain. Both teams in turn slide their stones towards a target, with scoring determined by the proximity of a team’s stones to the center of the target at the conclusion of a round of play. The quantifiable dimensions of play include the paths of the stones, their possible final positions, the possibility of knocking another stone out or of being knocked out, and the possibility of a stone serving to impede other stones’ access. The variables (including the need to look out one, two, three, etc. moves ahead) are complex enough that systematic modeling can provide information about optimal strategies.
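
The scoring rule itself is simple enough to write down, and any model of strategy would need it as a building block. Under the standard rules, only one team scores in a round: one point for each of its stones (the blocks described above) that lies closer to the center than the opponent's closest stone. What follows is a minimal Python sketch, assuming each stone is summarized by its distance from the center and that stones beyond a cutoff do not count. The cutoff of 1.83 m is roughly the radius of the 12-foot target and ignores the width of the stone; the team labels and the function name are hypothetical.

```python
def score_end(red, yellow, limit=1.83):
    """Score one round of play.

    red, yellow: distances (in metres) of each team's stones from the
    center of the target. Stones farther than `limit` are ignored
    (a simplifying assumption). Returns (scoring_team, points), or
    (None, 0) when nobody scores.
    """
    red = sorted(d for d in red if d <= limit)
    yellow = sorted(d for d in yellow if d <= limit)

    if not red and not yellow:
        return (None, 0)  # no stones count: a blank round
    if red and yellow and red[0] == yellow[0]:
        return (None, 0)  # exact tie; in practice this would be measured

    if not yellow or (red and red[0] < yellow[0]):
        winner, own, other = "red", red, yellow
    else:
        winner, own, other = "yellow", yellow, red

    # Count the winner's stones that beat the opponent's best stone.
    best_other = other[0] if other else float("inf")
    points = sum(1 for d in own if d < best_other)
    return (winner, points)


if __name__ == "__main__":
    print(score_end([0.3, 0.5, 1.2], [0.9, 2.5]))
```

In the example call, red has two stones (0.3 m and 0.5 m) inside yellow's best stone at 0.9 m, so red scores two; red's third stone at 1.2 m does not count.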