Tuesday, September 29, 2015

Batting Average / BABIP Gap

Albert Pujols is having an odd season. According to reporter Pedro Moura:
As of press time (after September 28th's games), he's batting .237. That's a -26 point gap between his average and BABIP.

As a refresher, BABIP is Batting Average on Balls In Play. It's (Hits - Home Runs) / (At Bats - Strikeouts - Home Runs). The idea is that the batter has the least control over balls he hits onto the field, so the stat removes the at-bats in which the ball was not put into play. The league average hovers within a few points of .300 almost every year. A hitter's BABIP has to be taken in context with other numbers. Alone, it's not always clear whether it means the batter is hitting the ball really well and fielders can't get to it, or the batter is just getting lucky and hitting weak balls that fall in for hits.
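To make the formula concrete, here is a minimal R sketch. The stat line is made up for illustration (it is not any real player's season), and the helper names `avg` and `babip` are placeholders. It uses the simplified formula above as written.

```r
# Batting average: hits per at-bat.
avg <- function(H, AB) H / AB

# BABIP as defined above: drop home runs from the hits, and drop
# home runs and strikeouts from the at-bats.
babip <- function(H, HR, SO, AB) (H - HR) / (AB - SO - HR)

# Hypothetical season, invented for illustration only.
H <- 150; HR <- 20; SO <- 100; AB <- 550

season_avg   <- avg(H, AB)            # 150 / 550
season_babip <- babip(H, HR, SO, AB)  # 130 / 430

# Gap in points, using this post's sign convention (BABIP minus average).
gap_points <- (season_babip - season_avg) * 1000
cat(sprintf("AVG %.3f  BABIP %.3f  gap %+.0f points\n",
            season_avg, season_babip, gap_points))
```

A hitter like this one, with a fair number of strikeouts and a modest share of home runs, ends up with a positive gap.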

I expressed the difference above as -26 points because, for most batters, BABIP is higher than batting average (strikeouts come out of the denominator). Among hitters qualifying for the batting title, the league average BABIP is .311* and the batting average is .282. That's a difference of +29 points.

Pujols, on the other hand, is nearly that far in the other direction. His BABIP is 26 points lower than his batting average. Generally, players with a higher batting average than BABIP hit a high percentage of their hits for home runs and don't strike out a lot.
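A little algebra on the two formulas shows when the gap goes negative: BABIP drops below batting average exactly when home runs make up a larger share of a hitter's hits than strikeouts make up of his outs (at-bats minus hits), that is, when HR / H > SO / (AB - H). The R sketch below checks that on two made-up stat lines. Neither is a real player's season, and the function names are placeholders.

```r
# Gap in points, BABIP minus batting average (the post's sign convention).
gap_points <- function(H, HR, SO, AB) {
  ((H - HR) / (AB - SO - HR) - H / AB) * 1000
}

# TRUE when home runs are a bigger share of hits than strikeouts are of outs.
hr_share_beats_so_share <- function(H, HR, SO, AB) {
  HR / H > SO / (AB - H)
}

# Two hypothetical hitters, invented for illustration only.
hitters <- data.frame(
  label = c("slugger who rarely whiffs", "typical hitter"),
  H  = c(160, 150),
  HR = c(40, 15),
  SO = c(60, 110),
  AB = c(550, 550)
)

hitters$Diff      <- with(hitters, gap_points(H, HR, SO, AB))
hitters$Condition <- with(hitters, hr_share_beats_so_share(H, HR, SO, AB))
print(hitters)
```

The sign of `Diff` lines up with `Condition` in both rows: negative for the slugger, positive for the typical hitter.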

Following Sam Miller, I ran the numbers for all qualified hitters of the modern era (since 1988).

pacman::p_load("sqldf") 

batters = read.csv("../r-scripts/CSV/lahman/Batting.csv") 
batters.modern = sqldf('select playerID, yearID, sum(AB) as AB, sum(H) as H, sum(HR) as HR, sum(SO) as SO from batters where yearID > 1988 and AB > 480 group by playerID, yearID') 

# Rate stats per player-season. Diff is in points: negative means BABIP is below batting average.
batters.modern$AVG = batters.modern$H / batters.modern$AB
batters.modern$BABIP = (batters.modern$H - batters.modern$HR) / (batters.modern$AB - batters.modern$HR - batters.modern$SO)
batters.modern$Diff = (batters.modern$BABIP - batters.modern$AVG) * 1000
# Share of hits that were home runs, as a percentage
batters.modern$HomerPct = batters.modern$HR / batters.modern$H * 100

higherAvg = (batters.modern[batters.modern$Diff < 0, ]) 
higherAvg = na.omit(higherAvg) 

print(higherAvg[order(-higherAvg$Diff), ]) 
cat("Mean difference:", mean(batters.modern$Diff), "\n") 
cat("Mean Batting Average:", sum(batters.modern$H) / sum(batters.modern$AB), "\n")
cat("Mean BABIP:", (sum(batters.modern$H) - sum(batters.modern$HR)) / (sum(batters.modern$AB) - sum(batters.modern$HR) - sum(batters.modern$SO)), "\n")

# Fangraphs data for 2015 -- It has BABIP already. Nice! Go to bat!
fg2015 = read.csv("BABIPvsBA/FanGraphs Leaderboard-2015.csv")
fg2015$Diff = (fg2015$BABIP - fg2015$AVG) * 1000

higher2015 = fg2015[fg2015$Diff < 0, ]

Turns out Pujols's -26 would be a modern record, if it weren't for... Albert Pujols. Since 1988, there have been 154 such seasons. Pujols holds the largest negative difference: -37.3 in 2006. It's not even close: the next largest is Pujols again, with -27.4 in 2004. Jose Bautista's 2010 season follows at -24.5. Bautista will drop to 4th if Pujols keeps up his pace this season. (2015 data from FanGraphs confirms no one else is lower than -13. Pujols is unrivaled.)

Of those 154 seasons, 11 belong to Pujols. (This will most likely be his 12th.) That covers every season of his career other than his rookie year and his injury-marred 2012-2013. In the modern era, Rafael Palmeiro had 10 such seasons, Gary Sheffield had 9, and no one else has more than 6.

As a Cardinal fan, it's sad to see ALBERT FREAKING PUJOLS turn into Albert "Batting .237 and sharing the lineup with Mike Trout" Pujols. But those BABIP numbers tell a story. Pujols has not only hit a large percentage of his hits for home runs, he's at the same time limited his strikeouts (a rare combination). That punishes his BABIP, since fielders are bound to be standing in the way of a certain number of the balls he puts into play. But it means he's rarely giving away an out without at least putting the bat on the ball.

As Pujols continues ascending past the ranks of modern players, he's moving into territory no one has reached in decades. Fittingly, his 12th season will put him into a tie with Cardinals legend Stan Musial. Only Hank Aaron has more, with 13.

* I said above the average hovers around .300. That includes numbers from pitchers, utility players, and others who don't qualify for the batting title. They are typically worse hitters, and bring the numbers down quite a bit.

Monday, August 24, 2015

Mismatched Player Names

Names are one of the unexpected difficulties in software. We think we know how names work, but in actuality, most people just know how names [in their culture] [generally] work. (Side note: Icelandic naming conventions are pretty cool. Bjork isn't going by a mononym like Madonna or Bono. It's the normal convention to refer to Icelanders simply by their first name.)

With that in mind, I did a perusal through the Lahman Baseball Database's Master table to see how many players' listed first and last names don't match up with their playerID.

For the unfamiliar, playerID is normally <first 5 letters of last name><first 2 letters of first name><two digit number>. As far as I can tell, the number is ordered by debut date. For example, there have been four players named Bob Adams. Their playerIDs are adamsbo01, adamsbo02, adamsbo03, and adamsbo04, respectively.
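The mismatch check can be sketched in a few lines of R. This is only an illustration of the idea, not the query actually run against the Master table: the function names are made up, the two rows are typed in by hand from IDs mentioned in this post, and stripping everything except letters from the names is an assumption about how the IDs treat spaces and punctuation.

```r
# Build the expected playerID prefix: first 5 letters of the last name,
# then first 2 letters of the first name, lowercased, letters only.
expected_prefix <- function(nameFirst, nameLast) {
  clean <- function(x) gsub("[^a-z]", "", tolower(x))
  paste0(substr(clean(nameLast), 1, 5), substr(clean(nameFirst), 1, 2))
}

# TRUE when the playerID starts with the prefix its listed name predicts.
id_matches_name <- function(playerID, nameFirst, nameLast) {
  startsWith(playerID, expected_prefix(nameFirst, nameLast))
}

# Two IDs mentioned in this post, entered by hand rather than read from the database.
players <- data.frame(
  playerID  = c("adamsbo01", "bakerfr01"),
  nameFirst = c("Bob", "Home Run"),
  nameLast  = c("Adams", "Baker"),
  stringsAsFactors = FALSE
)
players$matches <- with(players, id_matches_name(playerID, nameFirst, nameLast))
print(players)
```

Bob Adams matches his ID. Home Run Baker does not, because the ID is built from a first name other than the listed one.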

There are exceptions, some boring, some understandable, and at least one nefarious. Let's dig in!

Guys Born Prior to 1900

There are quite a few inconsistencies in name data prior to 1900. Harry Atkin has a playerID adkinhe01. Some rudimentary googling didn't turn up any results, so I'm guessing his name was Henry, he went by Harry and somewhere a newspaper that referenced his playing days spelled his name wrong.

On the other hand, there's Jersey Bakley* (bakelje01). His playerID seems to reflect the fact that, as Wikipedia states (without justifying its preferred spelling), "Sometimes his last name is spelled 'Bakely' or 'Bakeley'".

* Great name, by the way.

Then there's Home Run Baker (bakerfr01), whose nickname seems to have overtaken his actual given name. So prodigious was his power output that he finished his 13-season career with nearly triple-digit home runs (good for a 135 OPS+). Accusations that he may have been a vampire have gone unanswered by the Baseball Hall of Fame.

And what more need be said about Old Hoss Radbourn (radboch01)?

Other nicknames that became their first name:

Yip Owens (owensfr01): I wonder what his throwing motion looked like?

High Pockets Kelly (kellyge01): I can't find a source that explains his nickname. He was 6'4" so it may be his height.

Jumbo Harding (hardilo02): One of a surprising nine Major League players who have gone by "Jumbo". Many of them, including George Warren "Jumbo" Latham and Jose Raphael "Jumbo" Diaz (lathaju01 and diazju03), are so well known as Jumbo that their playerID includes that nickname.

Sy Studley (studlse01): I heartily recommend this as an alias for trying to impress women in bars.

Guys Born after 1900

Coming soon!

This Keeps Coming Up on the Baseball Internet

Is this freaking adorable or what?
It's unbelievable how many people are, and continue to be, wrong on this topic: hot dogs are sandwiches. A sandwich is a food item composed of edible stuff placed in bread, in such a way that it can be eaten by hand.

By "placed in bread" I mean the bread has to already be bread when the items are placed in it (not dough). It may be cooked again after assembly. Therefore:
  • Paninis are sandwiches
  • Calzones are NOT sandwiches
  • Pigs in a blanket are NOT sandwiches
  • Beef Wellington is by no means a sandwich 
By "bread" I mean actual bread. Not some vaguely bread-ish starch/grain/corn product. As such:
  • Corn shells are not bread so tacos are not sandwiches
  • Tortillas are not bread so burritos are not sandwiches 
  • Lettuce wraps are wraps, not sandwiches
The bread can be in any configuration that allows eating by hand. QED:
  • Subs in a cut open loaf of bread (à la Subway) are sandwiches
  • Your prototypical "ham and cheese between two slices of bread" (à la Panera) are sandwiches
  • Shaved or pulled meats on a bun (à la Arby's) are sandwiches 
  • Open faced sandwiches are sandwiches in the same way prairie dogs are dogs; we give them the name by resemblance but we do not fool ourselves that they are the same thing
Hot dogs fit this definition and are sandwiches. Gyros fit this definition and are sandwiches. Hamburgers fit this definition and are sandwiches.

Don't be dumb, be smart.

Tuesday, August 18, 2015

Isolated On Base

If you follow baseball advanced stats at all, you're probably familiar with Isolated Slugging (ISO). It's a stat that separates a player's slugging percentage from his batting average, to get an idea of how much of his slugging is from extra base hits vs. how much is just a load of singles propping up a lack of power (I didn't say anything, Jon Jay, I don't know why you're looking at me like that).

The cool thing about ISO is its simplicity. You can calculate it in your head, just based on the slash line you would get on the back of a baseball card, or on the jumbotron at a stadium.

Is there a comparable stat for OBP? My hypothesis is that we can subtract batting average from OBP to get a fairly good estimate of walk rate. In homage to ISO, I call this stat IBO.

Comparing OBP to average is trickier than comparing slugging to average. Both SLG and AVG are denominated on AB. OBP is more complicated: (H + BB + HBP) / (AB + BB + HBP + SF), per Baseball-Reference.
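Here is a rough R sketch of why the different denominators matter. The counting stats are invented for illustration, and the walk rate is assumed to be walks per plate appearance, with plate appearances approximated by the OBP denominator. That definition is an assumption, not necessarily how BBPct is computed in the database.

```r
# Hypothetical counting stats, invented for illustration only.
AB <- 550; H <- 150; BB <- 60; HBP <- 5; SF <- 5

avg <- H / AB
obp <- (H + BB + HBP) / (AB + BB + HBP + SF)  # formula quoted above
ibo <- obp - avg

# Assumption: walk rate is walks per plate appearance, with PA
# approximated by the OBP denominator (sacrifice bunts etc. ignored).
pa     <- AB + BB + HBP + SF
bb_pct <- BB / pa

cat(sprintf("AVG %.3f  OBP %.3f  IBO %.3f  BB%% %.3f\n", avg, obp, ibo, bb_pct))
```

For this made-up line, IBO comes out lower than the walk rate, so it is a proxy rather than the same number.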

To test this out, I'm using the Lahman Baseball database data from 2014, for all qualified hitters (a total of 135 players).

First step, I created a table to store IBO values. I'm going to get a little hand-wavey here, because I don't want to post all the database structure code. Trust me on this, I'm pretty sure I subtracted one number from another successfully.*

That accomplished, I wrote an R script to pull the data and do some MATH!

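library(RMySQL)

# Connect to the local database (credentials are placeholders)
ammlb = dbConnect(MySQL(), user='myuser', password='mypassword', dbname='AMMLB', host='localhost')

# Pull every qualified hitter's 2014 row from the IBO view
rs = dbSendQuery(ammlb, "select * from vw_QualifiedIBO where yearID = 2014")
tempdata = fetch(rs, n=-1)
query_results = data.frame(tempdata)

# Keep the columns we care about, sorted by walk rate, highest first
leaders <- query_results[c("nameLast", "nameFirst", "BBPct", "IBO")]
leaders <- leaders[order(-leaders$BBPct),]
print(leaders)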

The leaders are who you would expect, the guys who walk a lot:

Last Name First Name BB% IBO
Santana Carlos 0.1712 0.1341
Bautista Jose 0.1548 0.1176
Stanton Giancarlo 0.1473 0.1074
LaRoche Adam 0.1399 0.1027
Carpenter Matt 0.1344 0.1025
Smith Seth 0.1327 0.1009
Werth Jayson 0.132 0.1022
Fowler Dexter 0.131 0.0985
McCutchen Andrew 0.1296 0.0966
Freeman Freddie 0.1271 0.0973
Dozier Brian 0.1264 0.1027
Ortiz David 0.1246 0.093
Crisp Coco 0.1234 0.0902
Granderson Curtis 0.1208 0.0987
Valbuena Luis 0.119 0.0917
Rizzo Anthony 0.1185 0.1001
Trout Mike 0.1177 0.0899
Duda Lucas 0.1158 0.0961
Mauer Joe 0.1158 0.0841
Moss Brandon 0.1155 0.1005
Zobrist Ben 0.115 0.0824
Davis Chris 0.1145 0.104
Encarnacion Edwin 0.1144 0.0859
Teixeira Mark 0.1142 0.0971
Holliday Matt 0.1109 0.0985
Choo Shin-Soo 0.1096 0.0985
Donaldson Josh 0.1094 0.0875
Ramirez Hanley 0.1094 0.0862
Martinez Victor 0.1092 0.0736
Yelich Christian 0.1065 0.0788
Rollins Jimmy 0.1056 0.0799
Crawford Brandon 0.105 0.0774
Puig Yasiel 0.105 0.0867
Howard Ryan 0.1034 0.087
Heyward Jason 0.1032 0.0808
Gordon Alex 0.1011 0.0851
Lucroy Jonathan 0.1008 0.0716
Montero Miguel 0.1 0.0852
Upton BJ 0.0984 0.0786
Carter Chris 0.0979 0.0809
McGehee Casey 0.097 0.0673
Upton Justin 0.0938 0.0719
Beltre Adrian 0.0928 0.0634
Peralta Jhonny 0.0924 0.0735
Cano Robinson 0.0917 0.0677
Plouffe Trevor 0.0911 0.0705
Lowrie Jed 0.0904 0.0719
Kipnis Jason 0.0903 0.0705
Gardner Brett 0.0899 0.0715
Jennings Desmond 0.0882 0.0746
Cabrera Miguel 0.0876 0.0582
Markakis Nick 0.0873 0.0666
Kemp Matt 0.0868 0.0591
Rendon Anthony 0.0852 0.0639
Gonzalez Adrian 0.0848 0.059
Jones Garrett 0.0841 0.063
Pedroia Dustin 0.0837 0.0589
Abreu Jose 0.082 0.0661
Escobar Yunel 0.0819 0.0654
Longoria Evan 0.0815 0.0673
Cruz Nelson 0.0811 0.0625
Bruce Jay 0.0809 0.0643
Eaton Adam 0.0802 0.0615
Utley Chase 0.0798 0.069
Seager Kyle 0.0796 0.066
Aoki Nori 0.0795 0.0643
Walker Neil 0.0789 0.0706
Frazier Todd 0.0788 0.0634
Posey Buster 0.0777 0.0528
Ellsbury Jacoby 0.0773 0.0568
Brantley Michael 0.0769 0.0573
Span Denard 0.0752 0.0533
Freese David 0.0744 0.0612
Chisenhall Lonnie 0.0739 0.0625
Pence Hunter 0.0734 0.055
Gomez Carlos 0.0731 0.0721
Wright David 0.0717 0.055
Kendrick Howie 0.0715 0.0538
Gillaspie Conor 0.0711 0.0537
Calhoun Kole 0.071 0.0534
Desmond Ian 0.071 0.0587
Braun Ryan 0.0707 0.0581
Cabrera Melky 0.0695 0.0495
Pujols Albert 0.0691 0.052
Andrus Elvis 0.0681 0.0508
Butler Billy 0.068 0.052
Martin Leonys 0.0677 0.0508
Ozuna Marcell 0.067 0.048
Castro Jason 0.0665 0.0642
Brown Domonic 0.0664 0.0505
Bogaerts Xander 0.0659 0.0575
Hosmer Eric 0.064 0.0477
Mercer Jordy 0.0636 0.0506
Loney James 0.063 0.0464
Castellanos Nick 0.0622 0.0468
LeMahieu DJ 0.0621 0.0473
Morneau Justin 0.0618 0.0449
Castro Starlin 0.0615 0.0475
Navarro Dioner 0.0615 0.0429
Sandoval Pablo 0.0611 0.0456
Murphy Daniel 0.0607 0.0432
Marte Starling 0.0606 0.0651
McCann Brian 0.0595 0.0539
Ackley Dustin 0.0594 0.0481
Davis Khris 0.0583 0.0552
Reyes Jose 0.0582 0.0408
Infante Omar 0.0579 0.0428
Viciedo Dayan 0.0568 0.0492
Hamilton Billy 0.0565 0.042
Aybar Erick 0.0564 0.0429
Jeter Derek 0.0559 0.047
Simmons Andrelton 0.0557 0.0413
Byrd Marlon 0.0549 0.0484
Hill Aaron 0.0518 0.043
Hardy JJ 0.0512 0.0408
Segura Jean 0.0512 0.0432
Altuve Jose 0.051 0.0359
Blackmon Charlie 0.0483 0.0465
Dominguez Matt 0.0479 0.0417
Gordon Dee 0.0479 0.0371
Cozart Zack 0.0465 0.0464
Gomes Yan 0.0463 0.0343
Adams Matt 0.0462 0.0331
Hechavarria Adeiny 0.0457 0.0315
Rios Alex 0.0441 0.0304
Harrison Josh 0.0401 0.0313
Kinsler Ian 0.0401 0.0322
Ramirez Aramis 0.0395 0.0442
Hunter Torii 0.0392 0.0331
Johnson Chris 0.0378 0.0294
Escobar Alcides 0.0376 0.032
Ramirez Alexei 0.0366 0.0316
Perez Salvador 0.0363 0.0293
Jones Adam 0.0279 0.0298
Revere Ben 0.021 0.0185

So how do those numbers actually correlate?



That looks pretty good... but who are we, the old scouts from Moneyball? Let's do the numbers.

print("Correlation between BBPct and IBO:")
correlation <- cor(leaders$BBPct, leaders$IBO)
print(correlation)
[1] 0.9704839

0.97 correlation. Yeah, that seems alright. So we can get a general sense of what IBO looks like (at least last year):

.090+ Among the league leaders
.065 Above average
.050 Middle of the pack
.040 Below average
.019 Ben Revere

EDIT: it occurs to me that quartiles might be a useful measure, and I figured out you can do it with a single command in R!

quantile(leaders$IBO)
     0%     25%     50%     75%    100%
0.01850 0.04715 0.06150 0.08035 0.13410
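Those boundaries can be turned into quartile labels with `cut()`. The sketch below hard-codes the breaks from the output above and looks up three IBO values copied from the table. The labels "Q1" through "Q4" and the function name are made up for illustration.

```r
# Quartile boundaries copied from the quantile() output above (2014 qualified hitters).
breaks <- c(0.01850, 0.04715, 0.06150, 0.08035, 0.13410)

# Label each IBO value with its quartile; include.lowest keeps the minimum in Q1.
ibo_quartile <- function(ibo) {
  cut(ibo, breaks = breaks, labels = c("Q1", "Q2", "Q3", "Q4"), include.lowest = TRUE)
}

# A few IBO values from the table above.
ibo <- c(Santana = 0.1341, Pujols = 0.0520, Revere = 0.0185)
print(data.frame(player = names(ibo), IBO = ibo, quartile = ibo_quartile(ibo)))
```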

So that's nice, but can we use it to extrapolate walk rate?

Tune in next time!

* Ha! Double check on me when this gets up to BitBucket. I'm working on getting it ready to open source.