Predicting hits with Statcast.

Using batted ball information to model hits.

Most of what happens in a baseball game is influenced by chance. A ball hit on the screws can end up in the outstretched glove of a diving fielder. The outfield wall could be just 6 inches too tall, keeping a home run in the park. Strike three could be called ball four by the home plate umpire. Traditional statistics can’t account for all of this, which is why sabermetricians have developed context-specific statistics like DIPS (defense independent pitching statistics) or wRC (weighted runs created). These stats try to explain the outcomes of batted balls while controlling for defense and ballparks.

I set out to create a model that controls for defense, but from the hitter’s perspective. A model that could predict batted ball outcomes could be used to better evaluate hitters and their quality of contact. Using the batted ball statistics in 2017 MLB pitch-by-pitch Statcast data (launch angle, exit velocity, spray angle and outcome), I used a random forest to model whether a batted ball would be a hit or an out. I trained my model on 20% of the data, and felt confident the training set and test set were comparable, with similar means and standard deviations for launch angle, exit velocity and spray angle.
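For anyone who wants to run the same kind of sanity check, here is a minimal sketch of splitting the data and comparing the two halves. The column names, the helper functions and the synthetic numbers are all stand-ins I made up for illustration, not real Statcast rows or my original code.

```python
import random
import statistics


def split_rows(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def summarize(rows, column):
    """Return (mean, sample standard deviation) for one column."""
    values = [row[column] for row in rows]
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical stand-in data: 1,000 synthetic batted balls, NOT real Statcast rows.
rng = random.Random(42)
batted_balls = [
    {
        "launch_angle": rng.gauss(12, 25),    # degrees (made-up distribution)
        "exit_velocity": rng.gauss(88, 13),   # mph (made-up distribution)
        "spray_angle": rng.uniform(-45, 45),  # degrees (made-up distribution)
    }
    for _ in range(1000)
]

# Train on 20% of the rows, hold out the rest.
train, test = split_rows(batted_balls, train_fraction=0.2)

for column in ("launch_angle", "exit_velocity", "spray_angle"):
    train_mean, train_sd = summarize(train, column)
    test_mean, test_sd = summarize(test, column)
    print(f"{column:>14}: train {train_mean:7.2f} (sd {train_sd:5.2f}) | "
          f"test {test_mean:7.2f} (sd {test_sd:5.2f})")
```

If the means and standard deviations line up column by column, the random split has not accidentally carved out an unusual slice of the season.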

I chose a random forest because it fits multiple decision trees, each on a random subset of the training set, and averages their results. Each tree is a series of binary splits on the inputs that ends in a predicted outcome. Averaging across trees reduces variance, which helps prevent overfitting, something I was afraid of doing. The random forest provided much better accuracy than a logistic regression, my alternative model, which I attribute to the number of trees (10) and the nature of a decision tree versus a regression.
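To make the averaging idea concrete, here is a toy sketch of bootstrap aggregation with one-split trees ("stumps"). It is a deliberately simplified stand-in for a random forest: a real one grows deeper trees and also samples the features at each split. The single feature, the 100 mph cutoff and the function names are assumptions for illustration, not my actual model.

```python
import random


def fit_stump(sample):
    """Pick the threshold t with the fewest errors for the rule: hit if x >= t."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagged_stumps(data, n_trees=10, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]


def predict_hit_share(thresholds, x):
    """Average the trees' votes: the share of stumps that call x a hit."""
    return sum(x >= t for t in thresholds) / len(thresholds)


# Hypothetical toy data: one feature (exit velocity in mph), labeled a hit
# whenever it is 100 or more. Real batted balls are nowhere near this tidy.
toy_data = [(ev, int(ev >= 100)) for ev in range(60, 120)]
forest = fit_bagged_stumps(toy_data, n_trees=10)

for ev in (50, 100, 130):
    print(ev, predict_hit_share(forest, ev))
```

Each stump sees a slightly different sample and so lands on a slightly different cutoff. Far from the boundary the trees agree; near it, the averaged vote can sit between 0 and 1, which is the smoothing a single tree cannot provide.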

 

Without further ado, the results (in visual form):

[Figure: Actual Hits & Outs (left) and Predicted Hits & Outs (right)]

There’s quite a bit going on in these plots. Let me break it down.

These plots show every fair ball hit in 2017 (with a few misclassifications), placed where it landed or was caught. The dark blue balls in play are hits, while the light blue balls are outs. On the left are the actual hits and outs, while on the right are the predicted hits and outs. There are almost a hundred thousand points on these plots, making them difficult to sift through. Here is an explanation of these plots in tabular form:

[Table: share of hits and outs predicted correctly]

My model does a much better job at predicting outs than hits. It was correct almost 90% of the time at predicting outs, compared to merely 66% of the time predicting hits. From the perspective of hits being good (the batter’s perspective), 10% of outs were false positives, and 34% of hits were false negatives. I believe my model did better with outs because there are many more outs than hits: league-average BABIP is .300, meaning a ball in play is a hit 30% of the time and an out 70% of the time. The model was accurate 81.4% of the time. Despite the high accuracy, the model produced an R-squared of only .1769. That is, the model was able to describe 17.7% of the variance in batted ball results.
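High accuracy next to a low R-squared can look contradictory, so here is a small worked sketch of how the two can coexist. The counts are hypothetical round numbers per 1,000 balls in play, chosen only to echo the rates above, and computing R-squared on the 0/1 predictions is an assumption for illustration rather than a record of how my figure was produced.

```python
def classification_summary(actual, predicted):
    """Summarize 0/1 labels (1 = hit, 0 = out) against 0/1 predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # hits called hits
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # outs called outs
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # outs called hits
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # hits called outs
    n = len(pairs)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in pairs)          # one per wrong call
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # spread of outcomes
    return {
        "accuracy": (tp + tn) / n,
        "hit_recall": tp / (tp + fn),
        "out_recall": tn / (tn + fp),
        "always_out_accuracy": (tn + fp) / n,  # baseline: call everything an out
        "r_squared": 1 - ss_res / ss_tot,
    }


# Hypothetical counts: 300 hits (66% called correctly) and
# 700 outs (90% called correctly). Not the model's real confusion matrix.
actual = [1] * 300 + [0] * 700
predicted = [1] * 198 + [0] * 102 + [0] * 630 + [1] * 70

summary = classification_summary(actual, predicted)
for name, value in summary.items():
    print(f"{name:>20}: {value:.3f}")
```

In this made-up example the accuracy is 0.828, but calling every ball an out would already score 0.700, and the R-squared works out to about 0.18. Because outs dominate, most of the accuracy comes for free, while R-squared only credits the improvement over guessing the average.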

Overall, I feel this model can help predict batted ball results. Two main drawbacks of the model are that it only predicts hits instead of the type of hit and that it requires more data to increase accuracy. I believe having fielder data, such as shifts and defensive capabilities, would greatly increase the accuracy of the model, though at the risk of overfitting (given the small samples of fielded balls in certain areas).

I plan to explore this model further, and look at individual batters to compare their actual hits to the predicted ones.

 

– tb

 

Edit: My post was featured on FanGraphs’ community research section! It can be found here:

https://www.fangraphs.com/community/an-attempt-to-predict-hits-with-statcast/

A consideration on rounding with Barrels and Statcast.

My thoughts as I analyze the consistency of Barreled Batted Balls calculations.

Recently, I began exploring quality of contact statistics through Baseball Savant’s Statcast search engine. My goal was to either affirm or disprove Mike Podhorzer’s recent theory on the increase in MLB HR/FB%. In short, Mike theorized that the driving factor in this league-wide increase in HR/FB rate is an increase in hard-hit balls hit at ideal launch angles.

A Barreled ball is one struck ridiculously well, according to launch angle and exit velocity. Specifically, a Barreled ball would, on average, be expected to produce a batting average of .500 or greater and a slugging percentage of 1.500 or greater. In advanced terms, a Barrel in general would result in a wOBA above or equal to .950. For a batted ball to be classified as Barreled, it has to satisfy four conditions (forgive me for short-handing mathematical equations): exit velocity * 1.5 - launch angle >= 117, exit velocity + launch angle >= 124, launch angle <= 50, and exit velocity >= 98. These conditions were provided to me by Tangotiger.
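Written out as code, the four conditions above look something like this. It is a minimal sketch: the function name is mine, and it assumes exit velocity in mph and launch angle in degrees.

```python
def is_barrel(exit_velocity, launch_angle):
    """Apply the four conditions listed above (mph and degrees)."""
    return (
        exit_velocity * 1.5 - launch_angle >= 117
        and exit_velocity + launch_angle >= 124
        and launch_angle <= 50
        and exit_velocity >= 98
    )


# Hypothetical batted balls, picked to trip each condition in turn.
print(is_barrel(98, 28))    # True: passes all four
print(is_barrel(98, 31))    # False: 98 * 1.5 - 31 = 116, below 117
print(is_barrel(98, 20))    # False: 98 + 20 = 118, below 124
print(is_barrel(105, 55))   # False: launch angle above 50
print(is_barrel(97.5, 28))  # False: exit velocity below 98
```

At exactly 98 mph the first two conditions leave only a narrow window of launch angles, from 26 to 30 degrees, and the window widens as exit velocity climbs.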

While cleaning and organizing the data, I realized there were inconsistencies with Barrel classifications amongst batted balls. Namely, there are, in my estimation, 567 more Barreled balls classified through the Statcast search than there are through applying the Barrel conditions directly.

The root cause of this discrepancy is rounding of exit velocity and launch angle prior to condition testing. This causes hundreds of false positive Barrel classifications. For example: I noticed the minimum exit velocity in the group of Barrels exported from Statcast was 97.5 mph. One of the four conditions for Barrel classification, however, is that the batted ball’s exit velocity has to be 98 mph or above.

As far as I can tell, all of the false positives are within a half mph of the exit velocity condition. The majority of them only fail the exit velocity minimum condition. Some of them, however, fail a second condition. Since the rounding occurs prior to condition testing, the calculations to test the conditions are inaccurate. The first condition listed above (exit velocity * 1.5 – launch angle >= 117) failed in many cases, though close enough to round into satisfying the condition. A select few times, though, the calculations were far enough off to not satisfy the condition despite rounding the calculation. Looking back at the batted balls not classified as Barrels, there are some batted balls that are similar to those misclassified.
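Here is a small sketch of how rounding first can flip a classification. It assumes the rounding is to the nearest whole number with halves rounded up, which is my guess at the behavior rather than anything documented, and the batted balls are hypothetical examples, not rows from my export.

```python
import math

# The four Barrel conditions, each paired with a readable label.
CONDITIONS = [
    ("exit_velocity * 1.5 - launch_angle >= 117",
     lambda ev, la: ev * 1.5 - la >= 117),
    ("exit_velocity + launch_angle >= 124", lambda ev, la: ev + la >= 124),
    ("launch_angle <= 50", lambda ev, la: la <= 50),
    ("exit_velocity >= 98", lambda ev, la: ev >= 98),
]


def failed_conditions(ev, la):
    """List the labels of every condition this batted ball fails."""
    return [name for name, check in CONDITIONS if not check(ev, la)]


def round_half_up(value):
    """Round to the nearest integer, halves going up (assumed behavior)."""
    return math.floor(value + 0.5)


def is_barrel_raw(ev, la):
    """Test the conditions on the measurements as recorded."""
    return not failed_conditions(ev, la)


def is_barrel_rounded(ev, la):
    """Round both inputs first, then test the conditions."""
    return not failed_conditions(round_half_up(ev), round_half_up(la))


# Hypothetical batted balls (exit velocity in mph, launch angle in degrees).
for ev, la in [(97.5, 28.0), (97.5, 29.8), (96.9, 28.0)]:
    print(ev, la, is_barrel_raw(ev, la), is_barrel_rounded(ev, la),
          failed_conditions(ev, la))
```

The first ball fails only the exit velocity minimum and becomes a Barrel once 97.5 rounds to 98. The second fails two conditions on the raw numbers (97.5 * 1.5 - 29.8 = 116.45) yet still passes after rounding to 98 and 30. The third rounds to 97 and is not a Barrel either way.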

This post is more of a warning than anything else. One who chooses to use Barrel batted ball classifications should either test the four Barrel conditions themselves or rely on Statcast’s method of rounding prior to testing. Whichever method is used should be clearly mentioned, for clarity of information and reliability of process. This allows for potential replication of analysis – communal confirmation of a study’s results.