Baseball's Twins deploy Databricks to improve analytics power
With the amount of data used to make player personnel decisions in baseball growing exponentially, the Minnesota Twins began using Databricks to improve their modeling capabilities.
Analytics have always been a huge part of baseball.
Long before anyone heard the term analytics, statistical tendencies were the drivers behind decisions from Major League Baseball all the way down to children playing on sandlots and open fields. Baseball analytics involved choices ranging from putting a team's most powerful hitters in spots where they'll have the most potential impact to simply shifting defensively toward the right side of the field when a left-handed batter comes to the plate.
But over the last two decades there's been an analytics explosion in Major League Baseball, beginning with the realization that statistics such as on-base percentage and slugging percentage are more accurate measures of a player's value than batting average and runs batted in. Now, teams have moved well past simple compilation and computations and are able to do things like analyze the spin rate on curveballs and sliders to help determine the potential long-term effectiveness of pitchers.
Meanwhile, as the financial gap between large- and small-market teams has grown with the absence of a salary cap (teams must pay a luxury tax if their payroll reaches a certain point, but they are free to go over that and pay the fine if they choose), the importance of analytics has grown in baseball to maintain a modicum of competitive balance between the financial haves and have-nots.
One of the small-market teams that has most effectively remained competitive over the past two decades has been the Minnesota Twins. Among teams playing in a market ranked in the bottom half of MLB in terms of population, beginning in 2000 only the Oakland A's (10) and St. Louis Cardinals (13) have reached the playoffs more times than the Twins' seven. And after an eight-year drought from 2011-18, they went 101-61 last year to win the American League Central Division and are 10-6 to start this season.
Analytics, not surprisingly, are an important part of the Twins' decision-making process, and this winter the franchise started working with big data and machine learning vendor Databricks to take its analytical capabilities to a new level.
Jeremy Raadt, the Twins' director of baseball systems, and Zane MacPhee, the team's coordinator of professional scouting research and development, recently discussed the Twins' adoption of Databricks to help develop predictive models -- most having to do with the reams of new pitching data available to teams -- and quickly run millions of simulations on those models in order to make player personnel decisions more quickly.
In addition, they spoke about the Twins' dedication to analytics, how difficult it can be to stay ahead of other teams' analytics capabilities, and even one of the players analytics helped identify who other teams had overlooked.
When did the Twins start using Databricks to help analyze and predict player performance?
Jeremy Raadt: That started last winter. Over the last year or so our team -- the R&D team -- has grown quite a bit, and along with that the amount of data in the sports world has just exploded over the past few years with Statcast, sensors and other real-time data now available to us. It came to a head this winter where some of our models were taking days, even weeks, and we projected some of them would take years if we really wanted to do as much as we wanted to do. We knew we needed some different tools in our toolkit to be able to handle it, so we started looking around at different things.
The Twins are pretty Microsoft-centric, so we use Azure for everything based in the cloud, so we used Azure for some things. But we were kind of [patching things together] to make it work and we knew there was a better way. We started exploring Apache Spark [from Databricks] and other things, and Databricks has a rich integration with Azure so that's why it popped up on our radar as something we wanted to look into. Around December we started talking with Databricks to understand a little more about what it is, and in January we started working with them. They showed us how to use Databricks, best use cases for sports, and really helped us along because Databricks is more than just Spark. It's a whole ecosystem of tools.
Did you look at any other analytics platforms?
Raadt: We looked at a lot of the Hadoop sites, we looked at the [Amazon Web Services] space a little bit, and we spent some time looking at Google Cloud, some of the BigQuery stuff because MLB has moved a lot of its stuff to BigQuery. In the end, what we fell in love with about Databricks is that it's that whole ecosystem of things versus just solving one specific problem. We realized we needed to solve a lot of problems, like how to store models, how to test them out and how to get a whole series of new analysts all growing in the same direction. There was more of a predefined recipe with Databricks to piece things together.
What were you looking to apply to baseball with Databricks that you couldn't with your previous analytics tools?
Zane MacPhee: We've got a lot of varied data sources, but we're using fairly large data with all our pitch and player tracking data, so we were getting to a point with our analysis where we wanted to make a stepwise improvement. We realized that the level of simulations we wanted to run and the number of questions we needed to answer were going to require restructuring how analysts developed their models and the deployment of those models, so we wanted to improve our development feedback loop. We wanted to cut down on training times of models, and if we wanted to deploy a couple million simulations, we needed to use a technology that would allow us to do that in an interactive way … and allow us more instant feedback. When you're evaluating a player or a trade, you don't have a week to make that decision. You have about a day, so you want to give decision-makers instant feedback.
Raadt: There are also a lot of 'what-if' questions we want to ask, like what would happen if we change a pitcher's pitch mix, or what if we tweak the pitcher's curveball to get it to do this type of action? All those what-if questions where you don't have that historical data, we want to be able to simulate millions and millions of times to get a good answer. That's where we needed additional horsepower so it wouldn't take weeks to generate the simulations.
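To make the idea of a what-if simulation concrete, here is a minimal Monte Carlo sketch of the pitch-mix question Raadt describes. It is purely illustrative and is not the Twins' model: the pitch types, the per-pitch swing-and-miss probabilities and the function name `simulate_whiff_rate` are all made-up assumptions, and a real model would condition on far more than pitch type.

```python
import random

# Hypothetical swing-and-miss probability per pitch type (made-up numbers,
# not real data for any pitcher).
WHIFF_PROB = {"fastball": 0.08, "curveball": 0.14, "slider": 0.17}


def simulate_whiff_rate(pitch_mix, n_pitches=200_000, seed=0):
    """Estimate the overall whiff rate of a pitch mix by simulating pitches.

    pitch_mix maps pitch type -> share of pitches thrown (shares sum to 1).
    """
    rng = random.Random(seed)
    pitch_types = list(pitch_mix)
    weights = [pitch_mix[p] for p in pitch_types]
    # Draw the sequence of pitch types according to the mix.
    thrown = rng.choices(pitch_types, weights=weights, k=n_pitches)
    # Each pitch independently produces a whiff with its type's probability.
    whiffs = sum(rng.random() < WHIFF_PROB[p] for p in thrown)
    return whiffs / n_pitches


# The mix the pitcher throws today vs. a what-if mix with more sliders.
current_mix = {"fastball": 0.60, "curveball": 0.25, "slider": 0.15}
what_if_mix = {"fastball": 0.45, "curveball": 0.25, "slider": 0.30}

current_rate = simulate_whiff_rate(current_mix)
what_if_rate = simulate_whiff_rate(what_if_mix)
print(f"current mix whiff rate: {current_rate:.3f}")
print(f"what-if mix whiff rate: {what_if_rate:.3f}")
```

A toy like this runs in well under a second on a laptop. The scale problem Raadt describes comes from repeating such simulations across many pitchers, pitch characteristics and scenarios, which is where a distributed engine such as Spark becomes useful.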
What are some of the more advanced baseball statistics you're now incorporating into your analytics that go beyond what a fan might see in a wins above replacement [WAR] formula?
MacPhee: In the public sphere now you're beginning to see the use case and model building around pitch tracking data. MLB teams have had access to this data for many years, and also we have coverage at the minor league level, so in terms of advanced metrics, they're all around this player tracking data. At the pitch level, we have the ability to assess from a model building and systems level the value of certain pitch types and pitch movements. That's generally what those new metrics look like and the big difference between a Major League team and what the public-facing information is.
When you see the results of millions of simulations, what are you seeing -- is it just a more advanced value score or are you getting a report with a detailed explanation?
Raadt: From Databricks we get a raw version of the valuation we're trying to do, so when we previously had to keep valuation at a higher level, now we're able to evaluate the value of a certain type of break on a certain type of pitch thrown in a certain location. We can get really fine-grain now and then build up the valuations from there versus having to keep it high-level before because it would take too long to generate or simulate that type of data. It's being able to get more fine grain in order to tease the luck out.
Can you give an example of how what you're able to do now with analytics has led to a baseball decision?
MacPhee: We get basically 100 metrics from every single pitch, and from there we can start building models with help from Databricks on the infrastructure side. We can simulate that pitch in different locations, simulate that pitch against different players, and that allows us to then build those models from the base level and allows us to quantify some of the uncertainty around observed performance and determine how much was skill and how much was luck. It's a big testament to the technology Databricks can provide that we can handle that amount of data in an efficient way.
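The skill-versus-luck question MacPhee raises can be illustrated with a small simulation. The sketch below assumes a hypothetical pitcher whose true swing-and-miss rate is fixed at 12% and replays samples of different sizes many times to show how much the observed rate moves through chance alone. The rate, the sample sizes and the function name `simulate_observed_rates` are assumptions chosen for illustration, not anything the Twins described.

```python
import random
import statistics


def simulate_observed_rates(true_rate, n_pitches, n_trials=1_000, seed=1):
    """Replay a sample of n_pitches many times for a pitcher of fixed skill.

    Returns the observed success rate from each simulated sample.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        successes = sum(rng.random() < true_rate for _ in range(n_pitches))
        rates.append(successes / n_pitches)
    return rates


TRUE_RATE = 0.12  # hypothetical true skill, held constant in every trial

small_sample = simulate_observed_rates(TRUE_RATE, n_pitches=200)
large_sample = simulate_observed_rates(TRUE_RATE, n_pitches=2_000)

# Skill never changes here, so any spread in observed rates is pure luck.
spread_small = statistics.pstdev(small_sample)
spread_large = statistics.pstdev(large_sample)
print(f"luck-driven spread over 200 pitches:   {spread_small:.4f}")
print(f"luck-driven spread over 2,000 pitches: {spread_large:.4f}")
```

Because skill is constant in every trial, the spread shrinks only as the sample grows. Comparing a pitcher's actual results against that kind of simulated spread is one simple way to judge how much of an observed performance could be noise.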
Raadt: And there are pitchers in our organization that are here because the data helped back up the scout -- the data will never be the one and only answer -- and create cases for certain players. There are players in our organization the data created a strong case for, and then also it creates strong cases for different types of development once they're in our system.
Is the next-level analytics at this point most applicable to pitching and not as much to hitting and fielding?
Raadt: Most of the data we have right now is on pitching. We have so much data available for that, and less so for hitting. But there are definitely different sensors we use to capture hitting data. The new Statcast system is able to do a lot with the trajectory of the bat and things like that. It's pretty exciting. Defense has always lagged behind, but now the new Statcast system can get skeletal points on every fielder every fraction of a second and the data is exploding so what we'll be able to do from a fielder's standpoint is pretty exciting.
How difficult is it now in Major League Baseball to stay ahead of the analytics curve and be at the forefront the way Billy Beane, one of the early pioneers of modern baseball analytics, was with the Oakland A's 20 years ago?
MacPhee: It's an arms race -- maybe that's a little overkill, but it's investing a lot of people resources, money resources and time into not only the data collection, but into making the data actionable as quickly as possible to improve player evaluations and find players that are maybe undervalued in the market. From the Twins' perspective, we kind of saw this coming. We're only going to get more data, and we need the infrastructure to allow us to ingest and integrate it into player acquisition and evaluation as responsively as possible.
Raadt: Every team has access to a similar amount of data -- and it's an absolute mountain of data -- but what we've learned is data is great but it's not valuable if it's not actionable, so we challenge ourselves to make sure we're being actionable with the data and we can react fast. That's where a lot of the competitive advantage lies. It's that speed that goes back to Databricks that allows us to tease out the luck faster than others can.
How important is the commitment to pushing analytics in baseball to keeping a small-market team like the Twins competitive?
Raadt: It's really important. We had new leadership come in a few years ago. They brought a really strong evidence-based approach. Not every decision we make is going to be the winning decision, but if we keep to the evidence and keep making decisions based on that, we're going to win more than we're going to lose in our decision-making. They've invested strongly in technology and analytics, and embedding it into each part. Instead of having siloed little areas in the baseball department, we're a lot more together and we have analytics embedded into player development, into scouting, into acquisitions.
MacPhee: Adding to that, from the leadership level, they're analytics-based and they want to know all the information possible when they're making a decision. That's information we provide at a systems level and it's also scouting information. They want all the information possible to make the best decisions. And on a cultural level, we're incredibly curious, even about what other teams are doing. If we get a call on a player from a team we have a lot of reverence for in the analytics space, we're wondering what they're thinking that we're not so that we won't get beaten on that player. That kind of thinking pushes us forward and makes sure we're never resting.
Is there a player analytics helped you identify who other teams missed on, like Scott Hatteberg 20 years ago with the A's who was highlighted in Moneyball (the book by Michael Lewis that documented the start of the baseball analytics movement)?
Raadt: One success story right now where analytics played a part but it was also our scouting is Randy Dobnak, who has blossomed as a starting pitcher. He was someone who was playing independent baseball and driving an Uber just a few years ago. He's a cool story about finding the value when you marry scouting with analytics. When you can get that together and both sides agree, it's incredibly powerful.
MacPhee: That's a story about synergy between departments across baseball operations. It's a testament to our independent baseball scout who identified him early on, and then, once he was in our system, to using this evidence-based approach.
Where are analytics in baseball headed?
Raadt: I think there's going to be a lot on the medical side and [mitigating] fatigue -- the training and player performance area. In the past when we've tried to monitor workloads and how much strain people put on their bodies, it was a lot more using your eyes and being subjective. Now, we can start identifying different joints. It's going to be really interesting to watch the performance science area in the next few years. And that's where you'll need some big data tools to handle it all because that data is way bigger than any of the pitch data we have now.
Editor's note: This Q&A has been edited for clarity and conciseness.