Databricks' lakehouse capabilities helped the Texas Rangers win the 2023 World Series.
Nothing, of course, was more critical to the team's success than the performance of the 26 players on the roster each game and the decisions made throughout the season by manager Bruce Bochy and his staff.
But Databricks' lakehouse platform powered Texas' use of analytics to inform general manager Chris Young and the decisions that led to the construction of the Rangers' roster. And it fueled the analytics to inform the decisions made by the players, Bochy and the Rangers' coaching staff as the team played each opponent over the course of 162 regular season and 17 postseason games on their way to winning the franchise's first championship.
Based in San Francisco, Databricks is one of the pioneers of the lakehouse architecture for data management. Lakehouses combine the technologies of data warehouses, for structured data management, and data lakes, for unstructured data management, so that users can combine disparate data types to develop a more comprehensive view of their operations.
In addition, lakehouses are capable of handling data at massive scale, which makes them perhaps the ideal data management platform for the modern analytics, machine learning, traditional AI and generative AI needs of data-driven organizations.
The Rangers are one such organization. And a data management platform that can handle huge amounts of data and process that data in near real time is exactly what they needed.
Like most enterprises in recent years, Major League Baseball has seen its data volume explode since analytics first rose to prominence in the sport about two decades ago.
By 2021, the Rangers' existing data infrastructure was no longer able to accommodate all the data the team required to keep pace with the rest of the sport, according to Alexander Booth, the Rangers' assistant director of baseball R&D.
"There's been a huge organizational shift over the last five years in [thinking about] what data is and what it can be used for," he said. "We've revamped a lot of staff ... and a lot of those people are now driving the questions for the analysts to answer. It's been an awesome shift."
When analytics first became popular in baseball, it was centered around statistics such as on-base percentage and slugging percentage, simply calculated metrics that could be kept with pad and pen or in an Excel spreadsheet.
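To show how simple those early metrics are, the following is a minimal sketch of both calculations using the standard public formulas. The `BattingLine` class, its field names and the sample season are hypothetical and are not drawn from the Rangers' systems.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Hypothetical container for one hitter's counting stats."""
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int
    sacrifice_flies: int


def on_base_percentage(b: BattingLine) -> float:
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = b.at_bats + b.walks + b.hit_by_pitch + b.sacrifice_flies
    if denominator == 0:
        return 0.0
    return (b.hits + b.walks + b.hit_by_pitch) / denominator


def slugging_percentage(b: BattingLine) -> float:
    """SLG = total bases / AB."""
    if b.at_bats == 0:
        return 0.0
    singles = b.hits - b.doubles - b.triples - b.home_runs
    total_bases = singles + 2 * b.doubles + 3 * b.triples + 4 * b.home_runs
    return total_bases / b.at_bats


# Made-up season line, for illustration only.
season = BattingLine(at_bats=500, hits=150, doubles=30, triples=5,
                     home_runs=20, walks=60, hit_by_pitch=5, sacrifice_flies=5)
print(f"OBP: {on_base_percentage(season):.3f}")  # OBP: 0.377
print(f"SLG: {slugging_percentage(season):.3f}")  # SLG: 0.500
```

Both metrics need only a handful of season totals per player, which is why a spreadsheet was enough to hold them.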
In the ensuing years, organizations began developing such statistics as expected batting average on balls in play and fielding independent pitching. In addition, they more closely tracked the tendencies of opposing players, both to improve the odds of defensive success through positional shifts and to anticipate an opposing pitcher's propensity for throwing a particular pitch in a given circumstance.
Even that data, however, wasn't so voluminous that it couldn't be stored in spreadsheets and traditional databases.
Then, over the past handful of years, came motion capture. And existing data management tools were quickly overwhelmed.
Now, teams such as the Rangers position cameras throughout their stadiums to track every movement made by every player on the field as well as the movement of the ball from the instant it leaves the pitcher's hand to the moment a play is over. Thousands of images are captured per second of action.
The data provided by those images as well as data provided to all franchises by Major League Baseball enables teams to develop metrics that would have previously been impossible.
Teams can track the spin rate of a pitch out of the pitcher's hand to understand which pitches have the most potential for vertical or horizontal movement, the path of a hitter's bat to see whether the batter is giving themselves the best opportunity to make contact, and the route of a fielder to a batted ball to know whether they're taking the optimal path to make a catch, among other things.
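As one illustration of how tracked positions could become a metric, the sketch below scores a fielder's route as the ratio of the straight-line distance to the distance actually covered. This is a simplified, assumed definition for illustration; the article does not describe how the Rangers evaluate routes, and the function names and coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) field position in feet; hypothetical units


def path_length(points: Sequence[Point]) -> float:
    """Sum of the distances between consecutive tracked positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def route_efficiency(points: Sequence[Point]) -> float:
    """Straight-line distance divided by distance travelled (1.0 = direct)."""
    if len(points) < 2:
        raise ValueError("need at least two tracked positions")
    travelled = path_length(points)
    if travelled == 0:
        return 1.0  # the fielder never moved
    return math.dist(points[0], points[-1]) / travelled


# A fielder who runs 3 feet sideways, then 4 feet back, covers 7 feet
# to reach a spot that was 5 feet away in a straight line.
print(round(route_efficiency([(0, 0), (3, 0), (3, 4)]), 3))  # 0.714
```

With cameras sampling positions many times per second, each play yields a long list of points, which hints at how quickly the underlying data grows.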
All that data needs to be ingested, integrated, prepared and analyzed to be of value.
Before 2021, the Rangers were using on-premises data management tools from Microsoft and Oracle. Those tools were adequate until baseball teams started collecting hundreds of thousands of images -- each an unstructured data point -- per game. Once data volume exploded, they no longer sufficed.
"In 2021, we had an issue where we couldn't ingest all this data," Booth said. "Specifically, we couldn't ingest all the biomechanics data. We started looking around for solutions."
The Rangers needed a cloud-based platform, Booth continued. And they needed a platform that could handle both structured and unstructured data and manage all that data at scale.
Otherwise, they risked being at a competitive disadvantage as baseball analytics continued to evolve.
Booth noted that the Rangers were aware that the Minnesota Twins -- a small-market franchise with a much smaller budget than big-market teams such as those from New York, Los Angeles, Chicago, Boston and Philadelphia -- were having success using Databricks' lakehouse to manage their data.
Analytics has proved to be an equalizer in baseball.
While big-market franchises including the Yankees, Dodgers and Red Sox have historically outspent others such as the Oakland Athletics, Tampa Bay Rays and Twins, those small-market franchises have been able to remain competitive by using analytics to make smart decisions.
Each of those teams, despite payrolls sometimes less than a quarter of those of the highest-spending teams, has won multiple division titles over the past decade. In 2015, the small-market Kansas City Royals won the World Series despite spending more than $100 million less on their roster than the Dodgers and Yankees.
The Rangers historically fall somewhere in the middle in terms of payroll. In 2022, when the team won just 68 games, Texas ranked 16th, according to data repository The Baseball Cube. Last year, when the Rangers improved by 22 regular-season victories and won the World Series, they still ranked just ninth in terms of payroll.
"Baseball is a very copycat sport," Booth said. "When one team has success with a vendor, it's a good motivator."
Nevertheless, when Texas determined that it needed more advanced data management tools, Databricks was just one of the vendors the franchise looked at. In addition, the Rangers experimented with tools from AWS and Snowflake, according to Booth.
The Rangers are partners with AWS and still use AWS tools as part of their data stack. But in the Rangers' view, neither AWS nor Snowflake measured up to Databricks when the team ran proof-of-concept testing.
In particular, they found that the Databricks lakehouse was better for connecting data sets and curating machine learning models.
"Snowflake was cost-prohibitive, and AWS was not the most user-friendly," Booth said.
Databricks, meanwhile, proved able to scale without costs spiraling out of control, and its lakehouse was accessible to the Rangers' data team.
"Databricks stood out to us," Booth said. "We weren't the first franchise to start using Databricks -- that belongs to the Twins. In building out our new data lakehouse, Databricks was the perfect solution."
Now, three years after first adopting Databricks, the Rangers use the platform not only to help inform decisions at the major league level, but also to inform amateur and international scouting, as well as player development in the minor leagues.
In addition, Booth noted that the business side of the franchise is running into some of the same limitations of on-premises tools that the baseball operations team encountered before 2021 and is beginning its own transition to Databricks.
The Rangers' current data stack includes various databases in which data gets cloned, Amazon Simple Storage Service for storage, Databricks for data transformation and machine learning, and finally Tableau for data visualization.
Databricks in operation
While the Rangers use data to inform decisions, analytics is only part of the decision-making process, according to Booth.
"What we do is make informed decisions using as much information as we possibly can, whether that's coming from domain expertise -- scouts, coaches and players who have experienced the game -- or the data that's coming from the models and KPIs," he said.
Frequently, the data products developed using Databricks' lakehouse back up what Young, Bochy, coaches and players already glean by watching the game, Booth added.
"It really helps justify some decisions and pairs really well with some of the recommendations and observations that coaches and scouts are making," he said. "Analytics aren't replacing or overriding decisions. Analytics are a tool to help our decisions, help our observations be more confident."
From a personnel standpoint, those decisions begin with amateur and international scouting, where teams identify the players they want to draft and sign; international players are not subject to the Major League Baseball draft. Evan Carter, an outfielder taken in the second round of the 2020 draft who made his major league debut last season, is one player analytics helped the Rangers evaluate.
The decisions continue through the development phase of a player's evolution in the minor leagues. Third baseman Josh Jung, who hit 23 home runs as a 25-year-old rookie in 2023, was drafted by the Rangers and spent four years in the franchise's minor league system before becoming a fixture in the Rangers' lineup last season.
Finally, the decisions include which free agents to target and what trades to make to ultimately build the roster. Corey Seager, Adolis García, Marcus Semien and Nathan Eovaldi were all key signees over the past few years who helped turn the Rangers around. Max Scherzer and Jordan Montgomery were midseason trade targets who played key roles in the team's run to the World Series.
Meanwhile, Booth noted that a key statistical undertaking in recent years had to do with defensive positioning.
"A lot of the credit with all the defensive plays that were made during the World Series goes to the talent of Semien and Seager," Booth said. "But I like to think that our defensive positioning put them in the right spot to be more likely to make some of the amazing defensive plays. That's been really cool to see."
In addition, Booth said the franchise's commitment to analytics and ability to provide useful information to players has spurred curiosity. Players now come to the baseball operations team with questions rather than just wait for Tableau reports to be fed to them.
"A lot of the questions that drive our new KPIs are coming from coaches, players and other staff," Booth said. "We have to be able to create those KPIs based on their questions. Our old stack would not have been able to calculate those KPIs. Without Databricks, we would be unable to translate data at scale."
In the six years since Booth joined the Rangers, the franchise's R&D team has grown from four people to 24, reflecting the emphasis Texas now places on analytics.
During those six years, in addition to deploying motion capture technology to develop new metrics, the Rangers have used Databricks to develop their own version of wins above replacement (WAR), which aims to show how many more wins a player is worth than a replacement-level player, meaning a readily available substitute such as a minor league call-up. The statistic has been calculated in one form or another for more than a decade, but the Rangers use their own formula that differs from publicly available WAR measurements provided by Baseball Reference and ESPN, among others.
Another metric is called Stuff+, which aims to quantify pitch quality. Pitches are commonly referred to as "stuff," and Stuff+ uses measurements including spin rate, velocity and movement to place a value on the chances of a pitch being successful.
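The Rangers' formula for Stuff+ is not public, so the following is only a toy sketch of the general idea: compare a pitch's raw measurements with a league sample and express the result on an index where 100 is average. The feature names, the weights and the 10-points-per-standard-deviation scaling are all assumptions made for illustration. Real pitch-quality models are generally fitted to pitch outcomes rather than built from hand-picked weights.

```python
from statistics import mean, pstdev
from typing import Dict, List

Pitch = Dict[str, float]

# Hypothetical weights; a real model would learn these from outcomes.
WEIGHTS = {"velocity_mph": 0.5, "spin_rate_rpm": 0.2, "movement_in": 0.3}


def stuff_plus_like(pitch: Pitch, league: List[Pitch]) -> float:
    """Toy index: 100 is league average, higher means better raw qualities."""
    score = 0.0
    for feature, weight in WEIGHTS.items():
        values = [p[feature] for p in league]
        spread = pstdev(values)
        # Standardize the measurement against the league sample.
        z = 0.0 if spread == 0 else (pitch[feature] - mean(values)) / spread
        score += weight * z
    return 100 + 10 * score  # assumed scaling: 10 points per weighted z


# Made-up two-pitch "league" sample, for illustration only.
league_sample = [
    {"velocity_mph": 90.0, "spin_rate_rpm": 2000.0, "movement_in": 10.0},
    {"velocity_mph": 94.0, "spin_rate_rpm": 2400.0, "movement_in": 14.0},
]
average_pitch = {"velocity_mph": 92.0, "spin_rate_rpm": 2200.0, "movement_in": 12.0}
print(stuff_plus_like(average_pitch, league_sample))  # 100.0
```

Because the index depends only on measured pitch characteristics and not on results against a given level of competition, the same scale can be applied to amateur, international and major league pitchers alike.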
"The idea is that the raw qualities of a pitch are more likely to lead to success when put into a game environment," Booth said.
Stuff+ has the potential to be particularly valuable when predicting the success of international and amateur players.
"We now have metrics and KPIs that can give us confidence that a player's skills will play at the major league level -- not just because they're [dominating their competition], but because the quality of their stuff is similar enough to the quality stuff at the major league level," Booth said. "That should inspire confidence that their pitches will play."
Looking ahead, the Rangers plan to make use of generative AI.
Databricks has invested heavily in generative AI over the past year through the acquisition of MosaicML and the introduction of a suite of tools including vector search to aid generative AI development.
Generative AI is good at analyzing text. Baseball teams, meanwhile, generate thousands of reports over the course of a season. Generative AI has the potential to help the Rangers more quickly make sense of all those reports than a human can.
"GenAI can quickly look up, find and summarize reports and phrases, and make them consumable," Booth said. "We've created some GenAI tools already around language, and we're open to expanding our use cases there."
Generative AI is also good at generating new code and debugging problematic code, he continued.
Meanwhile, the Rangers' use of Databricks' traditional lakehouse tools to develop more KPIs will continue to increase, according to Booth.
The franchise is now undertaking a project to track weather data and examine how wind speed, temperature, humidity and barometric pressure affect every batted ball at the Triple-A minor league level and major league level.
In addition, biomechanical tracking is beginning to extend down to the high school level. That means thousands of new sources, each producing hundreds of thousands of new data points.
"I want our team to be able to take in any data over the next five years and quickly make it available to stakeholders," Booth said. "If other teams feel there's too much data and they can't track biomechanics or wind speeds, that's our advantage. We believe Databricks is the tool that is going to make us future-resistant."
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.