Statistics have been a significant part of the baseball experience since the game's inception, but with the introduction of Statcast in 2015 and its migration to the Google Cloud Platform in 2020, Major League Baseball has been able to transform the data and insights it can deliver to fans and teams alike.
For decades, numbers like 61 (Roger Maris' single-season home run record), 755 (Henry Aaron's career home run record) and .406 (Ted Williams' batting average in 1941 when he became the last player to hit over .400) have been ingrained in fans' minds, never mind that Maris' and Aaron's marks were surpassed during the steroid era.
Whether consumed in box scores on the sports pages of local newspapers or on the backs of baseball cards, statistics were tied to baseball perhaps more closely than to any other sport.
But the delivery and consumption of baseball statistics began to change about 20 years ago.
MLB's technological evolution
In 2001, MLB introduced live play-by-play on MLB.com so fans could follow games on their computers and see live box scores. Five years later, it added three-dimensional pitch tracking capabilities to tell fans in the ballpark and others watching on televisions and other devices the speed and type of pitch being thrown.
Then in 2015, MLB installed Statcast, a tracking technology that enabled MLB to collect and analyze data in ways it couldn't in the past. With the ability to track the movements of players and the ball while in play, new analyses were suddenly available.
"This was able to capture every movement of the ball, including home-run distance," Rob Engel, senior director of software engineering for Statcast data at Major League Baseball, said in a breakout session during Google Cloud Next '21, a virtual conference hosted by the tech giant.
"Additionally, player tracking allowed us to gain new insights about who the fastest players were, who had the best arm strength among outfielders, the probability a ball would be caught as it's hit in the air and who the best blocking catchers are," he continued.
Finally, in 2020, MLB adopted Google Cloud Platform to power Statcast, and with its migration to Google came added analytics capabilities.
MLB now does what it terms 3D pose tracking, with sensors tracking 18 points on every player's body and taking 30 images from each of those sensors per second.
That data from Statcast -- 540 images per player, per second plus pitch and batted-ball information -- is then ingested into Google Cloud and almost instantaneously delivered to fans online, to announcers broadcasting and analyzing the action, and even to the 30 clubs themselves.
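The per-player figure is simple multiplication: 18 tracked points times 30 captures per second. The short Python sketch below reproduces that arithmetic. The function names and the count of players on the field are illustrative assumptions, not anything MLB has described.

```python
# Figures quoted in the article.
POINTS_PER_PLAYER = 18    # body points tracked on each player
CAPTURES_PER_SECOND = 30  # images taken per tracked point, per second


def images_per_player_per_second(points=POINTS_PER_PLAYER,
                                 rate=CAPTURES_PER_SECOND):
    """Return how many images are captured for one player each second."""
    return points * rate


def images_per_second_on_field(players_on_field):
    """Scale the per-player figure to a hypothetical number of players.

    players_on_field is an assumption supplied by the caller; the
    article does not say how many players are tracked at once.
    """
    return players_on_field * images_per_player_per_second()


if __name__ == "__main__":
    print(images_per_player_per_second())   # 18 * 30
    print(images_per_second_on_field(10))   # hypothetical 10 players
```

With a hypothetical 10 players tracked at once, the same arithmetic gives 5,400 images per second before any pitch or batted-ball data is added.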
"This allows you to render an entirely 3D immersive view of the game that's never been seen before," Engel said. "Think of it almost like a video game, and there are so many things we can do with this technology."
Using the Google suite
But despite what MLB can now do with Statcast and Google Cloud, there were challenges.
Tracking 18 points on every player through every movement of every game generates enormous amounts of data, and making that data insightful and actionable in real time requires a powerful infrastructure.
According to Engel, MLB ingests about 30 terabytes of data each season.
All that data needs to be curated and delivered to teams so they can discover insights that fuel in-game decisions and inform player personnel decisions. It also needs to be fed into gamecasts, statistical leaderboards and league standings in real time.
It has to be ingested, processed and delivered instantaneously to give context to the action, whether by showing the distance of a home run or by detecting that a record has been broken or that something has occurred for the first time.
"As soon as a ball is hit, we want to … provide context," Engel said. "We want to know if it's the longest ball that's ever been hit, or the longest hit by that player, and how many ballparks it would be a home run in. We're able to identify things about rookies, like that Steven Matz was the first pitcher to record three hits and drive in four runs in his first Major League game."
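The kind of context Engel describes can be pictured as comparing each new batted ball with what has come before. The Python sketch below is a toy illustration only, not MLB's implementation. The record type, the function name and the sample park names are all hypothetical. It also treats a home run as any ball that travels farther than a single fence distance per park, which ignores direction, wall height and trajectory.

```python
from dataclasses import dataclass


@dataclass
class BattedBall:
    """Hypothetical record for one batted ball."""
    batter: str
    distance_ft: float


def describe_context(hit, history, fence_distance_ft):
    """Compare a new hit against earlier hits and a table of parks.

    history is a list of earlier BattedBall records.
    fence_distance_ft maps a park name to one fence distance in feet,
    a deliberate simplification made for this sketch.
    """
    all_distances = [h.distance_ft for h in history]
    batter_distances = [h.distance_ft for h in history
                        if h.batter == hit.batter]
    return {
        # True when the hit beats every earlier hit by anyone.
        "longest_on_record": all(hit.distance_ft > d for d in all_distances),
        # True when the hit beats every earlier hit by the same batter.
        "longest_for_batter": all(hit.distance_ft > d
                                  for d in batter_distances),
        # Number of parks whose fence the ball would have cleared.
        "parks_cleared": sum(1 for d in fence_distance_ft.values()
                             if hit.distance_ft > d),
    }


if __name__ == "__main__":
    earlier = [BattedBall("Player A", 430.0), BattedBall("Player B", 470.0)]
    parks = {"Park 1": 400.0, "Park 2": 440.0, "Park 3": 460.0}
    print(describe_context(BattedBall("Player A", 450.0), earlier, parks))
```

In this made-up data, a 450-foot hit by Player A is that player's longest but not the longest on record, and it clears two of the three sample parks.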
Beyond just providing context, however, MLB wants to provide it in a visually interesting and digestible format, he added.
"All this data is excellent, and we data nerds love it, but we also want to present it visually to our fans," Engel said. "We have a big suite of products that take this data we're generating and deliver it in real time through a vast array of APIs."
That suite of products, used to manage the 30 terabytes of data Statcast collects over six months and make it actionable in real time, includes:
- Google Anthos, an on-premises Kubernetes engine to record and process Statcast data in ballparks as it's collected and then send it to Google's cloud;
- BigQuery for real-time query results that enable teams and MLB itself to discover trends;
- Looker for business intelligence and visual analytics that lead to business decisions by individual teams and MLB; and
- database capabilities including Bigtable to manage large workloads and Cloud SQL Postgres to quickly replicate data across regions.
Partnership and potential
MLB chose Google Cloud in 2020 because of Google's lengthy history managing big data, the trust developed through its partnership with Google, and the cost of the platform, according to Engel.
The result, he continued, is that MLB's analytics capabilities are among the most advanced in all of sports. Among the four major North American sports, the National Football League and National Hockey League each use AWS to deliver advanced statistical analysis while the National Basketball Association has a partnership with Microsoft.
"MLB has forever had first-in-class digital products, and we want to keep up the pace and continue to innovate in this space," Engel said. "We want to compete with other sports. Baseball has a huge, rich history in statistics, and we have great products that visualize this. Now that we're recording 3D player tracking in real time, think about all that we can do with this."
One of those things, he continued, could be virtual reality.
Someone could theoretically put on a headset and, based on the data captured by Statcast and turned around in real time by Google Cloud, "watch" a game from not only any seat in the stands but from a spot on the playing field itself.
"The possibilities are truly limitless," Engel said. "This technology is going to be extremely exciting in the years to come."