Evaluate Google Cloud TPUs for machine learning apps

Machine learning apps have prompted a paradigm shift in the world of public cloud. Google is an early market leader with its TPUs and TensorFlow, but can it stay on top?

Paul Korzeniowski

Published: 18 Jul 2018

To support a growing number of machine learning workloads in the cloud, Google has rolled out special-purpose microprocessors as part of its Compute Engine platform. The move was one of many recent examples of the major cloud providers racing to support AI applications on their infrastructure.

"A confluence of market and technology factors is forcing vendors, like Google, to develop high-performance cloud services," said Karl Freund, consulting lead for high-performance computing and deep learning at Moor Insights & Strategy.

Google designed its chips, called Cloud Tensor Processing Units (TPUs), to improve the performance of machine learning apps that use the open source TensorFlow framework. And while Google Cloud TPUs won't be a fit for all enterprises, they do offer a number of benefits over CPUs and GPUs for training and deploying machine learning models.

A change in design

For decades, Intel CPUs supported most data center workloads. If IT had to deploy a high-performance application, such as an e-commerce system, it could fairly easily and cost-effectively add enough of those Intel processors to run the workload.

Machine learning apps, however, changed that. Unlike more traditional enterprise applications, which largely work with structured data that CPUs handle well, machine learning apps rely on large, unstructured data sets that can create more sporadic workload patterns and chew up processing cycles.

"Machine learning applications, especially deep learning, work with extremely large volumes of data -- billions or even trillions of bits," Freund said.

What's more, the end results of machine learning system processing differ from those of legacy applications, said Alan Priestley, research director at Gartner.


"Traditional applications were rules-based and relied on yes or no results," Priestley said. "Machine learning produces probabilistic output -- [for example,] there is a 94% certainty that an object is a guitar, 3% that it is a dog and 3% that it is something else."
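To make the idea of probabilistic output concrete, here is a minimal Python sketch of a softmax function, a common way for a classifier to turn raw scores into probabilities. The labels and scores are hypothetical, picked only so that the result lands near the split Priestley describes; they do not come from any real model.

```python
import math


def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    # Subtract the largest score first so math.exp cannot overflow.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and raw scores, used only for illustration.
labels = ["guitar", "dog", "something else"]
scores = [4.0, 0.5, 0.5]

probabilities = softmax(scores)
for label, p in zip(labels, probabilities):
    print(f"{label}: {p:.1%}")

# The "answer" is simply the most probable label, not a yes-or-no rule.
best_label = labels[probabilities.index(max(probabilities))]
print("Most likely:", best_label)
```

A rules-based program would return a single yes or no. Here, every label gets a share of the probability, and the application decides how much confidence is enough to act on.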

Because machine learning apps have these unique requirements, traditional CPUs and even GPUs -- which have gained traction for larger, compute-intensive workloads -- don't always support them sufficiently. As a result, many IT teams ran these applications on specially designed and expensive supercomputers. The proliferation of cloud, however, has changed that, giving enterprises on-demand access to technology that IaaS providers design specifically for AI and machine learning apps.

Google Cloud TPUs, alongside its other machine learning services, play into the provider's larger vision to become a dominant player in this market.

"Moving forward, Google views itself as an artificial intelligence [and] machine learning vendor," Freund said.

The limitations of TensorFlow, Google Cloud TPUs

Google designed its TPU chips, which it used internally before making them available as part of its cloud platform, with the open source TensorFlow framework in mind. Moving forward, a raft of special-purpose applications is expected to be built around TensorFlow, as industries ranging from automotive to manufacturing embrace machine learning.

Still, developers and IT teams who work with these technologies face some challenges. For example, the number of development, monitoring and management tools available for Google Cloud TPUs and TensorFlow is still somewhat limited. And, in some cases, enterprises might need to integrate machine learning apps with legacy systems -- a task that can be daunting, given this lack of more mature tool sets.

Cost could be another deterrent for some organizations. Google charges $6.50 per Cloud TPU per hour, on top of the cost of the Compute Engine VM to which the TPU needs to connect. The vendor offers a lower-cost option in its pre-emptible TPUs -- $1.95 per TPU per hour -- but the trade-off is that Google can terminate these TPUs at any time, based on resource demands.

By comparison, Google's lowest-end GPU model -- the NVIDIA Tesla K80 -- starts at $0.45 per die per hour for the on-demand option and $0.135 per die per hour for the pre-emptible option. As with the TPUs, Google charges users separately for the VM instances that connect to the GPUs.
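The arithmetic behind these prices can be sketched in a few lines of Python. The hourly rates are the ones quoted in this article as of publication and may have changed since. The `job_cost` helper and its `vm_price_per_hour` parameter are hypothetical, because the article gives no figure for the attached VM.

```python
# Hourly prices as quoted in this article (July 2018); current rates may differ.
PRICES_PER_HOUR = {
    ("tpu", "on_demand"): 6.50,
    ("tpu", "preemptible"): 1.95,
    ("k80_die", "on_demand"): 0.45,
    ("k80_die", "preemptible"): 0.135,
}


def job_cost(accelerator, option, hours, units=1, vm_price_per_hour=0.0):
    """Estimate the cost of a job: accelerator charge plus the attached VM.

    vm_price_per_hour is a placeholder; the article does not state VM prices.
    """
    accelerator_cost = PRICES_PER_HOUR[(accelerator, option)] * units * hours
    vm_cost = vm_price_per_hour * hours
    return accelerator_cost + vm_cost


# Example: a 10-hour job, ignoring the VM charge.
for key in PRICES_PER_HOUR:
    accelerator, option = key
    print(f"{accelerator} ({option}): ${job_cost(accelerator, option, 10):.2f}")

# Ratios implied by the quoted prices.
tpu_discount = PRICES_PER_HOUR[("tpu", "preemptible")] / PRICES_PER_HOUR[("tpu", "on_demand")]
tpu_vs_k80 = PRICES_PER_HOUR[("tpu", "on_demand")] / PRICES_PER_HOUR[("k80_die", "on_demand")]
print(f"Pre-emptible TPU as a share of on-demand: {tpu_discount:.0%}")
print(f"On-demand TPU vs. on-demand K80 die: {tpu_vs_k80:.1f}x")
```

At these list prices, the pre-emptible rate works out to 30% of the on-demand rate for both the TPU and the K80 die, and an on-demand TPU costs roughly 14 times as much per hour as an on-demand K80 die. The comparison covers price only; it says nothing about how much work each device completes in that hour.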

Lastly, while Google has made significant strides in the cloud machine learning market, it's by no means alone. It faces tough competition not only from the other major public cloud providers, such as AWS and Microsoft, but also from companies like IBM and a plethora of startups that target this emerging market. Which vendors ultimately gain the most acceptance among enterprise users remains to be seen.
