GPUs have attracted a lot of attention as the optimal vehicle to run AI workloads. Most of the cutting-edge research seems to rely on the ability of GPUs and newer AI chips to run many deep learning workloads in parallel. However, the trusty old CPU still has an important role in enterprise AI.
"CPUs are cheap commodity hardware and are present everywhere," said Anshumali Shrivastava, assistant professor in the department of computer science at Rice University. On-demand CPU instances in the cloud cost significantly less than GPU instances, and IT shops are more familiar with setting up and optimizing CPU-based servers.
CPUs have long held the advantage for certain kinds of AI algorithms that involve logic or heavy memory requirements. Shrivastava's team has been developing a new category of algorithms, called SLIDE (Sub-LInear Deep learning Engine), which promises to make CPUs practical for a wider range of AI workloads.
"If we can design algorithms like SLIDE that can run AI directly on CPUs efficiently, that could be a game-changer," he said.
Early results showed that even workloads that are a natural fit for GPUs could be trained up to 3.5 times faster on CPUs.
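SLIDE avoids computing every neuron in a layer: it uses locality-sensitive hashing (LSH) to retrieve a small set of neurons likely to have large activations for a given input and computes only those. The sketch below illustrates that general idea under simplifying assumptions: a single hash table, SimHash (signed random projections) as the hash family, a ReLU layer, and no table updates during training. The function names (`simhash`, `build_table`, `sparse_forward`) and the layer sizes are hypothetical. The actual SLIDE engine is a multithreaded C++ system that uses several hash tables and other hash families, so treat this as an illustration of the technique, not its implementation.

```python
import random


def simhash(vec, planes):
    """K-bit signature: bit i is set if vec lies on the positive side of hyperplane i."""
    sig = 0
    for i, plane in enumerate(planes):
        if sum(a * b for a, b in zip(vec, plane)) >= 0:
            sig |= 1 << i
    return sig


def build_table(weights, planes):
    """Bucket every neuron's weight vector by its SimHash signature."""
    table = {}
    for neuron_id, w in enumerate(weights):
        table.setdefault(simhash(w, planes), []).append(neuron_id)
    return table


def sparse_forward(x, weights, table, planes):
    """Compute ReLU activations only for neurons in the input's bucket."""
    active = table.get(simhash(x, planes), [])
    return {j: max(0.0, sum(a * b for a, b in zip(x, weights[j]))) for j in active}


# Example: 1,000 neurons with 16-dimensional weights and K = 6 hyperplanes.
rng = random.Random(0)
DIM, N_NEURONS, K = 16, 1000, 6
planes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]
weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_NEURONS)]
table = build_table(weights, planes)

x = weights[42]  # an input identical to neuron 42's weights hashes to its bucket
acts = sparse_forward(x, weights, table, planes)
print(len(acts), "of", N_NEURONS, "neurons computed")
```

With K = 6 there are 2^6 = 64 possible buckets, so a lookup touches only a small fraction of the 1,000 neurons instead of the whole layer. Using more tables raises the chance of retrieving the truly high-activation neurons at the cost of more lookups.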
Shrivastava believes we might be at an inflection point in AI development. The early work in AI started with small models and relatively small data sets. As researchers developed larger models and larger data sets, they had enough workload to use the massive parallelism in GPUs effectively. But now the size of the models and the volume of the data sets have grown beyond what GPUs can handle efficiently.
"At this point, training the traditional AI algorithm itself is prohibitive [in terms of time and resources]," Shrivastava said. "I think in the future, there will be many attempts to design cheaper alternatives for efficient AI at scale."
GPUs best for parallel processing
Shrivastava said GPUs became the preferred vehicle for training AI models because the process inherently requires performing an almost identical operation on all the data samples simultaneously. As data sets grew, the massive parallelism available in GPUs proved indispensable: GPUs provide impressive speedups over CPUs when the workload is large enough and easy to run in parallel.
On the flip side, GPUs have smaller and more specialized memories. Currently, the best GPU on the market, the Nvidia Tesla V100, has a memory capacity of 32 GB. If a computation does not fit in GPU memory, it slows down significantly. The same specialized memory that reduces latency for many concurrent threads on GPUs becomes a limitation.
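To make the 32 GB limit concrete, the back-of-the-envelope estimate below checks whether a model's training state fits in GPU memory. It assumes FP32 values (4 bytes each) and an Adam-style optimizer that keeps weights, gradients and two moment buffers, i.e. four copies of every parameter. Activations and framework overhead are ignored, so the real footprint is higher. The function names and both example models, including the 100-million-row, 64-dimension embedding table, are hypothetical illustrations, not figures from the article.

```python
GPU_MEMORY_GB = 32.0  # Nvidia Tesla V100 (32 GB variant), as cited above


def training_footprint_gb(num_params, bytes_per_value=4, copies=4):
    """Lower-bound training memory in decimal GB (1 GB = 1e9 bytes).

    copies=4 assumes FP32 weights, gradients and two Adam moment buffers.
    Activations and framework overhead are not counted.
    """
    return num_params * bytes_per_value * copies / 1e9


def fits_on_gpu(num_params, gpu_memory_gb=GPU_MEMORY_GB, **kwargs):
    """True if the estimated footprint is within the given GPU memory."""
    return training_footprint_gb(num_params, **kwargs) <= gpu_memory_gb


# Hypothetical 1-billion-parameter dense model: 16 GB, fits.
dense_params = 1_000_000_000
print(training_footprint_gb(dense_params), fits_on_gpu(dense_params))

# Hypothetical embedding table, 100 million rows x 64 dims: 102.4 GB, does not fit.
embedding_params = 100_000_000 * 64
print(training_footprint_gb(embedding_params), fits_on_gpu(embedding_params))
```

When the estimate exceeds the card's capacity, the model has to be split across several GPUs or partly kept in host memory, and the extra data movement is where the slowdown comes from. A CPU server can often hold such a table in its own main memory.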
CPUs for sequential algorithms
Figuring out how to run more efficient AI algorithms on CPUs rather than GPUs "will drastically expand the market for the application of AI," said Bijan Tadayon, CEO of Z Advanced Computing, which develops AI for IoT applications. Having a more efficient algorithm also reduces power requirements, making it more practical for applications like drones, remote equipment or mobile devices.
CPUs are also often a better choice for algorithms that perform complex statistical computations, such as natural language processing (NLP) and some deep learning algorithms, said Karen Panetta, an IEEE Fellow and the dean of graduate engineering at Tufts University. For instance, robots and home devices that use simple NLP work well using CPUs. Other tasks, like image recognition or simultaneous localization and mapping (SLAM) for drones or autonomous vehicles, also work on CPUs.
In addition, algorithms like Markov models and support vector machines typically run on CPUs. "Moving these to GPUs requires parallelization of the sequential data and this has been challenging," Panetta said.
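Panetta does not name a specific Markov model, but a hidden Markov model (HMM) is a common example, and its forward algorithm shows why sequential data resists parallelization: the probability vector for step t is computed from the vector for step t-1. The sketch below is the textbook forward algorithm in plain Python; the function name and the two-state example parameters are hypothetical.

```python
def hmm_forward(init, trans, emit, observations):
    """Forward algorithm for a discrete hidden Markov model.

    init[i]     : P(state_0 = i)
    trans[i][j] : P(state_t = j | state_{t-1} = i)
    emit[i][o]  : P(observation = o | state = i)
    Returns P(observations) summed over all hidden state paths.
    """
    n = len(init)
    # Step 0: initial distribution weighted by the first observation's likelihood.
    alpha = [init[i] * emit[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        # Each step needs the complete alpha from the previous step, so in this
        # form the time steps run strictly one after another. Only the work
        # inside a single step (the sum over i for each j) is independent.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs]
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state model with two possible observations (0 and 1).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]
print(hmm_forward(init, trans, emit, [0, 1]))  # approximately 0.2156
```

A GPU can speed up the inner sum when there are many hidden states, but with a handful of states and long sequences the strictly ordered outer loop dominates, which is the kind of workload that stays on CPUs.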
Rethink AI models
Traditional AI methods rely heavily on statistics and math. As a result, they tend to work most effectively on GPUs designed to process many calculations in parallel.
"Statistical models are not only processor-intensive, they are also rigid and do not handle dynamics well," said Rix Ryskamp, CEO of UseAIble.
Many companies are finding ways to use CPUs to streamline this work. UseAIble, for example, has developed a system it calls the Ryskamp Learning Machine -- after its CEO -- that cuts calculation requirements by relying on logic to eliminate the need for statistics. The algorithm does not use weights in its neural network, eliminating the primary reason neural networks need heavy GPU calculations as well as reducing black box problems.
Ryskamp believes machine learning architects need to hone their skills so they rely less on statistical models that require heavy GPU workloads.
"To get new results and use different types of hardware, including IoT and other edge hardware, we need to re-think our models, not just repackage them," he said. "We need more models that use the processors that are already widely available, whether those are CPUs, IoT boards or any other hardware already in place with customers."
CPUs strive to be the AI workhorse
Intel is seeing interest from enterprises in running many types of AI workloads on its Xeon CPUs, said Eric Gardner, director of AI marketing at the firm. Companies in the retail, telecom, industrial and healthcare industries are already running AI on CPUs in large-scale deployments.
Use cases include AI for telemetry and network routing, object recognition in CCTV cameras, fault detection in industrial pipelines and object detection in CT and MRI scans. CPUs work better for algorithms that are hard to run in parallel or for applications that require more data than can fit on a typical GPU accelerator. Among the types of algorithms that can perform better on CPUs are:
- recommender systems for training and inference that require larger memory for embedding layers;
- classical machine learning algorithms that are difficult to parallelize for GPUs;
- recurrent neural networks that use sequential data;
- models using large data samples, such as 3D data, for training and inference; and
- real-time inference for algorithms that are difficult to parallelize.