
AWS launches P4d instances for deep learning training

AWS released its EC2 P4d instances, the tech giant's newest GPU-backed instances. The instances speed up AI training and supercomputing workloads.

AWS on Nov. 3 released Amazon Elastic Compute Cloud P4d instances, its newest GPU-backed instance type.

Enterprises can use the instances, powered by the latest Intel Cascade Lake processors and eight of Nvidia's A100 Tensor Core GPUs connected by Nvidia's NVLink, to speed up machine learning training and high-performance computing workloads. The instances are generally available now.

Available in one size (p4d.24xlarge), the new instances are capable of 2.5 petaflops of floating-point performance and include 320 GB of high-bandwidth GPU memory. They also include 1.1 TB of system memory and 8 TB of NVMe-based SSD storage.
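For a sense of scale, the instance-level figures can be divided across the eight GPUs. The following is a minimal sketch that uses only the numbers quoted above; the variable and function names are illustrative, and it assumes the totals split evenly across the GPUs. The article does not say which numeric precision the 2.5 petaflops figure refers to.

```python
# Figures quoted in the article for one p4d.24xlarge instance.
GPUS_PER_INSTANCE = 8
TOTAL_GPU_MEMORY_GB = 320
TOTAL_PETAFLOPS = 2.5


def per_gpu_share(total, gpus=GPUS_PER_INSTANCE):
    """Split an instance-wide total evenly across its GPUs (an assumption)."""
    return total / gpus


# 1 petaflop = 1,000 teraflops.
memory_per_gpu_gb = per_gpu_share(TOTAL_GPU_MEMORY_GB)
teraflops_per_gpu = per_gpu_share(TOTAL_PETAFLOPS * 1000)

print(f"GPU memory per A100: {memory_per_gpu_gb:.0f} GB")
print(f"Peak throughput per A100: {teraflops_per_gpu:.1f} teraflops")
```

Under that even-split assumption, each A100 accounts for 40 GB of GPU memory and roughly 312.5 teraflops of the quoted peak.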

Performance increases

The Amazon Elastic Compute Cloud (Amazon EC2) P4d instances offer notable performance increases over AWS' P3 instances, made generally available in 2017.

The P3 instances, which came in three sizes, had 1-8 Nvidia Tesla V100 GPUs, 16-128 GB of GPU memory, 8-64 vCPUs and 61-488 GB of instance memory. Three years later, the P4d instances vastly outperform the P3 instances, offering up to 2.5x the deep learning performance and up to 60% lower cost to train, according to AWS.
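To make AWS' "up to" claims concrete, here is a rough sketch that applies them to a made-up P3 training job. The baseline of 10 hours and $1,000 is purely hypothetical, and the results are best-case bounds under AWS' stated figures, not measured numbers.

```python
# AWS' claimed best-case improvements of P4d over P3, as quoted in the article.
MAX_SPEEDUP = 2.5          # "up to 2.5x the deep learning performance"
MAX_COST_REDUCTION = 0.60  # "up to 60% lower cost to train"


def best_case_hours(p3_hours):
    """Shortest training time implied by the claimed speedup."""
    return p3_hours / MAX_SPEEDUP


def best_case_cost(p3_cost):
    """Lowest training cost implied by the claimed cost reduction."""
    return p3_cost * (1 - MAX_COST_REDUCTION)


# Hypothetical P3 baseline, chosen only for illustration.
p3_hours, p3_cost = 10.0, 1000.0

print(f"Best-case P4d time: {best_case_hours(p3_hours):.1f} hours")
print(f"Best-case P4d cost: ${best_case_cost(p3_cost):.2f}")
```

Actual savings depend on the model, the framework and how well a job uses all eight GPUs, so real workloads may land well short of these bounds.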

With P4d instances, AWS "really understood how important the bandwidth is," providing 400 Gbps network bandwidth, said Peter Rutten, research director of infrastructure systems, platforms and technologies at IDC.


AWS' Elastic Fabric Adapter, Amazon FSx powered by the open source Lustre system and Nvidia GPUDirect RDMA enable users to link P4d instances together in EC2 UltraClusters. With EC2 UltraClusters, customers can scale the instances to over 4,000 A100 GPUs.
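The 4,000-GPU figure translates into a concrete instance count. The sketch below is a back-of-the-envelope calculation from the per-instance numbers in this article; the function names are illustrative, 4,000 is treated as a lower bound on cluster size, and terabytes are counted in decimal units (1 TB = 1,000 GB).

```python
import math

# Per-instance figures quoted in the article for p4d.24xlarge.
GPUS_PER_INSTANCE = 8
GPU_MEMORY_PER_INSTANCE_GB = 320


def instances_needed(target_gpus):
    """Smallest number of P4d instances that supplies at least target_gpus."""
    return math.ceil(target_gpus / GPUS_PER_INSTANCE)


def aggregate_gpu_memory_tb(instance_count):
    """Total GPU memory across the cluster, in decimal terabytes."""
    return instance_count * GPU_MEMORY_PER_INSTANCE_GB / 1000


count = instances_needed(4000)
print(f"Instances for 4,000 GPUs: {count}")
print(f"Aggregate GPU memory: {aggregate_gpu_memory_tb(count):.0f} TB")
```

By this estimate, a 4,000-GPU UltraCluster corresponds to 500 linked P4d instances and about 160 TB of combined GPU memory.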

That "gives you very powerful compute, much more than was possible with P3," Rutten said. "Basically, you're getting access to a supercomputer on AWS."

Unveiled in May 2020, Nvidia A100 GPUs are built on Nvidia's new Ampere GPU architecture and are considerably faster than chips based on Nvidia's older Volta architecture. The A100 is designed for high-performance computing and demanding AI inferencing and training workloads.


The GPU also backs the Accelerator-Optimized VM (A2) family on Google Compute Engine. Introduced in July, the A2 VMs, with up to 16 GPUs in a single VM, were the first A100-based offering in the public cloud.

"At the heart of these instances lies the A100; so, whether that's on AWS or Google, you get that performance," Rutten said.

Still, bandwidth and scalability are also critical, Rutten said.

Google, in a July blog post, said its A2 system based on Nvidia GPUs can offer bandwidth up to 600 Gbps. Yet Rutten said he thinks Google's offering scales to fewer GPUs than the more than 4,000 that AWS claimed.
