Meta is nearly finished building its own supercomputer for AI research.
Unveiled on Jan. 24 as a fully designed and nearly complete project, the AI Research SuperCluster (RSC) will be used to train large AI models in natural language processing and computer vision for research and development, Meta said.
The goal of RSC is to help Meta build new AI systems for real-time voice translation and research collaboration, and to develop new technologies for the metaverse, the emerging environment for augmented and virtual reality -- a market that Meta, formerly called Facebook, is seeking to dominate.
Meta released details about the project in a blog post. In an email to TechTarget, the tech giant said it is not disclosing the location of the supercomputer.
Meta's need for the AI supercomputer
Meta needs the RSC to undergird the tech giant's wide array of applications, said Gartner analyst Chirag Dekate.
Since Meta applications -- built around Facebook, Instagram and other platforms -- involve training huge deep learning models, Meta needs to power a large-scale ecosystem to constantly train, update and maintain the models, Dekate said.
Deep learning includes neural network models for image recognition, recurrent neural network models and LSTM (long short-term memory) for video recognition and speech translation.
"You need an AI supercomputer that is not just optimized for one type of model," Dekate said. "It needs to be able to manage a diverse set of use cases. It needs to be able to train different types of neural networks."
Taking advantage of Nvidia's GPU technology
The computing ecosystem that Meta has used up to now was more of a traditional GPU cluster, and the supercomputer gives the tech giant a larger, newer-generation GPU cluster, Dekate said.
"This is about leveraging the best-of-breed GPU technologies," Dekate said. "I think it enables curation of a shared platform, a shared ecosystem that can help accelerate Meta's diverse use cases."
In its current configuration, the RSC includes 760 DGX A100 systems from AI hardware and software vendor Nvidia that serve as compute nodes containing a total of 6,080 GPUs. The GPUs communicate through an Nvidia Quantum 200 Gbps InfiniBand two-level Clos fabric.
The system's storage capacity consists of 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems and 10 petabytes of Pure Storage FlashBlade.
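The reported configuration figures can be cross-checked with simple arithmetic. The sketch below is illustrative only. It assumes eight GPUs per DGX A100 system, which is Nvidia's published specification for that system rather than a figure stated here by Meta, and the constant and function names are hypothetical.

```python
# Figures reported for the RSC's current configuration.
DGX_A100_SYSTEMS = 760
REPORTED_GPUS = 6_080

# Assumption (not stated in the article): each DGX A100 system
# houses eight A100 GPUs, per Nvidia's published spec.
GPUS_PER_DGX_A100 = 8

# Reported storage tiers, in petabytes.
STORAGE_PB = {
    "Pure Storage FlashArray": 175,
    "Penguin Computing Altus (cache)": 46,
    "Pure Storage FlashBlade": 10,
}


def total_gpus(systems: int, gpus_per_system: int) -> int:
    """Return the GPU count implied by the number of compute nodes."""
    return systems * gpus_per_system


def total_storage_pb(tiers: dict) -> int:
    """Return the combined capacity of all storage tiers, in petabytes."""
    return sum(tiers.values())


if __name__ == "__main__":
    gpus = total_gpus(DGX_A100_SYSTEMS, GPUS_PER_DGX_A100)
    print(f"Implied GPU count: {gpus:,} (reported: {REPORTED_GPUS:,})")
    print(f"Combined storage: {total_storage_pb(STORAGE_PB)} PB")
```

Under that assumption, the node count matches the reported GPU total, and the three storage tiers add up to 231 petabytes.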
"Whatever cooling technique they choose will be important," said Ezra Gottheil, an analyst at Technology Business Research. "With that many GPUs burning away, this system is going to generate a lot of heat."
In the email, Meta said it values sustainability in terms of designing, building and maintaining facilities that are positive contributors to the community.
High-powered computing systems used for AI, cryptocurrencies and other applications have come under environmental criticism in recent years for their outsized energy consumption.
"We approach sustainability from the ground up -- from design and construction to energy sources, water stewardship, and responsibly managing the end of life of our equipment," the tech giant said.
Meta's partnership with Nvidia enables Meta to use a commoditized ecosystem stack, Dekate said. Nvidia GPUs support a range of deep learning frameworks, including TensorFlow, PyTorch and others.
Meta said its supercomputer will be solely for internal use and won't be marketed to outside organizations right now, as opposed to supercomputers from IBM and HPE-Cray aimed at commercial and government users.
Meta said it will continue building supercomputers to meet the needs of its researchers.
Exploring other options
Meanwhile, Dekate said he wouldn't be surprised if Meta is exploring alternative accelerator approaches privately.
It's also possible that Meta might decide a few years from now that the Nvidia GPU technology is not the best for its ecosystem, especially as different types of AI chip ecosystems become readily available to organizations. Those technologies could come from deep neural network vendors such as Graphcore and SambaNova, Dekate said.
A question of ethics
Meta's RSC is critical to the vendor scaling to $100 billion in revenue and beyond, said R "Ray" Wang, an analyst at Constellation Research.
He added that the AI models Meta currently uses are not adequate for the vendor's future ambitions in the metaverse and its core businesses, and that the supercomputer will help Meta build exponentially bigger models.
Although Meta said it plans to safeguard the data in the RSC, Wang said a big question is how Meta will deploy AI ethics and meet emerging expectations for AI such as transparency, explainability, reversibility, trainability and the ability to be led by humans.
Dan Miller, an analyst at Opus Research, also noted that a mention of ethics was missing from Meta's blog post.
"An investment needs to be made in avoiding bias in training models or algorithms that fuel AI-based functions," Miller said.
Dominating the metaverse
While Meta's AI supercomputer boasts impressive performance numbers, the vendor's objectives seem dated in a way, Miller said.
"It feels like Meta … plans to dominate AI in the metaverse by crunching more and more data," he said.
It would be better for organizations to do more with less and address more vertical or narrower use cases for technologies like NLP and search recognition, "which don't rely on huge amounts of processing power, but solve problems quickly," Miller added.
"If AI-based resources are going to do more and more functions to support our daily lives in the metaverse, we need to make them easy to understand, not create situations where they are doing billions of functions in huge server farms," Miller said.
Organizations that can't build supercomputers will have no choice but to obtain supercomputer processing from other vendors such as Google, Amazon or Microsoft.
"And so now the question is: Does my metaverse compete with your metaverse?" Wang said. "The competitive dynamics as to which cloud you're going to put your metaverse in are going to get even tougher."
Early benchmarks of the RSC configuration, conducted internally by Meta, show the system runs computer vision workflows as much as 20 times faster than Meta's existing legacy production and research infrastructure.
It runs the Nvidia Collective Communication Library (NCCL) about nine times faster and trains large-scale NLP models three times faster than that same infrastructure.
This level of performance means it can train an AI model consisting of billions of parameters in three weeks compared to the nine weeks it currently takes, the company said.
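The training-time claim can be restated as a speedup factor. The following is a minimal sketch of that arithmetic, using only the three-week and nine-week figures the company cited. The function names are hypothetical, and the result says nothing about workloads other than the one Meta described.

```python
def speedup(old_duration: float, new_duration: float) -> float:
    """Return how many times faster the new system finishes the same job."""
    if old_duration <= 0 or new_duration <= 0:
        raise ValueError("durations must be positive")
    return old_duration / new_duration


def time_saved_fraction(old_duration: float, new_duration: float) -> float:
    """Return the share of the original training time that is eliminated."""
    if old_duration <= 0 or new_duration <= 0:
        raise ValueError("durations must be positive")
    return 1 - new_duration / old_duration


if __name__ == "__main__":
    old_weeks, new_weeks = 9, 3  # figures cited by Meta
    print(f"Speedup: {speedup(old_weeks, new_weeks):.1f}x")
    print(f"Time saved: {time_saved_fraction(old_weeks, new_weeks):.0%}")
```

Going from nine weeks to three is a threefold speedup, or about two-thirds of the training time removed.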
Despite the lack of proof derived from real-world testing, Meta claims the current configuration is "among the fastest supercomputers" currently in operation and will be the fastest AI-based supercomputer when delivered in June of this year as Meta plans.