Date: 2024-11-14

SALT LAKE CITY -- Platform engineers face unique challenges supporting their organizations' generative AI workloads, and they're asking open source projects to address them.

In the year since the last KubeCon + CloudNativeCon North America, the dizzying pace of generative AI development has continued. Last year, efforts were just beginning to more efficiently provision GPU hardware using Kubernetes for early AI experimentation; this year, enterprises have already moved through the first cloud-based phase of AI development. Now, some companies want to host more large language models (LLMs) and associated workloads on self-managed infrastructure to cut costs and preserve data privacy.

But, as enterprise IT pros are now discovering, there are good reasons cloud providers with experience running complex infrastructure at scale were the first to shoulder that operational burden.

Not only is generative AI infrastructure complex, often requiring multiple Kubernetes clusters working in concert, connected with equally complex multi-cloud and multi-cluster networks, but it must be highly reliable, according to a keynote by CoreWeave engineers. CoreWeave is a cloud computing startup that specializes in hosting AI infrastructure.

"There are tons of complex physical components that have a direct impact to your hyper-connected training cluster in Kubernetes. Any change in any layer of the stack can impact the cluster health or the job performance, and if something goes wrong … it has detrimental impact to the entire cluster," said Chen Goldberg, senior vice president of engineering at CoreWeave, during the presentation. "Silent data corruption might also occur and can severely affect model quality."

Moreover, "repeat offenders and intermittent failures are not a nuisance," Goldberg said. "They are the obstacle for experimenting and getting things done quickly."

Even when enterprises don't do training for large foundational models, hosting and serving LLMs along with fine-tuning and inferencing workflows place new demands on internal developer platforms, according to a keynote presentation by Aparna Sinha, senior vice president and head of AI product at Capital One.

Although Sinha's team had years of experience running a machine learning platform, generative AI required it to add new data services to support the large amounts of unstructured data used by LLMs, additional cross-platform services for semantic search and summarization, user-friendly interfaces to support software developers in addition to data scientists and researchers, as well as updated API management and security guardrails. On top of that, the platform had to remain easy for developers to use, and all of this work merely lays the groundwork for the next frontier of agentic AI.

Now, as more platform engineers embark on this path, open source tools can help, but they require further development, Sinha said.

"If you use closed source [tools], you can actually get most of this platform, or many aspects of it, [with] very little work to be done, and that gives you good time to market," she said. "But on the other hand, open source [has] really started to catch up … and so now you have the ability to create a platform in house that's far more customized and tailored to your needs. … But of course, building up that platform requires an open source community and a number of components that are yet to be invented."