IT orgs face tricky cost calculus for self-hosted AI inference

Red Hat's twofold strategy to lower AI inference costs -- self-managed hybrid infrastructure and open-weight models -- has potential, experts say, but must be proven in practice.

ATLANTA -- Red Hat AI updates this week aimed to ease the transition from token consumer to token producer for enterprises, but whether self-hosted AI inference will yield long-term cost advantages over cloud-hosted services for the masses is an open question.

While blowing through token budgets for cloud-hosted AI services has become a common experience at large companies, presentations by early adopters of self-hosted AI here this week painted a picture of migration away from public cloud that wasn't for the faint of heart.

BNP Paribas, for example, detailed a multiyear project to move from hybrid cloud AI to entirely self-hosted AI models and infrastructure, a shift that posed significant engineering challenges. The bank, which processes some 1.5 billion AI tokens daily, manages bare-metal server clusters across three data centers for redundancy. Its ambitious goal was to deliver bare-metal hardware resources as a service that matched public cloud's ease of use for more than 150,000 end users.

To accomplish this, it manages a fleet of clusters using OpenShift HyperShift, a nested approach in which the control planes for a fleet of worker clusters run as pods on a separate management cluster. This came with its own challenges, including making sure that overlay networks and etcd storage on each hosted control plane (HCP) were properly sized, said Pascal Guerineau, technical architect at BNP Paribas, during a breakout presentation here this week.

"When we began, it was quite a challenge to understand how HCP works and how to manage it," Guerineau said. "We really had to think about the cluster sizing, which was very difficult for us."

The bank is still working to build a dynamically allocated GPU resource pool federated across these clusters, Guerineau said. It is also considering using OpenShift Virtualization to more efficiently allocate GPUs to lighter-weight workloads.

For a large bank, digital sovereignty and control over infrastructure were strong motivators for adopting self-hosted AI despite its complexities, and Guerineau said the total cost of ownership is lower than continuing with cloud-hosted AI.

But calculating the precise cost savings is not straightforward, he said during a Q&A after the breakout session.

"It is difficult to evaluate consistently all the costs," he said. "If you get some GPU machines in the cloud, it's easy to know what it is. If you have in-house GPUs, you have to consider the cost of the servers over the years, paying all the data centers and network [staff] … so this is difficult to communicate."

Reps from BNP Paribas present during a Red Hat Summit session with Joe Fernandes, vice president and general manager of Red Hat AI (far left). From left: Pascal Guerineau, Jean-Charles Lamy, Mathieu Keignaert.

OpenShift AI updates respond to early pain

BNP Paribas is arguably an outlier among mainstream enterprises contemplating self-hosted AI inference. For the average company, Red Hat predicts that a move away from public cloud will be partial, to a hybrid cloud architecture. Thus, not every enterprise self-hosted AI platform will involve self-hosted hardware and data centers, which carry their own management, supply chain and cost challenges.


BNP Paribas' work on self-hosted AI also predates many of the OpenShift AI feature updates meant to make it easier to operate, especially the Model-as-a-Service features announced this week, said Brian Stevens, senior vice president and AI CTO at Red Hat, in an interview with Informa TechTarget.

"They predated so much of this work around distributed inference, which we [recently] started," Stevens said of BNP Paribas. "Our job is to put the easy button on that, and hide the complexity, and that's what we're doing."

Other Summit presenters reported efficiency benefits from more recent migrations to OpenShift AI. Reps from Turkish bank Yapi Kredi detailed a 2025 move from a Cloudera-based MLOps system to a new shared platform for predictive and generative AI based on OpenShift AI, yielding 50% faster troubleshooting and 75% faster onboarding for its data scientists. In another session, reps from Northrop Grumman said OpenShift Kubernetes operators helped quickly and reliably provision services in its first on-site GPU farm last year.

But while that worked for an initial lighthouse project, the defense contractor is still developing GitOps-based deployment tooling for its broader environment, including air-gapped classified infrastructure, presenters said.

Joseph McConnell, infrastructure automation center of excellence lead at Northrop Grumman, said during the session that he expects cost efficiency benefits over cloud-based services to become clearer as agentic AI increases token burn.

However, during a Q&A after the session, McConnell said those benefits have not yet been specifically calculated.

"Right now, it's kind of a mix," McConnell said. "Honestly, we haven't done the math yet to be specific, but what we're hearing from the vendors is that [when] you get into that regular, huge use of millions tokens per user, that's where it's going to be happening."

The costs of cloud vs. complexity

Despite the difficulties of transitioning to self-hosted AI, some industry analysts are optimistic that new automation features for OpenShift AI will effectively lower the barrier to migration to self-hosting for more mainstream IT organizations.

"[Red Hat AI 3.4 is] a step in the right direction to reducing the fragmentation, shadow AI sprawl, and achieving consistency," said Tim Law, an analyst at IDC. "It removes much of the friction and difficulty from hybrid LLM operations. There are additional hard cost savings associated with removing that friction, as well as soft cost savings."

Varun Raj, cloud and AI engineering executive

But there are still plenty of risks for workloads as complex as generative AI that could add up quickly for many enterprises, said Varun Raj, a cloud and AI engineering executive working on enterprise AI and cloud transformation initiatives.

"[Red Hat AI] is an important abstraction layer, but not a full 'easy button' yet," Raj said. "Automation does not eliminate the harder enterprise questions: which model to run, whether quality is good enough, how to evaluate it continuously, how to secure it, how to govern outputs, and when self-hosting is actually cheaper than API consumption."

Weighing open-weight value

Red Hat's value proposition for self-hosted AI inference this week was twofold: not just more efficient in-house IT systems automation, but also a further move toward quantized, open-weight LLMs and small language models that are theoretically cheaper and easier to run without high-end hardware.
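
The arithmetic behind that claim is straightforward: the memory needed to hold a model's weights scales linearly with bits per parameter, so quantizing from 16-bit to 4-bit cuts the footprint by roughly four. The sketch below uses an assumed 70-billion-parameter model and ignores KV cache and serving overhead.

```python
# Illustrative only: approximate GPU memory for model weights at different
# quantization levels. Ignores KV cache, activations and serving overhead.

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough weight-only memory footprint in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits, label in [(16, "FP16/BF16"), (8, "INT8/FP8"), (4, "INT4")]:
    print(f"Assumed 70B model at {label}: ~{weight_memory_gb(70, bits):.0f} GB")
```

At 4 bits, a 70B-class model's weights fit in roughly 35 GB, within reach of a single high-memory GPU rather than a multi-GPU node, which is the hardware argument behind quantized open-weight models.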

This transition presents its own challenges. Yapi Kredi has 200 data scientists on its team and had already made the change from SaaS-hosted to self-hosted open source models five years before its move to OpenShift.

That transition was much more difficult, said Osmancan Uslu, head of institutional analytics at Yapi Kredi, during a Q&A after his breakout session presentation.

"When we moved into Red Hat, we were already using an open source architecture, so it was a little easier, but it was really a challenge when we first implemented our risk models in open source," Uslu said.

Yapi Kredi's distributed inference and model training initiative is still in development; it already uses 90% open source and open-weight models, but reps did not share specific information about cost savings during a Q&A.

Still, there's evidence that enterprises are motivated to try alternative models and self-hosting in the face of rising costs. An Omdia survey conducted in October found that nearly half of 400 respondents are using open source AI models, said Mark Beccue, author of a November report on the survey and an analyst at Omdia, a division of Informa TechTarget.

Top methods for reducing GenAI operating costs included model efficiency techniques such as quantization, cited by 21% of respondents, and running AI compute workloads on premises instead of in a public cloud, cited by 18%. Open-weight models were further down the list, cited by 4% of respondents.

"Larger enterprises with decent-sized IT departments will increasingly turn to open source models because they have the resources to work with them," Beccue predicted.

An Omdia survey report in November found that enterprises are already pursuing generative AI cost savings through alternative AI models and self-hosting.

Open-weight models and AI agents

Another open question for self-hosted AI is whether open-weight models can effectively keep up with commercial models in the AI agent era, given the reasoning demands of agentic AI workloads.

"Self-managed models will be great for narrow, well-defined tasks like customer service, but may not work for things like agentic workflows," said Larry Carvalho, principal consultant at RobustCloud. "Managing agentic workflows is a new issue, and it will be a matter of time before vendors make it easier to use."

Raj predicted the long-term outcome will land somewhere between specialized open-weight models and larger frontier models as enterprises shift toward AI agents.

"The value of smaller models will come from cost efficiency, control, latency, data locality and predictable task execution -- not from matching frontier models on every dimension," he said. "In that sense, agent adoption may actually increase the value of smaller models, because well-designed agents need a portfolio of models, not one expensive model doing everything."

Furthermore, as frontier model companies shift their focus toward enterprises, Red Hat's Stevens predicted they will also package their models for easier self-hosted use.

"We're not there yet, because the frontier models are printing money with other use cases," Stevens said. "But as enterprise use cases go up and they have more success with AI … they're going to want to capture that business as well." 

Beth Pariseau, senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism. Have a tip? Email her or connect on LinkedIn.
