AI is changing cloud operations, but not in the way orgs expect

The defining question for organizations today is no longer whether AI can be deployed, but whether it can be operated reliably, safely and in alignment with business goals at scale.

As enterprises accelerate investments in generative and agentic AI, many are discovering that deploying models is only the starting point. The real challenge emerges once these systems move into production.

Recent industry findings from organizations such as the FinOps Foundation and Flexera highlight growing concerns about cost control, visibility and value realization in cloud environments. For example, the "Flexera 2026 State of the Cloud Report" continues to show that a majority of organizations struggle with cloud cost management and lack consistent visibility into usage -- issues that become both more complex and more visible as AI-driven workloads move from controlled pilots into production at scale.

"AI systems don't just run -- they interact, evolve and make decisions across services in production environments. That changes the problem from deploying workloads to OSes that continuously produce outcomes," said Varun Raj.

In this interview, Varun Raj, a cloud and AI platform leader who works on large-scale enterprise and AI-driven systems, describes a growing gap between how organizations build AI systems and how they operate them in production.

How is AI influencing cloud decision-making at the enterprise level?

Varun Raj: AI is no longer just another workload -- it behaves more like a system that makes decisions, interacts across services and evolves over time. That shifts enterprise focus from where workloads run to how systems are controlled.

Cloud decisions are now driven by the ability to support dynamic behavior, continuous monitoring and alignment with governance frameworks. This represents a shift from infrastructure-centric thinking to operating-model readiness.

Where are organizations encountering the most difficulty once AI systems move into production?

Raj: The biggest challenges emerge after deployment, when systems begin interacting with real-world environments. Across enterprise implementations, three areas consistently surface:

  • Infrastructure limitations.
  • Integration complexity.
  • Lack of runtime control.

Cloud environments were designed for predictable, transactional workloads. AI systems behave differently -- they are iterative, stateful and sometimes unpredictable. At the same time, they depend heavily on integrations with APIs, data platforms and business workflows. Even small inconsistencies can cascade into larger operational issues.

How do infrastructure and integration challenges affect business outcomes?

Raj: These challenges translate directly into business risk. Infrastructure constraints lead to inconsistent performance, which impacts reliability and user trust. Integration gaps can result in unintended or incomplete actions, introducing compliance and financial risks.

For C-suite leaders, this shows up as operational uncertainty, making it harder to predict outcomes and ensure alignment with business objectives. This is why organizations are beginning to treat AI systems not as isolated applications, but as operational platforms that require continuous management.

How are cloud platforms evolving to support the unique demands of AI systems in production?

Raj: This shift is happening in several ways:

  • Greater emphasis on orchestration layers to manage interactions across services.
  • Integration of observability tools that track not just performance, but system behavior.
  • Support for feedback-driven architectures that allow systems to adapt in real time.
  • Separation of execution and decision layers to improve control over outcomes.

The focus is moving beyond compute and storage toward coordination, control and alignment. Organizations that succeed will be those that can manage system behavior continuously and ensure AI-driven outcomes remain aligned with business intent.
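The "separation of execution and decision layers" Raj mentions can be sketched in a few lines. This is a hypothetical illustration, not a pattern taken from the interview: a decision layer (here, a stubbed-out agent) proposes actions, and a separate execution layer enforces policy before anything runs. All names, limits and actions below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    name: str
    params: dict

# Decision layer: a model or agent proposes actions but never executes them.
def decide(context: dict) -> ProposedAction:
    # Stub for model output; in practice this would come from an LLM or agent.
    return ProposedAction(name="scale_out", params={"replicas": context["load"] // 100})

# Execution layer: enforces policy before any side effect happens.
ALLOWED_ACTIONS = {"scale_out", "scale_in"}
MAX_REPLICAS = 20

def execute(action: ProposedAction) -> str:
    if action.name not in ALLOWED_ACTIONS:
        return f"rejected: {action.name} not permitted"
    if action.name == "scale_out" and action.params.get("replicas", 0) > MAX_REPLICAS:
        return "rejected: replica limit exceeded"
    return f"executed: {action.name}({action.params})"

result = execute(decide({"load": 900}))
```

Because the executor owns the guardrails, the decision layer can evolve (or misbehave) without gaining the ability to act outside policy.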

Are traditional governance approaches falling short?

Raj: Traditional governance relies on predefined rules, static validation and periodic reviews -- approaches designed for deterministic systems. AI systems don't operate that way; they adapt, interact and evolve based on context.

What we're seeing is the need for a shift toward runtime control, where systems are continuously monitored and adjusted during execution rather than only before deployment.

In many cases, systems appear operationally stable but behave in unintended ways -- reflecting a growing gap between system performance and system behavior. This is less about enforcing rules upfront and more about managing behavior as it unfolds.

Are there any mistakes C-suite leaders keep making with AI?

Raj: One of the most common mistakes is treating AI initiatives as extensions of traditional software projects. There's often an assumption that once systems are deployed, they will behave predictably. In reality, AI systems introduce new forms of uncertainty, especially when interacting with dynamic environments.

Another mistake is over-indexing on model performance while underestimating operational complexity. Many organizations invest heavily in building models but not enough in the infrastructure, integration and control mechanisms required to operate them reliably.

What should C-suite leaders prioritize to ensure AI systems align with business objectives and deliver value?

Raj: Three priorities stand out:

1. Visibility. Leaders need insight into how systems behave, not just whether they run. This includes understanding decision patterns, consistency and alignment with expected outcomes.

2. Control. Organizations must be able to intervene and adjust system behavior in real time. Static controls are no longer sufficient for systems that evolve during execution.

3. Integration. AI systems must be tightly aligned with enterprise workflows, data sources and governance frameworks.

From a metrics perspective, traditional KPIs like uptime and latency are no longer sufficient. Organizations are increasingly tracking:

  • Decision accuracy in context.
  • Consistency of outcomes across environments.
  • Behavioral drift over time.
  • Impact on business processes and outcomes.

These metrics provide a more meaningful view of whether AI systems are delivering value.
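One way to quantify "behavioral drift over time" is to compare the distribution of a system's outputs against a reference snapshot. Below is a minimal sketch using the population stability index (PSI), a common drift statistic; the bin values and rule-of-thumb thresholds are illustrative assumptions, not metrics cited by Raj.

```python
import math

def psi(baseline: list[float], current: list[float]) -> float:
    """Population stability index between two binned distributions
    (each a list of bin proportions summing to ~1)."""
    eps = 1e-6  # avoid log(0) on empty bins
    return sum(
        (c - b) * math.log((c + eps) / (b + eps))
        for b, c in zip(baseline, current)
    )

baseline = [0.25, 0.25, 0.25, 0.25]   # reference output mix at launch
stable   = [0.24, 0.26, 0.25, 0.25]   # small day-to-day variation
drifted  = [0.05, 0.15, 0.30, 0.50]   # outputs skewing toward one class

# A common rule of thumb: PSI < 0.1 stable, 0.1-0.25 worth watching, > 0.25 drift.
```

Tracked continuously, a statistic like this turns "is the system still behaving as expected?" into a number that can trend on a dashboard alongside uptime and latency.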

What changes are organizations making to address these challenges?

Raj: Organizations are adopting patterns aligned with distributed systems and cloud-native operations. These include:

  • Continuous monitoring of system behavior.
  • Feedback loops to detect and correct deviations.
  • More structured integration layers.
  • Separation of decision logic from execution environments.

What’s emerging is a shift toward actively managing AI systems rather than simply deploying them.

Kathleen Casey is the site editor for SearchCloudComputing. She plans and oversees the site, and covers various cloud subjects including infrastructure management, development and security.
