AI agents' role in IT infrastructure is expanding
AI agents are moving from passive monitoring to active operators, handling incidents, optimizing infrastructure and forcing teams to redesign platforms for autonomous workflows.
There's always been a tension between building and maintaining infrastructure by hand and automating away as much toil as possible.
AI has pushed that balance decisively toward automation. Instead of an SRE who gets jolted awake at 3 a.m. to chase down a failing cluster, agents can now spot anomalies, triage the blast radius, test a fix and roll it out before anyone reaches for their laptop. Ideally, a human simply receives a notification the next morning that an incident was avoided overnight. This is aspirational for many teams but is already taking shape in real production environments.
Natan Yellin, CEO of Robusta; Itiel Shwartz, co-founder and CTO of Komodor; and Luca Forni, CEO of Akamas, explained how they are using AI to manage modern infrastructure. Their responses pointed to the same general trend: AI is becoming an operational participant instead of a passive monitor, and it is taking a meaningful share of the load off platform engineers and SREs. Let's look at how modern infrastructure operations teams should respond.
Build an internal developer platform for agents, not humans
Internal developer platforms (IDPs) are self-service portals designed to abstract Kubernetes complexity away from developers. In Yellin's view, this idea is already obsolete. "Today, platform engineering is about building platforms for AI agents, and the role of humans is to build a platform that agents can use effectively," he said.
In most organizations today, the IDP is curated and maintained by platform engineers who publish pre-configured packages, which are combinations of infrastructure, runtime and guardrails. When a developer has code ready to ship, they don't raise a ticket; they open the IDP, browse the catalog, pick the package that matches their use case and deploy into it with confidence.
AI is now starting to sit in that platform engineer seat. Instead of a human hand-crafting each new package, the developer simply describes what they need, and the agent assembles a compliant package, runs the checks and greenlights it for production.
A platform built for agents assumes that code will be written, tested, verified and even deployed without human involvement. At Robusta, this has meant building ephemeral environments where agents can preview their own builds, automated test pipelines that produce evidence of correctness, and AI-generated PR documentation -- including screenshots and walkthroughs -- attached automatically so reviewers aren't starting from zero.
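The verification flow described above can be sketched as a simple pipeline. This is an illustrative model only; the type names, fields and environment-naming scheme are assumptions, not Robusta's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical data model for an agent-produced change.
@dataclass
class AgentChange:
    branch: str
    passed_tests: bool
    screenshots: list = field(default_factory=list)

@dataclass
class ReviewPacket:
    """Evidence bundle attached to a PR so a human approver isn't starting from zero."""
    branch: str
    verdict: str
    evidence: list

def verify_in_ephemeral_env(change: AgentChange) -> ReviewPacket:
    # 1. Spin up an ephemeral preview environment for the agent's build (stubbed).
    env = f"preview-{change.branch}"
    # 2. Run the automated test pipeline and collect evidence of correctness.
    evidence = [f"env={env}", f"tests_passed={change.passed_tests}"]
    evidence.extend(change.screenshots)  # AI-generated walkthrough artifacts
    # 3. Recommend human approval only when verification produced proof.
    verdict = "ready-for-human-approval" if change.passed_tests else "needs-agent-rework"
    return ReviewPacket(change.branch, verdict, evidence)
```

The point of the packet is that the human's final-approver role starts from assembled evidence, not from a blank diff.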
The human's role shifts from gatekeeper to final approver, which is a very different job that takes much less time.
Delegate a third of repetitive tasks to agents
Platform engineers and SREs face a constant stream of production incidents, and a significant portion of that workload is mundane.
Events like rollbacks, pod restarts and resource scaling events follow recognizable patterns that a well-built agent can understand and respond to without human involvement. According to Shwartz, "For something like 30% of the cases, Komodor can solve it independently and only give a human an FYI."
The agent follows the same process a skilled SRE would: detect the anomaly, investigate the likely causes, execute the remediation and escalate only when the problem exceeds its confidence threshold. Instead of triggering a pre-defined runbook when a specific alert fires, the agent reasons over the current state of the system and selects the appropriate action. That matters because production environments are rarely tidy enough for static rules to cover edge cases reliably.
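That loop can be reduced to a minimal sketch. The threshold value, anomaly names and remediation actions below are illustrative assumptions, not Komodor's actual logic; the key idea is that the agent acts autonomously only when its diagnosis clears a confidence bar, and escalates otherwise.

```python
# Assumed confidence bar; real systems would tune this per action's blast radius.
CONFIDENCE_THRESHOLD = 0.9

# Illustrative mapping of recognized anomaly patterns to remediations.
REMEDIATIONS = {
    "crash_loop": "restart_pod",
    "bad_rollout": "rollback_deployment",
    "cpu_saturation": "scale_up_replicas",
}

def handle_incident(anomaly: str, confidence: float) -> str:
    """Detect → investigate → remediate, escalating past the confidence threshold."""
    action = REMEDIATIONS.get(anomaly)
    if action is None or confidence < CONFIDENCE_THRESHOLD:
        return "escalate_to_human"   # page an SRE with the findings so far
    return action                    # execute autonomously; send a human an FYI
```

Note that an unrecognized anomaly escalates even at high confidence, which is what distinguishes this from a static runbook trigger.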
That said, Shwartz shared a vital disclaimer: accuracy is imperative for autonomous incident response. Komodor spent time validating its models before letting them act in production. "Accuracy validation is one of the most overlooked things in modern LLM development," he said.
This sentiment is reinforced by recent data from Omdia. It shows that AI-specific quality metrics -- such as hallucination rates and bias scores -- remain underdeveloped in surveyed organizations.
Give agents expanding circles of context
More than model quality or clearly defined policies, the quality of context given to an agent determines how useful and accurate it actually is. An agent connected only to an observability stack can see only a narrow slice of most real incidents.
Yellin recommends giving agents multiple circles of context:
- Circle 1. Observability (logs, metrics, traces)
- Circle 2. Cloud provider (AWS, GCP, Azure)
- Circle 3. Source control (GitHub, GitLab)
- Circle 4. ITSM (ServiceNow, Freshworks)
- Circle 5. Organizational knowledge (Confluence, Notion)
- Circle 6. Operational databases (IoT device data, Postgres)
Each new ring doesn't just add more data; it unlocks a qualitatively different class of problems the agent can reason about.
Manage circles of context deliberately
However, there are constraints to an agent's context. "On one hand I want to give the models more data, more tooling, more MCP. On the other, every time I increase the amount of data they have, things simply don't work or they start to hallucinate," said Shwartz.
Observability data is effectively infinite. Terabytes of logs, metrics and traces will never fit inside any current model's context and can degrade model performance if it's too complex or too noisy.
The solution isn't to limit integrations but to manage the data diet by deliberately distributing it across multiple specialist agents.
Use multiple specialist agents
Komodor and Robusta arrived at the same architectural conclusion independently: A coordinated family of narrow, domain-specific agents outperforms a single model trying to handle everything.
Robusta's HolmesGPT filters, aggregates and slices observability data, giving the model what's relevant to the problem at hand rather than everything that's available. Coding agents, verification agents and the SRE agent run in parallel pipelines, each doing one thing well.
Similarly, Komodor decomposes complex tasks across multiple agents for detection, investigation and remediation, each owning one stage of the pipeline.
Each specialist runs end to end within its own domain or hands off to a peer when the problem crosses a boundary. The architecture is composable, enabling improvement or auditing of one specialist agent without touching the others. This approach treats context management as an ongoing discipline, not something to set and forget.
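The decomposition can be illustrated as a staged pipeline of narrow agents with handoffs at stage boundaries. The stage names, thresholds and event fields below are assumptions for illustration, not either vendor's API.

```python
# Each specialist owns one stage and sees only the data relevant to it.
def detection_agent(event: dict) -> dict:
    event["anomaly"] = event["error_rate"] > 0.05  # assumed threshold
    return event

def investigation_agent(event: dict) -> dict:
    event["root_cause"] = "bad_rollout" if event.get("recent_deploy") else "unknown"
    return event

def remediation_agent(event: dict) -> dict:
    event["action"] = ("rollback" if event["root_cause"] == "bad_rollout"
                       else "escalate_to_human")
    return event

def pipeline(event: dict) -> dict:
    """Hand off across stages; each specialist is improvable and auditable alone."""
    for stage in (detection_agent, investigation_agent, remediation_agent):
        event = stage(event)
        if not event.get("anomaly", True):  # nothing anomalous; stop early
            break
    return event
```

Because each stage is a separate function, one specialist can be swapped out or audited without touching the others, which is the composability benefit both vendors describe.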
Use AI to optimize IT spend
Another growing use case with AI agents is their ability to reduce IT spend without degrading system performance.
Forni at Akamas has observed that the real waste in most Kubernetes environments lives not in the hardware layer, but inside the workloads themselves. It's in runtime settings like JVM heap sizes, garbage collection configurations and horizontal pod autoscaler (HPA) thresholds. "No one is actually looking inside the JVM," Forni noted.
After Akamas tuned its customers' JVM configurations, they reported an average cost reduction of 50-70%, with some customers achieving as much as 91% in 15 days. A large e-commerce company saw a 13% improvement in response time on its cart microservice, which translated to $800,000 in savings.
Logging costs follow the same pattern. Yellin described asking HolmesGPT to identify the noisiest logs in a customer's environment. Two large enterprises ran this exercise and found that one log line change cut 20% of their entire logging bill, while another cut 30% with two changes. Human engineers often don't prioritize these tasks because there's always something more urgent.
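The log-noise audit described above comes down to grouping log lines by template and surfacing the few that dominate volume, and therefore cost. A minimal sketch, using crude digit-stripping as the templating step (real tools use more sophisticated pattern extraction):

```python
from collections import Counter

def noisiest_templates(lines: list[str], top: int = 3) -> list[tuple[str, int]]:
    """Count log lines per template, highest-volume first."""
    # Strip digits so "retry 17" and "retry 42" group under one template.
    templates = ["".join(ch for ch in line if not ch.isdigit()) for line in lines]
    return Counter(templates).most_common(top)
```

Once the dominant templates are known, demoting one or two of them to a lower verbosity level is the single-line change that produced the 20-30% billing cuts described above.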
Key takeaways for IT leaders
Organizations that redesign their platforms and workflows around AI agents now will have a head start over those that treat them as a future concern. IT leaders can start with the following approaches:
- Rebuild your IDP for agents. Design your internal developer platform for autonomous deployment, self-verification and agent-driven CI/CD.
- Automate your repetitive 30%. Start with rollbacks, restarts and scaling events and hand off repetitive tasks to AI agents.
- Expand your circles of context. Every new connection unlocks a class of problems your agent can reason about.
- Manage the context diet. Give each agent the right data for the problem at hand, and decompose complex tasks across multiple specialist agents.
- Look where humans don't. Audit your JVM and runtime configurations, and your logging verbosity levels to save costs without affecting performance.
Twain Taylor is a technical writer, musician and runner. Having started his career at Google in its early days, today, Twain is an accomplished tech influencer. He works closely with startups and journals in the cloud-native space.