Datadog shops: AI incident management needs platform engineers

AI-driven incident management will only be as good as the human-designed platforms it runs on, according to conference presenters at Datadog DASH.

AI incident management agents like those touted by Datadog at its recent DASH conference aren't magic. Behind every successful IT automation is a human platform engineer, according to the presenters at last week's DASH breakout sessions.

Similar to attendees at Cisco Live earlier this month, the DevOps and platform engineers whose sessions were livestreamed on June 10 are taking a cautious approach to adopting autonomous AI agents for incident management. While some engineers are experimenting with Datadog AI tools, their focus remains on building a clean platform foundation for production reliability before expanding agentic automation.

Krafton Inc., makers of the PUBG: Battlegrounds game played by millions of users worldwide, has spent nine years refining its incident management process, according to Junghun Kim, DevOps engineer at the Seoul-based video game publisher, in a DASH breakout session presentation.

Part of that refinement has included consolidating its original set of five observability tools into a single platform with Datadog, but Kim said changes to how application and SRE teams handled incident workflows were more important.

"The incident platform is not some kind of in-house product or internal platform tool," Kim said. "It's more like a mental model and an ecosystem around the incident response."

That model has three foundational layers before AI even enters the picture, let alone autonomous AI agents, Kim said.

"We want to reach that goal someday, but today AI can still make wrong judgments during incidents, and if it takes critical action that's hard to roll back, the risk is too high for production reliability," Kim said. "So, for now, our focus is on AI assistance. AI doesn't replace the human responder, and it helps responders work faster on top of the system we already built, which includes [context data about] code ownership, [incident] severity, and guardrails."

Krafton uses Datadog's MCP server and newly released Pup CLI to give coding agents access to its incident context data in Datadog, Kubernetes, Atlassian Jira and Slack. They assist human operators with tasks including cross-source debugging for root-cause analysis, drafting incident postmortems and runbooks, and documenting weekly on-call handoffs. The company's total incident volume has shrunk from 107 in 2024 to 24 so far in 2026. Time to detect an incident has dropped from 8.8 minutes on average to 1.6 minutes, and mean time to repair has dropped from 53.5 minutes to 10.3 minutes.
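To put the detection and repair figures in proportion, the short sketch below computes the percentage reduction for each. The helper name `percent_reduction` is hypothetical, introduced only for this illustration; the inputs are the averages Kim reported. Incident volume is left out because the 2026 count covers only part of a year.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Averages in minutes, as reported in Kim's DASH presentation.
time_to_detect = percent_reduction(8.8, 1.6)
time_to_repair = percent_reduction(53.5, 10.3)

print(f"Time to detect: {time_to_detect:.1f}% lower")       # 81.8% lower
print(f"Mean time to repair: {time_to_repair:.1f}% lower")  # 80.7% lower
```

Both measures fell by roughly four-fifths.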

"The trend is clear across every number: earlier detection, faster repair, and less user impact," Kim said. "This result came from the response platform that we built, not some sort of AI magic."

Swish: No substitute for platform foundations

Platform engineers from Getswish AB, makers of the Swish Swedish payment processing app, told a similar story of incident management improvement in another livestreamed session. These engineers spearheaded the creation of a new internal incident management platform for the company after a 2021 outage was reported in the press before the company's IT management outsourcing provider was aware of it.

By late 2024, another high-profile incident made national news but was detected and resolved by an SRE team within five hours. Getswish is now working on extending observability and incident management practices to its financial transaction partners.

Between those two points came a comprehensive effort to rearchitect the company's code development process around GitOps and its incident management practices around Google's seminal manual on SRE, said Jonas Cronholm-Lundin, head of platform at Getswish AB, during the June 10 DASH session.

"The SRE book that Google published [10] years ago has a prioritization triangle," Cronholm-Lundin said, referring to the book's Service Reliability Hierarchy, represented by a pyramid with monitoring at the base. "If you have problems with reliability in your system, where do you start? … Adding on release procedures or performance testing and whatnot will not do any good if you don't build your base in the triangle first. You need to actually understand how production runs and how it works. That is where everything like this needs to start."

The speakers also emphasized the importance of laying platform groundwork before AI agents can be effective. The company's revamped incident management process now includes runbooks structured according to the OODA (observe, orient, decide, act) loop approach to decision-making, Cronholm-Lundin said.

"When agents are coming in, like Bits [AI] SRE, and so on, if you have gathered your post-mortem reports [and] curated good runbooks, that is excellent for your agent to consume in the future when you have future incidents, and also for onboarding purposes," he said. "So it's super important to do the groundwork of actually bootstrapping your internal documentation in this area."

Data classification agents raise fears, opportunities

Datadog made more than 100 updates to its observability platform during DASH, several of them offering an AI-assisted approach to data gathering. These include auto-tagging features in its new Runtime Prioritization Engine for security vulnerabilities and a new Auto-Processing feature for log management with its Observability Pipelines.

Reached by email following the presentation, Cronholm-Lundin told TechTarget that his team is following Datadog's agentic updates with interest and plans to evaluate some, but it's too early to say which will be relevant to the company.

Another DASH attendee who saw a demo of Auto-Processing said he was concerned about a potential loss of skills among engineers as AI-driven automation takes hold.

"My takeaway was that developers no longer need to have discipline in what they generate for log output or how they format it; AI will take care of all the details so humans can focus on iterating faster and shipping more code," said Eric Swanson, senior site reliability engineer at MagicSchool AI, makers of a generative AI platform for education in Denver, Colo., in an online interview during DASH.

"My concern with these types of solutions is that we are choosing to hand our agency over to Agents, and blunting the sharpness of our skills honed through critical thinking," Swanson said, emphasizing that he was speaking on behalf of himself, and not his employer. "The skills themselves may become obsolete, but we run the risk of dulling our thinking next."

However, one breakout session presenter who has worked extensively with Observability Pipelines noted that AI can offer new opportunities for engineers to improve incident management efficiency.

Not every log generated by applications at US Bank gets a "first-class seat" in Datadog's back end, but Observability Pipelines enables the platform to route the data to other repositories, such as Amazon S3, to manage retention costs, said Bryan Pierson, vice president of enterprise observability engineering at US Bank, during his DASH presentation.

But Pierson's team also came up with a way to route logs to Datadog in the event developers need its AI features, he said.
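The general idea of opt-in log routing can be sketched in a few lines. This is not US Bank's implementation or any Datadog API. Every name below is hypothetical, and the sketch assumes that debug-level logs are the ones archived by default unless an application team opts in.

```python
from dataclasses import dataclass, field

# Hypothetical destination labels, not real Datadog or Observability Pipelines names.
INDEXED_BACKEND = "indexed_backend"
LOW_COST_ARCHIVE = "low_cost_archive"


@dataclass
class DebugLogRouter:
    """Illustrative opt-in router; an assumption, not the tool Pierson's team built."""

    opted_in_apps: set[str] = field(default_factory=set)

    def enable_debug(self, app: str) -> None:
        """Application team opts in to indexing its debug logs."""
        self.opted_in_apps.add(app)

    def disable_debug(self, app: str) -> None:
        """Application team opts back out; debug logs return to the archive."""
        self.opted_in_apps.discard(app)

    def route(self, app: str, level: str) -> str:
        """Pick a destination for one log record."""
        # Assumption: only debug-level logs are archived by default.
        if level.upper() != "DEBUG":
            return INDEXED_BACKEND
        if app in self.opted_in_apps:
            return INDEXED_BACKEND
        return LOW_COST_ARCHIVE


router = DebugLogRouter()
print(router.route("payments", "DEBUG"))  # archived by default
router.enable_debug("payments")
print(router.route("payments", "DEBUG"))  # indexed after the team opts in
```

Keeping the toggle per application leaves the cost decision with the team that owns the logs.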

"Right before DASH, we ended up putting out a proof-of-concept application called Kickflip that puts the power back in the application team's hands to enable their debug logs if they do need them on the platform for use with things like Bits AI," Pierson said.

Beth Pariseau, senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism.
