
Real-world examples of cloud observability in action

Observability platforms are no longer just IT tools; they're strategic business enablers that directly affect revenue, customer satisfaction and competitive positioning.

By combining metrics, logs and traces, IT teams gain far deeper visibility into their systems, enabling proactive problem-solving rather than reactive firefighting. Businesses that implement comprehensive observability strategies report significant improvements: richer tooling provides a more nuanced understanding of IT systems, streamlines troubleshooting and improves resource planning.

Cloud observability has evolved from basic monitoring to AI-augmented platforms. Each new capability brings benefits such as proactive issue detection, faster deployment and more efficient cost optimization.

Randy Armknecht, a managing director and global cloud practice leader at Protiviti, a business advisory firm, said, “Cloud observability has advanced from basic monitoring to intelligent AI augmented platforms. Over the past couple of years, I’ve seen more and more clients embrace unified observability stacks that combine metrics, logs and traces across their hybrid and multi-cloud environments.”

Himanshu Jain, a partner in the Digital and Analytics practice at Kearney, a strategy and management consulting firm, has also observed observability spreading from its roots in technical domains into business domains, with use cases such as monitoring sales and customer experience and reducing technical debt.

“The penetration of observability into business functions and its usefulness [has] made it a pull vs. a technical push,” Jain said.

Recent shifts in cloud observability tools

Experts are seeing several significant shifts in observability tooling that take advantage of new signals, open source standards, new tracing approaches, AI and FinOps principles.

Titus M, practice director at Everest Group, an IT advisory firm, said some of the more important new trends to watch include the following:

  • Continuous correlations. Metrics, logs and traces are being combined into more sophisticated poly-signal models that now include continuous profiling and real-user experience data. This gives site-reliability engineers context on what failed and who felt it.
  • Open source. OpenTelemetry is now the default wire format, with over 70% of buyers asking for it in RFPs. This spells the end of proprietary agents and simplifies “instrument once, ship anywhere” pipelines, as sketched after this list.
  • Agentless. Agentless kernel tracing based on the extended Berkeley Packet Filter (eBPF) has moved into mainstream tools such as Grafana Pixie. This provides deep latency visibility with negligible overhead.
  • AI augmentation. AI and machine learning have pivoted from alert noise reduction to automated root cause analysis assistants. This reduces the mean time to repair by summarizing the causal path for engineers.
  • Financial oversight. FinOps principles are being applied to telemetry through techniques such as log rehydration, adaptive sampling and tiered storage. This helps teams balance observability depth with cost.
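
To illustrate the “instrument once, ship anywhere” idea, here is a minimal sketch using the OpenTelemetry Python SDK. It assumes an OTLP-compatible collector listening on localhost:4317; the service and attribute names are illustrative, and any compatible backend could receive the data.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service once; the OTLP exporter can point at any compatible backend.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Emit one trace span with a business attribute attached.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.total_usd", 42.50)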

Observability use cases

Here are some real examples of how enterprises are starting to apply these new capabilities to solve problems, reduce costs and improve resilience. All of these were described by observability experts, based either on specific use cases or on a summary of their experience across many customers.

State agency citizen-services portal outage

Titus did a case study on a state agency that suffered intermittent 503 errors during tax deadline traffic. Data from synthetic probes and real-user monitoring showed that the supporting Kubernetes infrastructure had exceeded service-level agreement (SLA) thresholds. An analysis of Kubernetes traces identified an undersized API pod. The team doubled the number of API pod replicas, and the error rate quickly dropped by 97%.
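
A remediation like this can be applied through the Kubernetes API. The following rough sketch uses the official Kubernetes Python client; the deployment and namespace names are hypothetical stand-ins for the agency's actual resources.

from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config() inside a cluster).
config.load_kube_config()
apps = client.AppsV1Api()

# Hypothetical names for the undersized API deployment.
deployment = apps.read_namespaced_deployment(name="portal-api", namespace="citizen-portal")
deployment.spec.replicas = (deployment.spec.replicas or 1) * 2  # double the API pod replicas

apps.patch_namespaced_deployment(name="portal-api", namespace="citizen-portal", body=deployment)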

Container orchestration misconfiguration causes microservices latency

A Kubernetes-based microservices application experienced sporadic latency on public-facing APIs. Traditional monitoring showed the symptom (slower response times) but provided no clear path to the root cause, said Armando Franco, director of technology modernization at TEKsystems Global Services, a technology services provider. A modern observability platform helped the customer trace request paths across services and identify delays at a specific ingress point. Further analysis revealed frequent container restarts due to misconfigured CPU limits.

The observability platform, paired with an AIOps engine, automatically correlated resource spikes with container behavior and flagged the issue before a full outage occurred. The team applied an Infrastructure-as-Code fix and deployed updated resource policies. “What would have taken hours to diagnose manually was resolved in minutes, dramatically improving reliability and customer experience,” said Franco.
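
As a rough illustration of the kind of signal the platform correlated, the sketch below uses the Kubernetes Python client to flag pods whose containers restart repeatedly and print their configured CPU limits. The namespace and restart threshold are assumptions, not details from the TEKsystems engagement.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Flag pods with frequent container restarts, a common symptom of CPU limits set too low.
for pod in core.list_namespaced_pod(namespace="ingress").items:
    for status in pod.status.container_statuses or []:
        if status.restart_count > 5:
            for container in pod.spec.containers:
                limits = container.resources.limits or {}
                print(pod.metadata.name, container.name,
                      f"restarts={status.restart_count}", f"cpu_limit={limits.get('cpu')}")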

E-commerce company identifies a timeout bug

Rick Clark, global head of cloud advisory at UST, worked with one e-commerce company that was experiencing intermittent checkout failures during flash sales and couldn't identify the root cause through traditional monitoring. They implemented Honeycomb's distributed tracing and high-cardinality analysis tools. This helped them discover that the issue only occurred when specific combinations of conditions aligned.

Specific factors included customers from certain geographic regions using particular payment methods during high-traffic periods. Slicing and dicing their trace data across multiple dimensions simultaneously revealed that a third-party payment API had different timeout behaviors for specific regional endpoints. The fix involved implementing circuit breakers and adjusting timeout values for those particular conditions.
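
The fix Clark describes can be approximated with a small circuit breaker and per-region timeouts. The sketch below is a simplified Python illustration; the payment URL, region names and thresholds are hypothetical.

import time
import requests

# Hypothetical per-region timeouts reflecting the payment API's observed behavior.
REGION_TIMEOUTS = {"eu-west": 3.0, "ap-south": 8.0, "default": 5.0}

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While the breaker is open, fail fast instead of waiting on a struggling endpoint.
        if self.opened_at is not None and time.time() - self.opened_at < self.reset_after:
            raise RuntimeError("circuit open: payment endpoint temporarily bypassed")
        try:
            result = func(*args, **kwargs)
            self.failures, self.opened_at = 0, None
            return result
        except requests.Timeout:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise

breaker = CircuitBreaker()

def charge(region, payload):
    # Use the timeout tuned for the caller's region instead of one global value.
    timeout = REGION_TIMEOUTS.get(region, REGION_TIMEOUTS["default"])
    return breaker.call(requests.post, "https://payments.example.com/charge",
                        json=payload, timeout=timeout)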

Reducing downtime and improving resolution

Dugan Sheehan, distinguished senior director of Azure product engineering at Ensono, a managed IT services provider, said the company uses several tools to reduce downtime and improve resolution activities. For example, every alert that gets generated goes through a predictive engine to determine the likelihood of the alert becoming a major incident. Based on that score, on-call activities can be set in motion. Next, historical alerting and change information is layered in to help diagnose the issue. This comes in the form of open ServiceNow changes, related alerts and tailored knowledge base articles.

If the issue is recurring, a problem ticket can be suggested to try to fully identify and address the root cause. In some cases, further diagnostics can automatically be retrieved to provide final decision-making criteria. For example, in a high CPU situation, Ensono has an automated call to Datadog to pull historical process utilization metrics for the same alerting time period. This information is then provided to the customer to determine if a restart or resize is required.
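
As a rough sketch of that kind of automated lookup (using Datadog's public metrics query API rather than Ensono's internal tooling), a script could pull average CPU for the alerting host over the same time window. The metric and host tag here are illustrative.

import time
import requests

def cpu_history(host, api_key, app_key, window_hours=1):
    # Query average CPU for the host over the alerting window via Datadog's v1 metrics API.
    now = int(time.time())
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/query",
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
        params={
            "from": now - window_hours * 3600,
            "to": now,
            "query": f"avg:system.cpu.user{{host:{host}}}",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("series", [])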

Latency spikes caused by API misconfiguration

Armknecht worked with one financial services client that faced latency spikes in its portfolio dashboard. Using distributed tracing and real-time metrics, observability tools identified a misconfigured API gateway that was throttling requests. Once the problem was identified, the team restored performance within minutes of the changes being pushed live.

Plugging cloud cost overruns

In another project, Armknecht worked with a client that used observability to identify idle compute from forgotten pilot projects that was driving cloud overages. Observability platforms collecting cost telemetry surfaced the inefficient resource usage, which guided the team in reallocating workloads and reducing costs by 30%.
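
A simplified version of that kind of idle-compute check is sketched below using AWS CloudWatch metrics; the CPU threshold and lookback window are illustrative assumptions rather than the client's actual policy.

import datetime
import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

def idle_instances(cpu_threshold=2.0, days=14):
    # Flag instances whose daily average CPU never rose above the threshold.
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(days=days)
    idle = []
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            stats = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                StartTime=start, EndTime=end, Period=86400,
                Statistics=["Average"],
            )["Datapoints"]
            if stats and max(point["Average"] for point in stats) < cpu_threshold:
                idle.append(instance["InstanceId"])
    return idle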

“These solutions typically involve integrating observability into CI/CD pipelines, setting policy-driven alerts, and aligning insights with business outcomes,” Armknecht said. This is an example of how observability is increasingly being used outside technical domains to help business and finance teams make better operational and financial decisions.


“We’ve also seen increased demand for tailored observability frameworks that align FinOps, compliance and business KPIs, making observability a strategic enabler rather than solely a technical necessity,” Armknecht said.

Diagnosing medical misinformation

Mikael Quist, CTO of Qoob, a developer of specialized data centers purpose-built for AI and GPU cloud workloads, walked through a hypothetical example of how new AI observability tools could help troubleshoot hallucination issues. In this scenario, a healthcare provider could use a specialized LLM observability platform, such as LangSmith, that supports semantic evaluation metrics capable of highlighting deviations from medically accurate responses.

By analyzing prompt and response logs alongside real-time cost dashboards monitoring model interactions, engineers can swiftly trace the issue back to a recent prompt change. That information guides them in reverting the problematic update and implementing retrieval-augmented grounding to improve factual accuracy and stability, resolving the hallucination issue before it affects patient safety.
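
To make the idea of semantic evaluation concrete, here is a simplified sketch (not LangSmith's actual API) that scores a model answer against a medically reviewed reference answer using sentence embeddings and flags large drift. The embedding model, threshold and sample text are illustrative.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def deviation_score(answer: str, reference: str) -> float:
    # 0.0 means semantically identical; values near 1.0 indicate the answer has drifted.
    emb = model.encode([answer, reference], convert_to_tensor=True)
    return 1.0 - float(util.cos_sim(emb[0], emb[1]))

reference = "Adults should not exceed 4,000 mg of acetaminophen in 24 hours."
llm_answer = "It is fine to take acetaminophen as often as needed for pain."

# Flag responses that drift too far from the vetted reference for human review.
if deviation_score(llm_answer, reference) > 0.35:
    print("Potential hallucination: route response for clinical review")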

George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.
