Enterprise IT pros have become more selective about which DevOps monitoring tools they use and, as a result, have reaped benefits in the form of faster, more focused incident response.
IT analysts predicted this pattern would emerge as DevOps practices matured. Early on, specialized vendors emerged with products for various cloud-native technologies, such as containers and Kubernetes, as well as application-level network monitoring, and IT pros found themselves using a multitude of tools -- sometimes, dozens within the same infrastructure.
Over the last few years, however, monitoring vendors have merged and begun to offer product packages that encompass all three major categories of IT monitoring data: logs, metrics and traces, along with AIOps-driven correlations between them. In their most recent product releases in 2021, DevOps monitoring vendors such as New Relic and Dynatrace added interfaces that target the so-called full-stack developer, reflecting how common that role has become.
Among IT organizations such as Mendix, a division of Siemens that markets a low-code application development platform, this broader market maturation process has also played out internally. Until six months ago, the team that runs the Mendix PaaS platform used a mix of monitoring tools, now also known as observability tools, including Grafana, Prometheus, Datadog and AppDynamics.
But as the platform grew to encompass more than 4,000 customer apps running in 10 global AWS regions, as well as private data centers, and on both Kubernetes and Cloud Foundry container infrastructure, the team decided to eliminate all monitoring and observability tools other than Datadog.
The Mendix infrastructure would remain complex, but the company's engineers believed they could at least simplify how they tracked its performance, freeing up developers to focus on more strategic work.
"Due to the large scale that we have, we want to have our people innovate as much as possible -- we want to automate the boring work," said Maarten Smeets, vice president of R&D of cloud deployment and operations at Mendix. "And observability is boring."
Datadog consolidation slashes troubleshooting time
Datadog stood out at Mendix because it's easy to integrate into developers' CI/CD pipelines and is familiar to most of the Mendix DevOps team, Smeets said. That made it well suited to the way the IT organization is set up, where software engineers don't want to rely on a central platform engineering or site reliability engineering team that could become a bottleneck to deployments.
Datadog was straightforward enough for developers to use on their own and covered a variety of back-end infrastructure components equally well, from Amazon services to Cloud Foundry on premises. Its interface is accessible to less technical participants in the software release workflow, but engineers can also use it to dig deep into specific data sets.
But the biggest benefit of the move to Datadog came from the consolidation itself, rather than the particular tool the company chose, Smeets said.
"They all do something well," Smeets said of Datadog's competitors. "The one thing that Datadog does really well is it speaks the language of the people who are actually using it."
Over the last six months, Mendix deployed Datadog's monitoring agent to collect logs, metrics and application traces. This made data gathering faster and more efficient, and all these data types are now displayed together on dashboards that Mendix teams can share, making collaboration easier. Datadog's Log Patterns AIOps feature also helped quickly pinpoint the root cause of issues.
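How Datadog's Log Patterns feature works internally is not covered here. As a rough illustration of the general idea of log pattern grouping, the following Python sketch masks variable-looking tokens so that similar lines collapse into one template, then counts lines per template. The masking rules, function names and sample log lines are all hypothetical and are not taken from Datadog or Mendix.

```python
import re
from collections import Counter

# Hypothetical masking rules (an assumption for illustration, not Datadog's):
# long hex-like tokens and plain numbers are treated as variable parts.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def to_pattern(line: str) -> str:
    """Replace variable-looking tokens so similar lines share one template."""
    for regex, placeholder in _MASKS:
        line = regex.sub(placeholder, line)
    return line


def group_by_pattern(lines):
    """Count how many log lines fall under each template."""
    return Counter(to_pattern(line) for line in lines)


# Invented sample log lines.
sample_logs = [
    "GET /apps/1042 took 31 ms",
    "GET /apps/77 took 250 ms",
    "Connection to db-3 timed out after 5000 ms",
    "GET /apps/9 took 12 ms",
    "Connection to db-1 timed out after 5000 ms",
]

if __name__ == "__main__":
    # Print templates from most to least frequent.
    for pattern, count in group_by_pattern(sample_logs).most_common():
        print(count, pattern)
```

Grouping by template turns thousands of near-identical lines into a handful of patterns, which makes an unusual pattern easier to spot. Production tools are likely to use more robust clustering than these two regular expressions.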
"Before we used Datadog, we basically did log analysis and correlation with home-brewed tools, which was manual and error-prone, with long lead times," Smeets said. "Building our own observability tools is absolutely not something that I think should be our ambition."
When DevOps teams at Mendix still cobbled together their own monitoring tools, pinpointing the root cause of an issue was painstaking and slow, as engineers moved back and forth between separate tool interfaces, Smeets said. Now, troubleshooting typically takes minutes, with much of that correlation done automatically by the Datadog product.
"[Manually] finding correlations between different things going on in our complex infrastructure can take days," Smeets said. "We're now able to spot the problem and fix it right away."