AI is trendy in the observability market, but one company saved both troubleshooting time and data management money by using an emerging vendor's product that doesn't include it.
Reveal Data Corp., an e-discovery SaaS provider based in Chicago, is no stranger to AI: The company has built multiple modes of machine learning into its own software and hosts an AI model library for its customers. But when Stephen Montoya joined the company in 2021, tasked with building the company's site reliability engineering practice, he chose a different route.
"AI sounds nice for predictive analytics, but that's a lot of compute horsepower and a lot of data that is needed to drive that," said Montoya, director of software development at Reveal. "A simpler solution is just to do statistical analysis. … You don't always need AI to solve these problems."
In fact, traditional forms of statistical analysis can provide predictions that are just as accurate as those from AI models but require less historical data -- days' worth instead of weeks' worth, according to Montoya. That becomes important as distributed application data explodes, along with costs for observability data management, he said.
"If you go to containerization, you can bring [10,000 EC2 instances] down to 2,000 machines," he said. "But the log size is going to stay the same -- it's going to be like you have 10,000 machines."
Montoya built his own observability tools from scratch at previous employers such as Conversant Solutions LLC, where he served as site reliability engineering manager from 2014 to 2019. But Reveal didn't have the same skills in-house, and two years ago Montoya went looking for a prepackaged tool from a vendor instead of building his own. He considered products with built-in AIOps features from vendors such as Datadog, Splunk and AppDynamics (now part of Cisco Full Stack Observability), but balked at their pricing.
"A lot of these observability companies have really lost their minds," Montoya said. "They had a model where they were going to house your logs so that they can parse them and figure out different predictive analytics, which [used to] make perfect sense, but if they're going to charge you for that kind of storage, that model doesn't make sense anymore."
Observe adds graph-based structure to data lake
Enter Observe Inc. Founded in 2017, it shipped its first feature-complete product in 2020, and claims to support up to 1 petabyte per day of data ingest as of last month. Observe built its own knowledge graph to structure log data stored in Snowflake's data lake back end using Amazon S3 storage, and doesn't charge for data ingestion, a departure from some competitors such as Datadog and Dynatrace. Users do pay for Amazon S3 space, about $23 per terabyte per month.
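To put the storage figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the roughly $23 per terabyte per month rate cited above and a steady daily log volume; the function names, the 1 TB per day volume and the 30-day retention period are hypothetical, and the math ignores compression, storage tiers and query costs.

```python
# Rough storage-cost arithmetic. The rate is the approximate figure cited
# in the article; everything else here is a hypothetical illustration.
S3_RATE_USD_PER_TB_MONTH = 23.0


def steady_state_tb(daily_ingest_tb, retention_days):
    """Terabytes held once retention is full, assuming constant daily ingest."""
    return daily_ingest_tb * retention_days


def monthly_storage_cost(stored_tb, rate=S3_RATE_USD_PER_TB_MONTH):
    """Monthly storage bill in dollars for a given stored volume."""
    if stored_tb < 0:
        raise ValueError("stored_tb must be non-negative")
    return stored_tb * rate


# Hypothetical workload: 1 TB of logs per day, kept for 30 days.
stored = steady_state_tb(daily_ingest_tb=1, retention_days=30)
print(monthly_storage_cost(stored))  # 690.0
```

Under those assumptions, storage alone would run about $690 a month, which is why the price of compute for queries, not storage, tends to dominate the bill.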
By analyzing data with a graph rather than AI, Observe takes a skeptical stance on AIOps similar to that of Honeycomb.io. But Observe doesn't require changes to how log data is sent to its back end, whereas Honeycomb users ship structured events for later queries. Observe builds dashboards and graph-based data topology views automatically, but users can also query their data directly using a specialized language called Observe Processing and Analysis Language. Because the Snowflake data lake separates compute resources from storage, some 50% of the cost of Observe is for accelerated data queries during troubleshooting, according to a company blog post.
Observe charges customers for consumption credits, which are applied flexibly according to how often they access data and how fast their queries run, rather than for individual observability features. Montoya estimated that Observe costs Reveal a tenth of what competitors would charge.
Observe can demand more skill from its users than competitors that offer elaborate analytics and interfaces, said Gregg Siegfried, an analyst at Gartner. But for incident response, Observe can do much of the same sophisticated troubleshooting as AI-driven competitors at a lower price.
"There's a lot of things that, say, Dynatrace has had to do with building Grail that Observe was able to get for free by virtue of using Snowflake," Siegfried said. "It requires kind of an educated buyer … but people that have taken the time to understand it have found it very effective at what it does."
Since Reveal first deployed Observe in production 18 months ago, its mean time to repair issues has shortened, though Montoya didn't have an exact measure of how much time has been shaved from that process. But the tool helps narrow searches quickly during troubleshooting, and issues timely alerts based on statistical deviations from normal system behavior.
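The article doesn't describe how Observe or Reveal computes those deviations. As a general illustration of the idea, the following minimal Python sketch flags a metric reading when it falls more than a set number of standard deviations from the mean of the readings just before it. The function name, window size, threshold and sample latencies are all hypothetical.

```python
from statistics import mean, stdev


def deviation_alerts(values, window=5, threshold=3.0):
    """Return indices of readings that deviate sharply from recent behavior.

    Each reading is compared with the mean and sample standard deviation
    of the `window` readings immediately before it (a rolling z-score).
    """
    alerts = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale for a z-score; skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 250, 101]
print(deviation_alerts(latencies))  # [6]
```

A check like this needs only a short window of recent readings to establish a baseline, which is consistent with Montoya's point that statistical methods can work with days of history rather than weeks.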
"It's not just being able to identify something's wrong, but when you're actually trying to figure out what's broken, it's extremely helpful there, too," he said. "I can distill things down to a customer and then start to break things down just by adding filters into the query, and I don't need to know where the log is coming from."
Moreover, the entire software engineering team at Reveal can use the tool in development environments as well as production, which has helped with proactively preventing issues, Montoya said. He also praised strong customer support from the vendor but said he'd like to see it add polish to its dashboards in the future.
"I'm still looking for that Nirvana, 'single pane of glass' [view]," he said. "[It's] just a nice thing to put on [a] TV [monitor] so that people can see that the health of our entire ecosystem is good."