No matter how many tools are in your enterprise monitoring stack, you're likely paying an observability tax, according to one startup's CEO.
Google Trends indicates that the term observability tax has been rising in popularity since 2025, though it likely dates to 2023. It reflects the often high costs of gaining observability insights from disparate monitoring tools. These costs come from the fees organizations pay for features within these tools, such as individual seats or dashboards within a platform, data ingestion, and integrations among various tools.
"I don't know who coined the phrase, but I do love it," said Ari Zilka, CEO at MyDecisive.ai, an open source observability platform with a data pipeline engine for OpenTelemetry.
In this episode of IT Ops Query, Zilka described the MyDecisive.ai SmartHub platform as a remedy to the observability tax. The platform aims to eliminate what Zilka sees as a tradeoff between total cost of ownership (TCO) and mean time to resolution (MTTR) that organizations often face with other observability platforms, such as Datadog and Splunk.
"If you want to solve problems fast, you need a ton of data," Zilka said. "If you give them a ton of data at $2.50 a gig, your tax is super high. The less data you give them, the less they can help you resolve an incident."
The MyDecisive.ai SmartHub performs streaming analytics on in-flight telemetry data to reduce massive data volumes to only the most critical alerts. It uses intelligent alerts and trace sampling along with its Intelligent LogStream tool to identify issues, which users can then export and forward to their observability SaaS vendor.
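In rough terms, this kind of in-flight reduction works like a windowed filter over the telemetry stream: buffer recent records, drop normal traffic as noise and forward only the errors. The sketch below is illustrative only, not MyDecisive code; the record shape, the 5xx error test and the thresholds are all assumptions.

```python
from collections import deque

def reduce_stream(records, window=100, max_error_rate=0.01):
    """Keep a sliding window over in-flight telemetry records of the
    form (service, status, latency_ms). When a full window's error rate
    exceeds the threshold, emit only the error records as 'signal' and
    discard the normal traffic as noise."""
    buf = deque(maxlen=window)
    signal = []
    for service, status, latency_ms in records:
        buf.append((service, status, latency_ms))
        if len(buf) == window:
            errors = [r for r in buf if r[1] >= 500]
            if len(errors) / window > max_error_rate:
                signal.extend(errors)  # forward only the signal
                buf.clear()            # window consumed; start fresh
    return signal
```

A downstream exporter would then forward only the returned signal records to the observability vendor, rather than the full stream.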
"We are asserting that the only way to eliminate the observability tax is to come into your environment, give you total cost of ownership control and give you MTTR reduction," Zilka said.
The concept of observability tax is especially relevant given recent SaaSpocalypse rumblings, Zilka said. SaaSpocalypse refers to an increased prevalence of SaaS customers using AI agents to create or customize their applications, which can reduce or even eliminate reliance on their SaaS vendors.
"I would love there to be a SaaSpocalypse driven by AI," Zilka said. When it comes to observability SaaS, he argued that many vendors offer data pipelines that focus too much on analytics rather than action, which can slow remediation.
"Those decisions have to happen fast, and you need to take the human out of the loop," he added. While there's inherent risk in that, Zilka noted that human-in-the-loop processes also carry risk and introduce errors. The upside of AI, though, is that it can make fewer errors while processing data on a much larger scale than humans can.
"Observability is built for humans to look at dashboards, and MyDecisive is built for robots to take action and keep systems healthy," he said.
Watch this episode of IT Ops Query for more on the observability tax and the TCO vs. MTTR tradeoff, how MyDecisive.ai positions itself against competitors, and balancing the risks of AI with the benefits of taking humans out of the loop.
Kate Murray is a managing editor with Informa TechTarget's Infrastructure editorial team. She joined the company as an associate managing editor of e-products in 2020.
Transcript - The 'observability tax,' AI and the SaaSpocalypse
Beth Pariseau: From Informa TechTarget, I'm Beth Pariseau, and this is IT Ops Query.
This podcast distills the signal from the noise about enterprise software development and platform engineering. Each week, we'll talk to expert guests about the latest tech industry news and trends that engineering and IT leaders need to know.
Don't forget to subscribe to IT Ops Query for more conversations on AI and the future of the enterprise digital workspace.
Ari Zilka is managing partner at Razor's Edge Investments and CEO at MyDecisive.ai, which created a data pipeline engine for OpenTelemetry called Smart Telemetry Hub.
Before MyDecisive, Zilka worked as general manager of the New Relic incubator, partner at Khosla Ventures, and CPO/CTO at Hortonworks.
MyDecisive's platform supports environments running OpenTelemetry, Kubernetes and Prometheus. It performs streaming analytics on telemetry data as it's ingested, reducing its volume, generating proactive alerts and triggering runbooks to remediate issues on the fly.
In November, MyDecisive donated its Datadog logs ingest tool to the OpenTelemetry Contrib Collector. Say that five times fast.
Hi, Ari.
Ari Zilka: Hi, Beth.
Pariseau: Thanks for joining today. I really appreciate you taking the time.
Zilka: Likewise.
Pariseau: So, I noticed poking around a little bit on the MyDecisive website that, you know, it does support many vendors' observability repositories -- including, I think I saw mentioned, Splunk and Databricks. But a lot of the messaging, at least in recent months, has focused on Datadog, whether it's the open source donation or a case study on the website about a healthcare company that cut nearly $2 million per year from its Datadog bill. So, why that focus?
Zilka: Great question, Beth. It's pretty simple for me personally, but I don't think it's obvious to the non-insider. If you Google, you'll find this term out there called the 'observability tax,' and there's over a thousand articles on it now. I don't know who coined the phrase, but I do love it. I feel like companies are in massive observability debt. And they're paying that tax mostly to some combination of Splunk and Datadog.
These are the biggest vendors in the space. People need the most help with them. The Datadog product, like, if you want to retain data for 30 days, they're basically a database in the cloud with a glass screen attached to it so you can chart your CPU utilization.
They've added a ton of whiz-bang features over the last 15 years. They're a really smart product team, but you're paying $2.50 a gigabyte just to send them data before you extract any value from it -- $2.50, plus a bunch of other charges. So, we're targeting Datadog because we believe that's where the market's at. We believe that's where the pain is at, that's where the need is greatest.
Pariseau: OK. And then, of course, devil's advocate time: Datadog and other observability vendors have their own data pipeline tools and OpenTelemetry support, so why would a company add MyDecisive or use it instead?
Zilka: Look, there's a tradeoff that observability doesn't talk about but is real, and it's at the heart of the observability tax. That tradeoff is what I call a TCO vs. MTTR tradeoff. If you want to solve problems fast, you need a ton of data. If you give them a ton of data at $2.50 a gig, your tax is super high. The less data you give them, the less they can help you resolve an incident.
And so, they have provided you a tool that helps you save money, that sort of alleviates the tax, but it just trades off by making it harder to respond to incidents because the data's now gone. So you can't go to Datadog, you can't go to Splunk and just blindly filter data. You have to have a smart hub that comes into your cloud and keeps track of 100% of the data so that it can extract the pure signal and forward only the signal.
I'm not here to attack Datadog and say, 'How dare they tax people?' They're providing value. I don't question the value; I question the approach of 'send 10% of your data and then cross your fingers that the signal, the root cause analysis, is in there.' It's not. And I know that from my seat at their competitor -- their chief competitor, New Relic -- for five years. The signal wasn't there, and that's because they're all sampling, because you -- 'you' meaning the customers -- are all crying that they're too expensive.
If you look at TCO vs. MTTR, you want the lowest total possible cost of ownership and the shortest resolution time possible. The best way to do that is to come into someone's cloud, look at all the data, throw away all the duplicative information, all the noise, and forward to Datadog 'here's an incident.'
Pariseau: Hm. I mean, is it fair to compare MyDecisive's approach to -- the thing it reminds me of is how Cribl got its start reducing Splunk costs.
Zilka: It's fair. I would call Cribl an ETL tool. It cannot keep track of all the data. It has a tiny, tiny piece of memory that's good for a few seconds of telemetry data flowing by. MyDecisive has got an inbuilt partitioned memory grid. So, it could keep track of -- that case study you mentioned? We're keeping track of a million requests every minute for their highest-traffic service. It's a huge consumer offering.
And a million requests a minute requires, in MyDecisive, something like 10 gigabytes of RAM. That's really cheap when paid for as an Amazon compute instance on that customer's dime. But if you were trying to keep -- so, Cribl can't keep track of 10 gigs of information in memory. It's an ETL tool, bytes in, bytes out, translate them, transform them, move on.
MyDecisive is: Bring me a bunch of data, let me hold on to it for a couple of minutes, figure out what are the errors, what is normal, throw away the normal because that's noise, and take the errors and forward them on to your automations or your vendor of choice.
That means we come in for the same value prop. We come in introducing the same value prop as Cribl -- save money -- but with a different approach. Instead of filtering, it's extract pure signal.
Pariseau: OK. There's also companies like StarTree that do real-time streaming analytics on observability data, so is this kind of combining some of those elements as well?
Zilka: Absolutely. They're -- the point of our system is to drop in in a minute. So, when you look at analytics, and as you said in the intro, I was at Hortonworks and, you know, if you break it down this way…
I wanna come in from a different angle. I was about to answer it one way, Beth, but for me, it looks like this. From the user's perspective, you have some choices. You have this observability tax. You can get rid of the observability tax by going open source. But that's just a capex/opex tradeoff. You switch vendor costs for headcount costs inside your organization, running the open source footprint, and you still have a bunch of holes in the technology.
If you look at an analytics-centric platform or even an AI site reliability engineer, AISREs, then you're still keeping your observability vendor, which means the tax is still there. You're layering in extra capabilities and, in many cases, a human still has to come in and help do something with the potential signals that pop out.
And so here, we fundamentally address the tradeoffs between seeing all the data and, you know, total cost of ownership by bringing ourselves into the customer's environment and being completely telemetry-aware. So, there's no setup, no headcount costs, no hidden costs with us because the AI in MyDecisive.ai actually understands observability data out of the box. It doesn't have to be told what's an alert condition, like, 'This much CPU is too much CPU.' Or 'This many nodes is too many nodes, page me so I can help.' Like, no, we know what too many is because we know what you've been doing for the last 24 months, last 15 months.
We have a seasonality model built into our AI, and we know normal and we know abnormal. We don't need any setup, any configuration, install in a few minutes, versus an analytic system like StarTree where it's ready to answer any query you may have. We're the ones doing all the querying, not the people.
Pariseau: OK. Speaking of AI, is it machine learning AI, or is it GenAI or is it a combination? What is the AI part of MyDecisive.ai?
Zilka: Great question. So it has to, a priori, be statistics and machine learning for a specific reason, which is explainability. Right? You can't throw it into gen -- you can't throw telemetry data into a generative soup. And then, you might be able to spit out later, 'This is abnormal.' But you can't answer why it's abnormal.
So with us, we've built a model hanging off of Google's site reliability best practices documentation from 15 years ago: error rate, response time, throughput. What is this system doing right now? And what did it do this same time an hour ago, yesterday, same time last week, last month, last quarter and last year?
If you can answer that in a millisecond, then as the telemetry streams by, you know if you've got a problem or not. Right? With any specific signal on any specific component architecturally. So it might be application code, it might be a container, it might be a physical host, it might be a network load balancer. But you need to be able to say how many nodes were there, what version of the code was running, how many requests was it handling, how many database calls did it make, how many downstream service calls did it make, and how much CPU and memory did it consume to do that minute worth of work.
And so that is a statistical model, if I'm direct with you. And there's nothing wrong with that, because now I can feed it into generative and spit out normal/abnormal. Much, much lighter weight, much, much faster. You're talking about like a 4-gigabyte model instead of like 40 petabytes, if you try to get all the raw data.
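The seasonality check Zilka describes -- comparing a live signal to the value seen at the same point an hour ago, yesterday, last week and so on -- can be approximated with a simple statistical baseline. This is a hypothetical sketch, not MyDecisive's actual model; the history windows and the three-sigma rule are assumptions.

```python
import statistics

def is_abnormal(metric_now, history, k=3.0):
    """Compare a live metric (e.g. error rate or response time) against
    the values recorded at the same point in prior periods. Flag it when
    it falls more than k standard deviations from the historical mean --
    an explainable rule, unlike a generative black box."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return metric_now != mean
    return abs(metric_now - mean) > k * stdev
```

Because the decision is just a mean and a standard deviation, the system can also answer *why* a signal is abnormal: it can report exactly how far the live value sits from its seasonal baseline.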
Pariseau: OK. And just getting back to the observability tax, there obviously is gonna be some business model for MyDecisive. So, what is that business model? What do customers pay you for?
Zilka: So, our customers today pay us a support contract because not only are we built using CNCF technologies, like OpenTelemetry itself, we're also built with Kubernetes; we have Prometheus; we offer snap-ins to databases like GreptimeDB, which is a telemetry-specific type of ClickHouse-style database, if I were to oversimplify great technologies; Argo CD.
We stand on the shoulders of giants, and from those shoulders, we ourselves pay it forward and give away our data filtration, our observability tax tools for free, and they're 100% open, in the same permissive licenses as everything, no copyleft, nothing like that.
And that said, right now we just ask for a support contract if you deem it necessary. Long-term, we intend to monetize the AI side. So, if you think about, like, how many times could an automation help you avoid an incident, that's what we want to charge for -- not the volumes of data under management, but the amount of automated savings we generate. That's the plan, but we're still sussing that out.
Pariseau: And, you know, it's interesting in the year 2026 to see, you know, rules-based runbooks essentially, where a lot of vendors are touting AI agents for things like automated remediation workflows. Is that something MyDecisive might add in the future?
Zilka: I'm not a fan of runbooks. Rules-based, maybe, but we're trying to build our system in a way that the system generates the runbooks with full auditability and visibility. So, because we built ourselves on this tight, tiny statistical model, we could basically say, 'This system went bad at this time on this signal.' So, the error rate went up.
And the important thing that we're missing together in this conversation, Beth, is I also need the source control lifecycle, the software lifecycle. So, I integrated to GitHub, GitLab and we know when the source is changing.
So, for a company that's modern and immutable and following GitOps best practices, then everything to do with their production is in code or config. And when they want to change something about their Kubernetes cluster or their application logic, they're going to GitHub to do so. We know that they've made a change, we can diff the before/after. And we want to help people build the following type of runbook, which is we want them to break down their releases into components.
Because one of my biggest beliefs is destructive runbooks are bad runbooks, ones that restart things, ones that say, 'Do a rolling restart on this error.' I'd rather avoid the error by saying, if I push out the database, and the application slows down, stop there. If I push out the database and the application and everything's OK, then push out the new client application. So, I'd rather work on a roll-forward model than a rollback model. I'd rather work on an 'add nodes' or 'fail over to a healthy cluster' than 'do a rolling restart' type of thing.
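The roll-forward model Zilka describes -- release one component at a time, check health after each step and stop rather than restart or roll back when something degrades -- can be sketched as a staged-deploy gate. The stage names and callbacks here are hypothetical, not a real MyDecisive interface.

```python
def staged_release(stages, deploy, is_healthy):
    """Roll a release forward one component at a time (e.g. database,
    then service, then client app). After each stage, check health; on
    degradation, halt the release rather than take a destructive action
    like a rolling restart. Returns the stages that shipped cleanly."""
    released = []
    for stage in stages:
        deploy(stage)
        if not is_healthy():
            break  # stop here; nothing destructive, no rollback
        released.append(stage)
    return released
```

The gate itself stays trivial; the value comes from wiring `is_healthy` to a telemetry baseline so the release halts on real degradation, not on a human's guess.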
And with my Walmart.com background, you know, 20 years ago, everything was a destructive runbook. If this comes out of the monitoring tools, then start restarting the cluster. If that comes out of Oracle Enterprise Manager or Spotlight from Quest, then reboot the database, or fail over to the other database instance and start, you know, rolling back the data access tables. Page the data team, page the application team, page the networking team, they've got to fix something.
And as soon as humans touch the system, as soon as humans are asked to draw loose conclusions from telemetry, you're at a path toward error. So I'd rather say build runbooks about how you release than build runbooks about how you remediate.
Pariseau: Ah, OK.
Zilka: I will build the remediation runbook if you build a stepwise release strategy, for example.
Pariseau: Is there a place for an AI agent in that mix at all?
Zilka: Someone asked me that by coincidence last night, and I came to a conclusion after a couple of beers, which was you literally -- MyDecisive doesn't care if the system's being run by humans, and humans are reacting to our events and our detection opportunities, or AI is. If AI is writing all the code in your enterprise, in your super-modern, 6-month-old company, you might be doing exactly that right now.
But if AI is writing the code, because of our integration point, being the source control and being able to tell the difference between healthy and unhealthy, I can interact with agents and say, 'That code you wrote is slow.' And those agents can say, 'Oh, OK, well, I should do a performance optimization on it.'
You know, I'm not gonna give you signals like a product analytics tool that would tell you like your users don't like that feature. But you would want to have closed loops. If an AI agent were building your user experience, you'd want a closed loop that says the users like the feature or don't like the feature, and they're voting with their actions, and product-led growth causes us to optimize the product in a tighter and tighter loop. Same for performance, same for availability.
So, I would like to feed into those agents that they are writing code that's slow, exactly how it's slow, which version of the code. If you're a human or you're an AI agent, I can give you the same telemetry.
Pariseau: But, correct me if I'm wrong, it doesn't sound like MyDecisive is offering its own agents at all yet.
Zilka: The agents we're building, in 2026, are going to be focused on runbook generation and automated control of production environments. We're not building, like, pipeline generation agents or telemetry optimization agents or dashboard authoring agents, things like that. Correct.
Pariseau: OK. And then, kind of at a higher level, I happened to see a post that you shared on LinkedIn that was a video about the topic of an AI bubble. And your comment was, 'People have been asking us why the '.AI' in our name and if we should be distancing ourselves from that form of go-to-market. I will just say this for now and comment more deeply later. Lots to think about here, definitely thinking on this.'
That was about a month ago. So, how are you thinking about that now?
Zilka: Yeah. I'll be direct, like, I got scared a month ago with the data that was coming out in the declarations of an AI bubble. I'm more calm now. I do think that, you know, the Anthropics of the world -- Claude, ChatGPT, Gemini -- they are here to stay for sure.
I think that Claude Code is way more powerful than people realize. And there are, in fact, companies -- I think two years ago, I asserted AI is not gonna write back-end code, it's gonna write front ends. Now we're seeing AI writing back-end code.
I have no fear of AI. We have no fear of AI. We let AI generate pieces of our code. We let AI generate our product strategy. Works wonderfully. The bubble part is the thing I'm kind of more calm about. Like, AI is here to stay. Have the giants overinvested in AI, bought too much real estate, bought too many servers, overextended themselves capital-wise ahead of the market need? Absolutely.
It's gonna get absorbed into the noise. I think, in the next two years, they will find that they were a little bit ahead by quarters, but they're all big enough to afford to handle that kind of early, you know, early execution.
Pariseau: Yeah, I mean, the latest thing to get panicked about has been the SaaSpocalypse. I don't know if you followed any of that, but what do you make of that idea? Do you think, you know, the Salesforces --
Zilka: I would love there to be a SaaSpocalypse driven by AI.
Pariseau: Oh yeah?
Zilka: Because we brought our SmartHub into the customer's cloud because we're tracking 100% of the data. We are asserting that the only way to eliminate the observability tax is to come into your environment, give you total cost of ownership control and give you MTTR reduction.
I mean, there's no such -- look, I sold a bunch of software in my career, helped people build a bunch of architectures. I'll never forget a giant telco in the U.S. where they felt they had egg on their faces because they went all in on SOA. Remember service-oriented architecture?
Pariseau: I do.
Zilka: And they said, 'We are in trouble, Ari.' I was about to use poor language.
'We are in trouble, Ari, because we bought too much infrastructure for our service-oriented cloud, and we were ready to provide SOA as a service to our customers.' This is 2012. 'And SOA died, and now we're gonna repurpose it to be a data lake architecture for your Hadoop clusters, Ari.'
And I'm like, 'Well, we've got a problem. Your SOA clusters don't have the right CPU storage mix for a data lake.' And they're like, 'Well, I can't throw it away, so we're gonna use it anyway.'
Fast forward to today, everyone says the declaration that SOA was dead was lunacy because everything is service-oriented nowadays.
Pariseau: Absolutely.
Zilka: One hundred percent of architectures are service-oriented and mesh-based, and HashiCorp brought us there, and Kubernetes brought us there, and Amazon brought us there.
So, you could say it's what goes around comes around, but I think architectural principles -- like SaaS is an architectural principle to me. Like, let me offload X because I have no business owning this, I just want to get onto the experience without any ownership of the infrastructure, no ServiceNow, no ITSM/ITOM responsibilities inside my IT organization. How can that be wrong from a principles perspective?
I will, though, say that when you bring it back to us, to observability? This is the fundamental flaw in observability of SaaS, it's the way observability grew up. It started with New Relic/Wily technology. And Wily was based on an RDBMS. So, you had two nodes of Wily running on top of a traditional relational database like Postgres.
Then they moved to this database in the cloud. The first New Relic was Cassandra. The modern New Relic is homegrown on something they built themselves called New Relic Database. And Honeycomb has a database based on Scuba, Datadog has a database they built -- actually a bunch of the Hortonworks team is at Datadog building their data layer today. New Relic is built off of Hadoop as well.
Long story short, all these companies jumped to a conclusion, Beth, that there's too much data, it's too complicated to own the infrastructure, you must offload the data to a database in the cloud run by a vendor so that you can now do analytics. They missed the entire set of use cases and the unlock that we're creating for users, the value unlock we're creating around closed-loop.
The most important thing to do is while it's still fresh, while it's still on the wire, within a few milliseconds, within a few minutes, try to draw a conclusion and take an action. You should always be tilting toward action, not toward analytics. You should be tilting toward, do I know what's going on, and can I label the situation as good or bad, and do I need to take a remediation of any kind, roll forward, roll back, add more, run fewer, save money, prioritize users, prioritize budgets. Those decisions have to happen fast, and you need to take the human out of the loop.
The movement you and I haven't talked about is human-in-the-loop AI from five years ago. I'm getting rid of the human in the loop. Like, observability is built for humans to look at dashboards, and MyDecisive is built for robots to take action and keep systems healthy.
Pariseau: OK. So are there risks to that? I mean, you know, is there a risk of taking action prematurely or of a robot making a mistake?
Zilka: I mean, there's risks in anything.
Pariseau: Sure.
Zilka: I can't sit here in good conscience and say there's zero risk. But I spoke to hundreds of CTOs in my New Relic journey alone, forget when I was at Hortonworks. And just inside the observability space? They all said MTTR is not getting any better with my investment in you, number one, that's why we call it an observability tax, Beth. And, number two, the system issues are more often than not caused by a cascade of events that started with a human who made a small change and walked away. Right?
And so -- and I have an example of that. Like, we used to push our database, the New Relic database, and I don't want to talk too much and violate any NDAs, but I don't think I'm sharing any secrets here. We used to push our database at 2 in the morning, and the engineers who owned that database -- which wasn't me, but I was friends with them -- they said they're super tired, and they hated making the change because they would make mistakes. And they wouldn't find out about the mistakes till they woke up at 11:00 a.m., because they were up all night so they would set their alarm for 11:00 a.m., but they'd get paged with 'The customers are being hurt' by 8:00 a.m.
And they moved their change window to 2 p.m. on a Thursday. And they did that only after they built rolling update, so they could take a system and update it with zero downtime. And they walked away saying, great success, we change in the middle of the day, rolling updates, zero downtime, customers are happy, engineers are happy, problem solved. Now, of course, that was very manual as a solution. They didn't buy a product to do that; they reengineered their change control process.
But back to your question of is there risk? The answer is, it turns out, at 2 p.m., their rolling update was hurting users. Users went from 10-millisecond response time in the screens to one-and-a-half minutes. And these human engineers -- brilliant engineers, way smarter than me for sure -- who built this core globally scaled 6-exabyte database, they didn't look at the right signals.
I think we can do better. We're not going to be perfect, but it's the same argument for autonomous driving vehicles. Eventually, the vehicle can see more with more types of telemetry than a human driver can process.
And for us, MyDecisive SmartHub can see more, can see the entire estate simultaneously. You change something in a corner, and I see it cause ripple effects through the whole estate, and I tell you, 'Stop.' And then I can write runbooks that say, 'When you make that change, no one else should be allowed to change anything,' which is why letting a robot be the gatekeeper is very important. Because you don't know, when you're changing the network, you don't know if someone's changing the app, and the two of you together are causing chaos, maybe for the end users, maybe for each other, maybe for both.
Yes, we are going to make mistakes, but we're gonna make far fewer than humans can because it's a beyond-human-scale problem to watch the entire estate while you're changing things.
Pariseau: Yeah, things are definitely beyond human scale in so many different places.
Zilka: Yeah, totally.
Pariseau: Great. Well, thank you very much. It's been a great conversation. Really enjoy your insights on the market, and thank you very much for taking the time.
Zilka: Thank you. I appreciate it. Great questions, great conversation.
Pariseau: Thank you for tuning in to IT Ops Query. To learn more about enterprise software development and platform engineering, explore our content on Informa TechTarget sites. Find us on YouTube at our channel, Eye on Tech. Subscribe to our podcast to receive the latest episodes as they drop. And if you liked what you heard today, give us a rating and review on Apple, Spotify or wherever you're listening. Thank you for joining us.