Prometheus Q&A: How the Kubernetes monitoring tool is evolving

Prometheus has a reputation for being hard to work with. Richard Hartmann, a member of the Prometheus team and community director at Grafana Labs, talks about how that might be changing.

Prometheus is a time-series event monitoring tool for cloud-native, containerized environments -- particularly for use in Kubernetes ecosystems. In fact, because both are based on tools designed for internal use at Google, Prometheus inherently complements Kubernetes and integrates with the orchestration platform.

But Prometheus has also proven to be complicated to use in the past, with vendors building integrations to abstract the tool's complexities away for their users. However, that might be changing, as the group overseeing the open source project has spent the past year taking deliberate steps to address some of the tool's shortcomings and reach a broader audience.

With KubeCon + CloudNativeCon Europe in full swing, SearchITOperations caught up with Richard Hartmann, the community director at Grafana Labs, before the conference to find out what the Prometheus team has been working on recently. Hartmann is also the founder of OpenMetrics and a member of the Prometheus team, a PromCon lead and the Cloud Native Computing Foundation (CNCF) SIG Observability chair.

Editor's note: Answers have been edited for clarity and brevity.

What has the Prometheus team been working on over the past year?

Richard Hartmann: For the last maybe half [to] three quarters of a year, we've been actively trying to [update] the Prometheus project along various dependencies. The general gist is to make it more open. Arguably, it was open before, but we have historically [been] quite averse to sweeping changes.

We made a deliberate decision to reconsider several old design decisions to adapt to this new wave of use cases, because we are seeing yet another wave of major adoption. In particular, as cloud native and [the] CNCF keep [spilling] over the edges into the non-cloud-native spaces -- and … [as] best practices established within the CNCF take up more space in general IT -- we're also seeing a lot more usage of Prometheus. So for those reasons and others, we decided to open a few things.

What does 'things' mean in this context -- source code?

Hartmann: Well, the source code, not so much, but [rather] design approaches. So, for example, we had service discovery and Alertmanager integrations. Service discovery is an automated way to publish targets to monitor -- and you have integrations like Kubernetes, [which] have been there for ages. But the relatively new cloud providers -- or the [smaller] cloud providers -- we haven't really put any new things in … In 2020 that changed. We are still picking up speed, where we are introducing … more integrations for new cloud providers … You point your Prometheus at a defined endpoint, and everything happens automatically … which is obviously good. Same for Alertmanager, which is a mechanism [that enables] Prometheus [to detect] systems in the wrong state and emit this information to other systems, so humans can actually react.
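As a rough illustration of the "defined endpoint" idea, the sketch below builds the kind of JSON target list that Prometheus' file-based and HTTP-based service discovery consume: a list of groups, each with `targets` and optional `labels`. It assumes the answer refers to that style of discovery. The inventory, the `service` label and the helper name `build_target_groups` are hypothetical and are not taken from the interview.

```python
import json


def build_target_groups(inventory):
    """Turn {service_name: [host:port, ...]} into a list of target groups."""
    groups = []
    for service, hosts in sorted(inventory.items()):
        groups.append({
            "targets": sorted(hosts),        # addresses Prometheus should scrape
            "labels": {"service": service},  # extra labels attached to each target
        })
    return groups


# Hypothetical inventory; in practice this would come from a provider's API.
inventory = {
    "node": ["10.0.0.2:9100", "10.0.0.1:9100"],
    "api": ["10.0.1.1:8080"],
}

# The JSON document a discovery endpoint or file would expose.
payload = json.dumps(build_target_groups(inventory), indent=2)
print(payload)
```

A small service that serves this payload, or a job that writes it to a file, is enough for Prometheus to pick up targets without hand-edited configuration.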

Early this year, we chose to treat what we call 'experimental' differently: Historically, we have treated even experimental code and interfaces as rock solid -- which is, on one hand, nice for users, because they can rely on even experimental features being there long-term. On the other hand, we are locking ourselves in.

In this [new] model, Prometheus becomes part of a larger system, or a larger pipeline, where Prometheus ingests all the data and sends it off to cloud providers -- or [users'] own services. And that mechanism [to send data] from Prometheus to other storage back ends -- that is called Prometheus remote write.

To clarify -- Prometheus remote write is experimental?

Hartmann: Yes, this is something we have called experimental, but it's been stable for two or three years.

We attached a version number to it; we wrote a specification; we wrote the test suite. Now we can start to break it up again, as we have the stable [code] base everyone can test against. Other [ways] we treat the 'experimental' differently: We introduced feature flags … but it's not enabled by default. And it [still] might change -- we [the Prometheus team] will not lock ourselves down to treat everything … experimental as stable forevermore. By doing this, we can do things [we] didn't even consider.

At the last Prometheus dev summit, … we decided to accept PromQL, which is the language to [query] all Prometheus data in the complete Kubernetes ecosystem.

How is PromQL different from other query languages?

Hartmann: PromQL is a functional language, not imperative like SQL. The beauty of it is it has a lot of magic there ... [PromQL is similar to] vector computing inasmuch as it is a vector math language, where I set out my grid of computation once. And then I just toss an arbitrary amount of data into this computation, and the right thing just happens, which makes it easy to work with insanely large amounts of data …
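The "set out the computation once, then toss data into it" idea can be sketched in a few lines of Python. This is not PromQL and not how Prometheus is implemented; it is a toy model in which a vector is a dictionary from label sets to values, and `scale` and `sum_by` are hypothetical helper names. The grouping step is loosely analogous to a PromQL aggregation such as `sum by (job) (...)`.

```python
from collections import defaultdict


def scale(vector, factor):
    """Apply one computation to every series in the vector at once."""
    return {labels: value * factor for labels, value in vector.items()}


def sum_by(vector, label):
    """Sum series values, grouping by a single label name."""
    totals = defaultdict(float)
    for labels, value in vector.items():
        key = dict(labels).get(label, "")  # series without the label share ""
        totals[key] += value
    return dict(totals)


# Each key is a tuple of (label, value) pairs identifying one series.
requests = {
    (("instance", "a"), ("job", "api")): 120.0,
    (("instance", "b"), ("job", "api")): 80.0,
    (("instance", "c"), ("job", "web")): 50.0,
}

doubled = scale(requests, 2)         # same expression, any number of series
per_job = sum_by(requests, "job")    # totals per value of the "job" label
print(per_job)
```

The expressions stay the same whether the vector holds three series or three million. Only the data changes.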

That said, a lot of syntactic 'sugar' hasn't been in PromQL, because the primary use case was [maximum efficiency and speed]. Convenience features aren't supported, but we are starting to add them. Because what we're seeing is [that], as more people start using Prometheus, this goes farther away from this hardcore tech operating model … into the [increasingly] wider ecosystem ... We see this need come up more and more as the user base extends.

In the past, critics have said Prometheus was too great a beast to tackle without a third-party integration. Are these ease-of-use feature additions part of a roadmap to correct those issues and enable IT teams to use Prometheus on its own?

Hartmann: Yes and no. Prometheus is built to be a brutally efficient time-series database. [It] follows Unix design principles -- do one thing and do it well. So [Prometheus] is designed to not do a few things, like long-term storage, [for example]. We want people to use other [tools] if they want [features like] long-term high availability. It's not because we couldn't support it; it's because the design trade-offs we would have to make would lessen the ease of those other use cases. That's one part of the answer.

The next part of the answer is that Prometheus is kind of hard to hold, and we had relatively few self-care mechanisms and self-alerting … If you come from a sysadmin, DevOps [or] SRE type of background, it makes immediate sense, and it's easy to operate. But if you're not, you have this learning curve, which sometimes has been described more like a wall.

What are some of the newest updates that you're most excited about?

Hartmann: We want to support Raspberry Pi. We dropped this a few years ago, but we want to have it again … and we are working to extend PromQL and building [up to] things that, historically, we have not been willing to do. Downsampling is another huge [update] because you can either save even more space, or you can have quick queries [that go] back months and years.

For long-spanning things, you simply look at the downsampled data. If I want to know the performance characteristic of one thing for the last 12 months, I would look at the [down]sampled data. And if I want the precise values of that one day [spanning] to a month ago, I can still go into [the full-resolution] data.
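A minimal sketch of what downsampling means in general, assuming simple averaging into fixed-width time buckets. The function name `downsample`, the bucket width and the sample values are hypothetical. The interview does not describe how Prometheus or related projects implement it.

```python
def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % bucket_seconds          # bucket this sample falls into
        total, count = buckets.get(start, (0.0, 0))
        buckets[start] = (total + value, count + 1)
    return [(start, total / count)
            for start, (total, count) in sorted(buckets.items())]


# Hypothetical raw series: one sample every 15 seconds for two minutes.
raw = [(t, float(t // 15)) for t in range(0, 120, 15)]

# One averaged point per minute instead of four raw points.
coarse = downsample(raw, 60)
print(coarse)
```

Keeping both series reflects the trade-off described in the answer: the coarse series is cheap to store and fast to query over months, while the raw series still answers precise questions about a single day.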

What updates will Prometheus highlight at KubeCon + CloudNativeCon 2021?

Hartmann: We will raise a Prometheus conformance program, [for which] we had test suites and such for ages. But we are now making a real push to enable others to certify that they're actually compliant with how bits and pieces of Prometheus work. And if all the relevant components are 100% compliant with the specifications … then you can call yourself Prometheus compliant.
