Get started with threshold monitoring
IT monitoring doesn't have to be difficult to set up and use. Learn how to set thresholds and dashboards, know when and how to escalate responses, and keep IT systems humming along.
Monitoring metrics empower an IT team to both optimize and secure services. Thankfully, monitoring platforms offer plenty of customization to fit your organization's needs.
With that flexibility in mind, keep a few crucial considerations in view when you design threshold monitoring for your environment.
Active monitoring has its value, but a great platform makes your involvement much more passive. Thresholds make that automation possible. Once you've established your system's steady state -- or baseline operating conditions -- you can configure thresholds for core system performance indicators. These can include the following:
- process activity
- user activity
- CPU load
- memory consumption
- disk usage
- errors and error logging
- login activity
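To make the link between a baseline and a threshold concrete, here is a minimal Python sketch. It assumes one common heuristic (mean plus a multiple of the standard deviation); the function names, the multiplier k and the sample values are illustrative assumptions, not settings from any particular monitoring tool.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return (baseline, upper_threshold) estimated from steady-state samples.

    The threshold is baseline + k * sample standard deviation. Both the
    formula and the default k are assumptions made for illustration.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    baseline = mean(samples)
    return baseline, baseline + k * stdev(samples)


def breaches(value, threshold):
    """True when a new reading crosses the upper threshold."""
    return value > threshold


# Hypothetical CPU load readings (percent) collected during normal operation.
cpu_samples = [40.0, 42.0, 38.0, 41.0, 39.0]
baseline, limit = baseline_threshold(cpu_samples)

print(f"baseline={baseline:.1f}% threshold={limit:.1f}%")
print("50% breaches:", breaches(50.0, limit))
print("43% breaches:", breaches(43.0, limit))
```

A static limit, such as 90% CPU, works just as well when you already know what is acceptable; the statistical version is mainly useful when you want the limit to follow the measured steady state.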
As the list above shows, threshold monitoring isn't only about performance. It's also about access. You want to know exactly how well your services perform and who's using them. The rules you write will relate to the above categories. Thankfully, defining your rule sets is straightforward:
- Define whether you want each rule to be driven by thresholds (resources) or events (errors and logins). Your monitoring tool will trigger notifications when a threshold is crossed or a matching event occurs.
- Group these written rules into rule sets, based on categorizations and device-specific deployments.
- Assign sets to various devices as needed.
- Update your agents to ensure these rule sets are active.
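The steps above can be sketched in a few lines of Python. This is a simplified model, not the configuration format of any real monitoring product: the Rule and RuleSet classes, the metric names, the limits and the device names are all hypothetical, and the final step (pushing rule sets to agents) is left out because it depends entirely on the tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    name: str
    kind: str      # "threshold" (resources) or "event" (errors, logins)
    metric: str    # key expected in a device's readings
    limit: float   # rule fires when the reading exceeds this value

    def triggered(self, value: float) -> bool:
        return value > self.limit


@dataclass
class RuleSet:
    name: str
    rules: list = field(default_factory=list)


def evaluate(assignments, device, readings):
    """Return the names of the rules that fire for one device's readings."""
    fired = []
    for rule_set in assignments.get(device, []):
        for rule in rule_set.rules:
            if rule.metric in readings and rule.triggered(readings[rule.metric]):
                fired.append(rule.name)
    return fired


# Steps 1 and 2: write rules and group them into rule sets.
resource_rules = RuleSet("resources", [
    Rule("high-cpu", "threshold", "cpu_percent", 90.0),
    Rule("disk-nearly-full", "threshold", "disk_used_percent", 85.0),
])
access_rules = RuleSet("access", [
    Rule("failed-logins", "event", "failed_logins", 5),
])

# Step 3: assign rule sets to devices (hypothetical host names).
assignments = {
    "web-01": [resource_rules, access_rules],
    "db-01": [resource_rules],
}

readings = {"cpu_percent": 95.0, "disk_used_percent": 40.0, "failed_logins": 12}
print(evaluate(assignments, "web-01", readings))
print(evaluate(assignments, "db-01", readings))
```

Because db-01 is not assigned the access rule set, the same readings raise fewer notifications there, which is the point of assigning sets per device.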
Once you establish these notifications, how are they delivered? The two most popular ways are via email and Simple Network Management Protocol (SNMP) traps. The latter method, which you configure with a network management tool, sends unsolicited emergency messages from impacted devices to the management system, so that both your system and you are in the know.
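For the email route, a notification is ultimately just a well-formed message. The sketch below builds one with Python's standard email library; the addresses, subject format and function name are assumptions for illustration, and actual delivery is deliberately left out.

```python
from email.message import EmailMessage


def build_alert_email(device, rule_name, value, limit, sender, recipients):
    """Build (but do not send) an email describing a crossed threshold."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {rule_name} on {device}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"Rule '{rule_name}' fired on {device}.\n"
        f"Observed value: {value}\n"
        f"Configured limit: {limit}\n"
    )
    return msg


# Hypothetical device, rule and addresses.
msg = build_alert_email(
    device="web-01",
    rule_name="high-cpu",
    value=95.0,
    limit=90.0,
    sender="monitor@example.com",
    recipients=["oncall@example.com", "ops@example.com"],
)
print(msg)
```

To deliver the message, you could hand it to smtplib.SMTP(...).send_message(msg) against a mail server you control. In practice, most monitoring tools do this for you once an SMTP server is configured.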
The power of alerts and dashboards
Monitoring thresholds are useful for identifying abnormal spikes or dips in activity, while alerts tell us when our systems can't automatically recover. Two things are key: human intervention and relevance. First, configure your alerts so that they reach actual members of your team, and make sure those notified are able to tackle the problem at hand. Second, system issues must be severe enough to warrant these notifications; otherwise, you'll spam your DevOps team.
By comparison, thresholds are intended to give us a broader view of overall system status. While thresholds tell us something problematic might happen, alerts tell us that it has happened and needs attention. That said, severity does vary. We often assign priority levels to alerts. External alerts for user-facing failures typically take precedence over internal hiccups. When a system outage grows in scale, you might need more hands on deck for remediation. Widespread problems like those, which require significant levels of intervention, will carry higher priority than their smaller counterparts.
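One way to picture that prioritization is a small scoring function. The following Python sketch is only an illustration: the P1 to P3 labels, the widespread_at cutoff and the sample alerts are assumptions, and real teams will define their own levels.

```python
def alert_priority(user_facing, affected_hosts, widespread_at=10):
    """Map an alert to a priority label; P1 is the most urgent.

    user_facing:    True when customers can see the failure.
    affected_hosts: how many hosts report the problem.
    widespread_at:  hypothetical cutoff for calling an outage widespread.
    """
    widespread = affected_hosts >= widespread_at
    if user_facing and widespread:
        return "P1"
    if user_facing or widespread:
        return "P2"
    return "P3"


# Hypothetical open alerts: (description, user_facing, affected_hosts).
alerts = [
    ("nightly batch job retried", False, 1),
    ("checkout API returning errors", True, 25),
    ("internal cache cluster degraded", False, 12),
    ("login page slow on one node", True, 1),
]

# Work the queue from most to least urgent.
queue = sorted(alerts, key=lambda a: alert_priority(a[1], a[2]))
for description, user_facing, hosts in queue:
    print(alert_priority(user_facing, hosts), description)
```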
Dashboards are the lifeblood of your environment. They're the simplest way to visualize performance, and they keep tabs on your prioritized metrics, such as traffic, resource consumption and usage patterns.
You might want to break down your monitoring at the application level. On the other hand, chances are high that you've deployed your microservices atop Kubernetes. You can configure different dashboards to examine both levels of your environment -- observing both the forest and the trees within it.
Start small with application monitoring
Application monitoring is the most granular approach, since it lets you see the effect of each feature, bug fix and security update you apply. This is where CI/CD comes into play. The idea is that your applications are always changing, and these dashboards can help you steer those changes for the better.
Popular tools such as Jaeger and OpenCensus provide numerous features to further those goals. These types of monitoring platforms enable the following:
- tracing and response time analysis
- integration with service meshes and cloud networking resources
- host data inspection
- interfacing with other third-party platforms
This is what we'd consider the micro level, where you can really dig in and unearth the smallest of optimization gremlins.
Oversee an entire DevOps environment
When you're managing multiple containers or networking systems simultaneously, a rich dashboard experience is vital. It's not easy to oversee all aspects of your ecosystem effectively, especially at scale, to keep services running smoothly. With platform oversight, there's a lot to process, and it can be difficult to manually organize all that data cohesively.
Before you can analyze any data, you must ingest it. Prometheus scrapes pertinent metrics from your nodes and containerized applications, stores them as time series and makes them available to dashboards that display them in legible blocks. Prometheus is especially useful at the node level to diagnose failures.
Microservices run on distributed systems. Resources are provisioned, often across multiple servers, and clients (users) send requests to them. Every request uses some degree of system resources, so it's useful to pick a tool that scales readily with activity. Grafana, for example, offers a panel-driven observation environment. DevOps teams can share these panels and capture snapshots of system performance at specific moments in time. This is helpful to track response times, request volumes and network traffic. Here are some other dashboard visualizations you should employ:
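To give a feel for what a scrape returns, here is a deliberately simplified Python parser for the plain-text format that Prometheus targets expose. It is a sketch for illustration, not the official client library: it ignores timestamps, escaped characters in label values and other edge cases, and the sample metrics are made up.

```python
import re

# metric_name{label="value",...} value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Made-up scrape output in the style of a node-level exporter.
scrape = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1.5e+09
"""

for name, labels, value in parse_metrics(scrape):
    print(name, labels, value)
```

Each parsed sample is a metric name, a set of labels and a number. Those labels are what allow a dashboard to slice the same metric by node, device or container.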
- line charts
- heat maps
- flame graphs
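A heat map, for instance, is just a grid of counts: time on one axis, value ranges on the other. The Python sketch below shows how response-time samples could be bucketed into such a grid before a dashboard colors it. The bucket edges, the 60-second time step and the sample data are assumptions chosen for illustration.

```python
from collections import Counter


def heatmap_cells(samples, time_step=60, latency_edges=(0.1, 0.5, 1.0)):
    """Count samples per (time column, latency row) cell.

    samples:       iterable of (timestamp_seconds, latency_seconds) pairs.
    time_step:     width of each time column, in seconds.
    latency_edges: ascending bucket boundaries; row 0 holds the fastest
                   requests and the last row holds everything slower than
                   the final edge.
    """
    cells = Counter()
    for timestamp, latency in samples:
        column = int(timestamp // time_step)
        row = sum(latency >= edge for edge in latency_edges)
        cells[(column, row)] += 1
    return cells


# Hypothetical request log: (seconds since start, response time in seconds).
requests = [(0, 0.05), (30, 0.2), (59, 0.07), (60, 0.7), (125, 2.0)]
cells = heatmap_cells(requests)

for (column, row), count in sorted(cells.items()):
    print(f"minute {column}, latency bucket {row}: {count} request(s)")
```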
Consider open source tools, which play nicely with other tools and are add-on-friendly. That could be crucial as your services grow and priorities shift.
Don't be fooled into thinking this is a one-size-fits-all approach. These tools are highly customizable, yet they take much of the guesswork out of configuration. They're also scalable, so you won't have to constantly seek new tools as your system evolves.
Remember, monitoring doesn't have to be difficult.