Amazon DevOps Guru is a relatively new service that uses machine learning to analyze data from AWS services, such as CloudWatch, CloudTrail, X-Ray and AWS Config. It identifies application usage patterns and anomalies and detects potential issues early. It also enables IT admins to analyze past and current issues, visualize findings and send notifications about issues. A key advantage of this service is Amazon's significant data and system management experience, which it uses to train the machine learning algorithms behind DevOps Guru.
What Amazon DevOps Guru provides
DevOps Guru delivers two types of findings: reactive insights and proactive insights. Reactive insights provide a list of issues that are occurring now or have occurred in the past; proactive insights are recommendations to prevent future issues.
IT and cloud admins don't need to configure a long list of parameters -- just the resources they want to monitor. Because Amazon DevOps Guru uses machine learning algorithms to recognize patterns across multiple data sources, it is a powerful tool for identifying anomalies in the resources it monitors. In my experience, at the time of publication, most DevOps Guru findings are reactive insights.
When an admin selects an insight, the console displays detailed information and graphs that describe the anomaly's nature and severity, as well as the main affected metric and other related anomalies. When troubleshooting operational issues, review not only the primary metric, but also related metrics. This can show how multiple metrics relate to each other and guide developers to the root cause of an issue. Seeing the full breadth of metric anomalies in one place helps IT ops pros uncover nonobvious causes and effects of a particular issue.
What to do with an insight
When Amazon DevOps Guru generates an insight, it automatically sends an event to Amazon EventBridge. Application owners can configure EventBridge rules that match specific patterns in the incoming event and route matching events to targets that take corrective action or send notifications. This enables developers to configure responses to specific events -- for example, events that contain a resource name or a specific metric, such as HTTPCode_ELB_5XX_Count or 5xxErrorRate. Targets supported by Amazon EventBridge include Lambda functions, Kinesis streams, Amazon Simple Notification Service topics, Amazon Simple Queue Service queues, state machines managed by Step Functions and AWS Systems Manager Run Command executions. These targets enable admins to automate a wide range of operational tasks in response to insights.
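As a rough illustration of the rule matching described above, the following Python sketch builds an event pattern for newly opened, high-severity insights and checks it against a sample event locally. The source and detail-type strings and the insightSeverity field are assumptions that should be verified against the DevOps Guru documentation; the sample event is hypothetical and heavily trimmed, and the matches helper implements only a simplified, exact-value subset of EventBridge's matching rules.

```python
import json

# Assumed pattern: the "source" and "detail-type" strings and the
# "insightSeverity" field are assumptions, not taken from the article.
HIGH_SEVERITY_INSIGHT_PATTERN = {
    "source": ["aws.devops-guru"],
    "detail-type": ["DevOps Guru New Insight Open"],
    "detail": {"insightSeverity": ["high"]},
}


def matches(pattern, event):
    """Simplified exact-value matching (not the full EventBridge rule set).

    Every key in the pattern must exist in the event. A list in the pattern
    means "any of these values"; a nested dict is matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, heavily trimmed event used only for local experimentation.
sample_event = {
    "source": "aws.devops-guru",
    "detail-type": "DevOps Guru New Insight Open",
    "detail": {"insightSeverity": "high"},
}

if __name__ == "__main__":
    # The JSON form is what an EventBridge rule expects as its event pattern.
    print(json.dumps(HIGH_SEVERITY_INSIGHT_PATTERN, indent=2))
    print(matches(HIGH_SEVERITY_INSIGHT_PATTERN, sample_event))
```

The printed JSON could then be supplied as the event pattern of an EventBridge rule, for example through the console or the EventPattern parameter of the put_rule API, with an SNS topic or Lambda function attached as the target.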
Configuring DevOps Guru is relatively simple, as the only parameters to configure relate to the AWS resources to be analyzed. DevOps Guru offers the option to either choose resources based on tags or CloudFormation stacks or to select all applicable resources in the account. It supports a wide range of AWS resource types, including CloudFront distributions, Application Load Balancers, Amazon Elastic Compute Cloud (EC2) instances, Amazon Simple Storage Service (S3) buckets, Lambda functions, Redshift clusters, Amazon Elastic Container Service (ECS) services and Amazon Relational Database Service (RDS) databases. It also offers a cost estimator that projects charges for the resources selected for analysis. Each resource can cost approximately $2 to $3 per month -- be aware of this per-resource pricing model, and be selective. Choosing to analyze all resources in the AWS account can quickly lead to hundreds of dollars or more spent each month.
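To make the cost arithmetic concrete, here is a minimal Python sketch that multiplies a resource count by the approximate $2 to $3 monthly range quoted above. The function name and default rates are illustrative assumptions; actual charges depend on resource type and current AWS pricing, so treat the result as a ballpark figure only.

```python
def estimate_monthly_cost(resource_count, low_rate=2.0, high_rate=3.0):
    """Return a (low, high) monthly estimate in USD.

    The default rates are the article's approximate per-resource figures,
    not official AWS prices.
    """
    if resource_count < 0:
        raise ValueError("resource_count must not be negative")
    return resource_count * low_rate, resource_count * high_rate


if __name__ == "__main__":
    for count in (10, 100, 500):
        low, high = estimate_monthly_cost(count)
        print(f"{count} resources: ${low:,.0f} to ${high:,.0f} per month")
```

At these rates, 100 analyzed resources already land in the range of $200 to $300 per month, which is why scoping the analysis by tag or CloudFormation stack is usually the safer starting point.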
The console also offers an option to rate each insight with a thumbs-up or thumbs-down icon, which helps refine DevOps Guru's algorithms and future findings. DevOps Guru's machine learning algorithms automatically assign each insight a severity of high, medium or low. If using RDS, enable the Performance Insights feature on the databases to be analyzed. For ECS deployments, enable the Container Insights feature, which publishes an extra set of CloudWatch metrics related to ECS services for DevOps Guru to analyze.
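The two settings mentioned above can also be switched on programmatically. The sketch below assumes the boto3 SDK, where Container Insights is enabled per ECS cluster through update_cluster_settings and Performance Insights per database instance through modify_db_instance. The helper names, the cluster name and the database identifier are hypothetical, and DryRunClient is a stand-in that only records calls so the example runs without AWS credentials.

```python
class DryRunClient:
    """Stand-in for a boto3 client: records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, operation):
        # Invoked only for attributes that do not exist, i.e. API operations.
        def record(**kwargs):
            self.calls.append((operation, kwargs))
            return {}

        return record


def enable_container_insights(ecs_client, cluster_name):
    """Turn on Container Insights for one ECS cluster."""
    return ecs_client.update_cluster_settings(
        cluster=cluster_name,
        settings=[{"name": "containerInsights", "value": "enabled"}],
    )


def enable_performance_insights(rds_client, db_instance_id):
    """Turn on Performance Insights for one RDS database instance."""
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        EnablePerformanceInsights=True,
    )


if __name__ == "__main__":
    # Hypothetical resource names; nothing is sent to AWS in this dry run.
    ecs, rds = DryRunClient(), DryRunClient()
    enable_container_insights(ecs, "example-cluster")
    enable_performance_insights(rds, "example-database")
    print(ecs.calls)
    print(rds.calls)
```

To apply the changes for real, the dry-run objects would be replaced with boto3.client("ecs") and boto3.client("rds"), assuming the caller has the necessary permissions.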
Amazon DevOps Guru delivers useful data that can save DevOps admins a significant amount of time when investigating an operational issue's root cause. It also enables IT admins to automate corrective actions by implementing custom software that reacts to specific insights. While configuring CloudWatch alarms remains essential to prevent and react to operational issues, DevOps Guru complements and strengthens this functionality by introducing machine learning to the metric analysis process.