How AI and ML can transform CloudOps

AI and ML tools support several use cases in cloud operations, such as security, fault correlation and latency management. These best practices can help CloudOps teams take the right steps.

AI and machine learning promise to change almost every aspect of life, so it's not surprising that enterprises hope one or both will also transform cloud operations. Every enterprise wants efficient, error-free IT operations, and AI and ML top the list of technologies they hope can provide them.

ML can harness the insights contained in the records organizations keep on cloud operations and conditions. AI can extrapolate from past trends or broader experience to create a better understanding of the future. Learn key AI and ML use cases in CloudOps, and get familiar with some tool examples.

Planning with AI and ML

Until recently, most AI and ML applications in cloud operations have focused on using data analysis to improve resource utilization and efficiency, primarily at a planning level. AI and ML, but particularly ML, can analyze historical patterns of usage both in terms of offered load -- users and user activity -- and resource commitments. They can also develop optimum practices to balance resource usage over time. These capabilities can help enterprises justify building new cloud applications, and they support periodic reviews of resource usage and cost.
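As a minimal sketch of this kind of planning-level analysis, the Python below summarizes a day of hourly usage samples and suggests a baseline commitment. The sample data, the function name `recommend_commitment` and the rule of committing to the median while covering peaks on demand are illustrative assumptions, not the method of any particular tool.

```python
from statistics import median


def recommend_commitment(hourly_vcpus_used):
    """Summarize historical usage and suggest a baseline commitment.

    Assumption: reserve capacity up to the median usage and cover
    anything above it with on-demand resources.
    """
    if not hourly_vcpus_used:
        raise ValueError("need at least one usage sample")
    baseline = median(hourly_vcpus_used)
    peak = max(hourly_vcpus_used)
    return {
        "committed_vcpus": baseline,
        "on_demand_headroom_vcpus": peak - baseline,
        "peak_vcpus": peak,
    }


# Hypothetical samples: vCPUs in use at each hour of one day.
history = [4, 4, 3, 3, 4, 6, 10, 14, 16, 16, 15, 14,
           14, 15, 16, 16, 14, 12, 10, 8, 6, 5, 4, 4]
plan = recommend_commitment(history)
print(plan)
```

A real planning tool would work from weeks or months of data and would weigh pricing terms as well as usage, but the shape of the analysis is the same: historical load in, a resource commitment recommendation out.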

This planning-level use case has evolved toward more real-time analysis of cloud usage and logs. Users find that AI and ML can spot changes in cloud usage better than human review. The analysis could make recommendations to change resource commitments in real time or even order resources dynamically.
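One simple way to picture how a tool spots a change in usage is a rolling comparison against recent history. The sketch below flags any sample that sits more than a set number of standard deviations from the mean of the preceding window. The function name, window size, threshold and sample values are assumptions chosen for illustration; commercial tools use considerably more sophisticated models.

```python
from statistics import mean, stdev


def find_usage_changes(samples, window=6, threshold=3.0):
    """Return indexes of samples that deviate sharply from the preceding window."""
    flagged = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        mu = mean(recent)
        sigma = stdev(recent)
        if sigma == 0:
            # Flat history: any difference from it counts as a change.
            if samples[i] != mu:
                flagged.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical CPU utilization readings (percent), with one spike.
usage = [50, 52, 49, 51, 50, 48, 51, 50, 95, 52, 49]
print(find_usage_changes(usage))
```

In recommendation mode, a flagged index would prompt a suggestion to the operations team rather than an automatic change in resource commitments.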

But users can be reluctant to trust any automated system to place orders for cloud resources, reduce committed cloud resources or lower cloud cost without significantly affecting application quality of experience (QoE). For now, most users prefer recommendations. This real-time analysis requires application and resource monitoring tools with AI and ML capabilities.

Scaling with AI

AI can be particularly beneficial in scaling. The relationship between user QoE and allocated cloud resources is complex. It's not a simple matter of doubling QoE if enterprises double resources. Generally, the effect of additional resources on factors like response time declines as load increases.
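To see why doubling resources does not double QoE, consider a deliberately simplified queueing model. The sketch assumes load is split evenly across identical instances and that each instance behaves like a basic single-server queue, so mean response time is 1 / (service rate - per-instance arrival rate). The model, the function name and the numbers are assumptions for illustration only; real applications rarely behave this cleanly.

```python
def response_time(arrival_rate, service_rate, instances):
    """Mean response time in seconds under the simplified model.

    arrival_rate: total requests per second offered to the application.
    service_rate: requests per second one instance can complete.
    """
    per_instance_load = arrival_rate / instances
    if per_instance_load >= service_rate:
        return float("inf")  # saturated: the queue grows without bound
    return 1.0 / (service_rate - per_instance_load)


# Hypothetical workload: 90 requests/s, each instance handles 50 requests/s.
for n in (2, 4, 8, 16):
    print(n, "instances:", round(response_time(90, 50, n) * 1000, 1), "ms")
```

Under these assumptions, the first doubling from two to four instances cuts response time from roughly 200 ms to roughly 36 ms, while the doubling from eight to 16 saves only about 3 ms. Each doubling costs proportionally more and buys less.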

AI provides a means of predicting QoE effects before resources are scaled up or down. Thus, it offers a more refined way of managing scaling than allowing resources to increase when load increases, and vice versa. This predictive support is particularly valuable in hybrid and multi-cloud deployments where scaling can cross administrative boundaries.
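Predict-then-scale can be sketched as a search for the smallest resource level whose predicted QoE meets a target. The example below stands in a simple queueing formula for the prediction step, where an AI or ML tool would use a model trained on observed behavior. The function names, the formula and the figures are all assumptions for illustration.

```python
def predicted_response_time(arrival_rate, service_rate, instances):
    """Stand-in prediction: evenly split load, simple single-server queue per instance."""
    per_instance_load = arrival_rate / instances
    if per_instance_load >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - per_instance_load)


def instances_for_target(arrival_rate, service_rate, target_seconds, max_instances=64):
    """Return the smallest instance count predicted to meet the target, or None."""
    for n in range(1, max_instances + 1):
        if predicted_response_time(arrival_rate, service_rate, n) <= target_seconds:
            return n
    return None


# Hypothetical target: 40 ms mean response time; each instance handles 50 requests/s.
print(instances_for_target(90, 50, 0.04))   # current load
print(instances_for_target(180, 50, 0.04))  # forecast load
print(instances_for_target(40, 50, 0.04))   # off-peak load
```

The point of the design is that the scaling decision is evaluated against the QoE target before any resources are ordered or released, which also makes it easy to present the result as a recommendation instead of acting on it.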

Support for scaling requires more specialized real-time AI and ML tools focused on observability at both the application and resource levels.

Latency management

Another related and expanding mission of AI in CloudOps is managing the placement of application components to control latency. IoT applications that gather sensor information to control real-time processes or movements have a specific delay budget. It's difficult for operations teams to manage latency when users have distributed cloud processes among three locations: the edge point of process connection, the cloud and the data center.

If it's possible to instantiate an application component in any of these locations, the optimization of latency and hosting cost is complex. If instantiation must be done quickly, it's likely impossible for operations staff to handle manually.
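Reduced to its simplest form, the placement decision is a constrained choice: pick the cheapest location that still fits the delay budget. The sketch below shows that logic for a single component. The `Location` type, the latency figures and the hourly costs are hypothetical, and a real optimizer would also have to account for the traffic between components, capacity limits and changing conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    name: str
    round_trip_ms: float   # assumed measured latency to the controlled process
    cost_per_hour: float   # assumed hosting cost for the component


def place_component(locations, delay_budget_ms):
    """Return the cheapest location within the delay budget, or None if none fits."""
    eligible = [loc for loc in locations if loc.round_trip_ms <= delay_budget_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda loc: loc.cost_per_hour)


# Hypothetical candidates for one component.
candidates = [
    Location("edge", 5.0, 0.40),
    Location("cloud", 40.0, 0.10),
    Location("data center", 25.0, 0.18),
]

for budget in (10.0, 30.0, 50.0):
    choice = place_component(candidates, budget)
    print(budget, "ms budget ->", choice.name if choice else "no feasible location")
```

With only three candidates and one component, a person could make this choice by inspection. The difficulty the article describes comes from repeating it across many interdependent components, quickly, as latency and cost change.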

Application observability tools with AI and ML support are most suitable for latency management.

Security and compliance

Security and compliance policies are an area where AI and ML can provide significant overall benefits. Policy enforcement using manual tools is always a major challenge because of the work involved and the risk of errors and omissions.

Cloud resource commitments and workflow connections can generate alerts, which AI and ML process against security and compliance policies. ML can spot new issues by comparing the patterns of cloud deployment and connection with past practices. AI does the same to assess patterns versus security and compliance policies.
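The two checks described here, one against explicit policies and one against past practice, can be sketched as follows. This is a rule-based stand-in, not machine learning: a set lookup takes the place of a learned model of past deployment patterns. The policy names, region names, record fields and the function `check_deployment` are all hypothetical.

```python
def check_deployment(deployment, policies, known_patterns):
    """Return a list of findings for one deployment record."""
    findings = []
    # Check 1: explicit security and compliance policies.
    for policy_name, rule in policies.items():
        if not rule(deployment):
            findings.append(f"policy violation: {policy_name}")
    # Check 2: compare with patterns seen in past deployments.
    pattern = (deployment["service"], deployment["region"], deployment["public"])
    if pattern not in known_patterns:
        findings.append("new pattern: not seen in past deployments")
    return findings


# Hypothetical policies: each rule returns True when the deployment complies.
policies = {
    "no public databases": lambda d: not (d["service"] == "database" and d["public"]),
    "approved regions only": lambda d: d["region"] in {"eu-west", "us-east"},
}

# Hypothetical record of past practice.
known_patterns = {
    ("database", "eu-west", False),
    ("web", "us-east", True),
}

suspect = {"service": "database", "region": "ap-south", "public": True}
for finding in check_deployment(suspect, policies, known_patterns):
    print(finding)
```

A deployment can pass every explicit policy and still be flagged as new, which is the case where comparison with past practice adds value over policy checks alone.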

A few specialized security tools with AI and ML are available, but application observability tools can often help.

Alerts and fault correlation

Given that AI and ML can process alerts for security and compliance, it's a small step toward using them for alert management and fault correlation. When done right, AI and ML reduce the chances of a fault storm that can overwhelm operations personnel. If done wrong, AI and ML can introduce errors that are hidden from the operations team. These hidden errors are likely to create major problems in application stability and performance.
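A basic form of alert correlation groups alerts that arrive close together in time, so operators see one incident instead of a storm of individual alerts. The sketch below uses time proximity alone, which is a plain heuristic and an assumption made for illustration; AI and ML tools also weigh topology, dependencies and learned history. The alert fields and the function name are hypothetical.

```python
def correlate_alerts(alerts, window_seconds=60):
    """Group alerts into incidents.

    An alert joins the current incident if it arrives within
    window_seconds of that incident's most recent alert.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if incidents and alert["time"] - incidents[-1][-1]["time"] <= window_seconds:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


# Hypothetical alerts; "time" is seconds since the start of the observation period.
alerts = [
    {"time": 12, "source": "api-2", "message": "request timeouts"},
    {"time": 0, "source": "db-1", "message": "disk latency high"},
    {"time": 30, "source": "web-1", "message": "5xx rate rising"},
    {"time": 600, "source": "batch-3", "message": "CPU saturated"},
]

for incident in correlate_alerts(alerts):
    print([a["source"] for a in incident])
```

The sketch also hints at the risk described above: a grouping rule that is too loose can bury an unrelated fault inside someone else's incident, where the operations team never sees it.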

The difference between the right and wrong use of AI and ML is largely based on training the system on a company's own data. Pre-trained AI and ML tools aren't likely to reflect the specific way enterprises use the cloud and the conditions that are important to them. Specialized AI and ML tools are available for alert filtering and fault correlation, and it's important for teams to try tools out or review features to be sure they fit their mission.

In theory, AI and ML can implement changes and fixes, rather than suggest them, but users tend to be wary of this for two reasons. First, AI and ML tools can make major mistakes, as stories on generative AI show. These mistakes can create problems potentially more serious than the ones that generated the alerts. Second, over time, operations personnel tend to treat closed-loop, reactive AI and ML as autopilot. If they don't follow events, they're at risk of losing their overview of cloud resources and application status, which makes it difficult for them to step in if automated systems fail.

Observability tools

Tools that support these AI and ML transformations in cloud operations are varied. They include generalized AIOps tools, as well as tools designed for broader application and resource observability and alert handling.

Use cases that involve data analysis are often supported by the same business AI and ML analytics products many companies already use. However, more specialized operations-centric tools, such as PagerDuty, might be more efficient.

Other more specialized tools include the following:

  • Observability tools, such as BigPanda, Coralogix, Dynatrace, Netreo and New Relic.
  • Problem monitoring tools, such as LogicMonitor.
  • Root cause analysis tools, such as Moogsoft and Operations Bridge from Micro Focus.

Products designed for generalized AIOps and focused on machine learning, such as Grok, are also applicable to the cloud, including hybrid and multi-cloud.

Defining a data lake

A final but important step in the use of AI and ML in CloudOps is to define a data lake that contains the information the tools use. Where public cloud AI and ML tools are used, a properly defined data lake reduces or eliminates security and compliance risks by stripping any business- or user-critical information before handing data off to AI and ML tools. This sort of data is most likely found in application tracing associated with automated testing.
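Stripping sensitive information before handoff can be as simple as filtering records on the way into the data lake. The sketch below removes a configurable set of fields from nested records and leaves the original untouched. The field names, the sample trace and the function `scrub_record` are assumptions for illustration; which fields count as business- or user-critical is a decision each organization has to make for itself.

```python
# Hypothetical list of fields that must not reach external AI and ML tools.
SENSITIVE_FIELDS = frozenset({"user_id", "email", "account_number"})


def scrub_record(record, sensitive_fields=SENSITIVE_FIELDS):
    """Return a copy of the record with sensitive fields removed at every level."""
    if isinstance(record, dict):
        return {
            key: scrub_record(value, sensitive_fields)
            for key, value in record.items()
            if key not in sensitive_fields
        }
    if isinstance(record, list):
        return [scrub_record(item, sensitive_fields) for item in record]
    return record


# Hypothetical application trace captured during automated testing.
trace = {
    "span": "checkout",
    "duration_ms": 182,
    "user_id": "u-123",
    "tags": {"email": "a@example.com", "region": "eu-west"},
}
print(scrub_record(trace))
```

Removing fields by name is the bluntest approach. It does nothing about sensitive values embedded in free-text fields such as log messages, which need separate handling.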

Even privately hosted AI and ML tools that keep all information on premises can create a security problem if breached. Explicit creation of a data lake encourages teams to examine the specific needs of the AI and ML applications, ensuring proper information is available. All this makes for a better outcome with AI and ML in CloudOps.
