Optimizing hybrid cloud architecture for AI workloads

Hybrid cloud architecture can help with the unique scalability, hardware and governance needs of AI workloads. But is it the right choice for your business?

Using a hybrid cloud architecture to deploy AI workloads often makes sense. Hybrid cloud delivers the scalability and flexibility that's usually hard to implement on-premises, and tighter privacy and governance controls than are achievable in a standard public cloud environment.

But that doesn't mean that hybrid cloud is right for every AI workload -- or that all types of hybrid cloud architectures will work equally well for AI. Read on for guidance on whether to choose a hybrid cloud environment for AI, and how to optimize it if you do.

The unique demands of AI workloads

While AI workloads vary in type and requirements, many have specific needs that tend to matter less for other types of applications or services:

  • Scalability. AI workloads often must process massive amounts of data during both training and inference. They need infrastructure that can scale so that compute or storage constraints don't limit how much data the AI can handle.
  • Cost variability. The amount of data that AI systems process can fluctuate over time. By extension, the cost of processing that data can also vary, which is another reason why it's valuable to have a flexible infrastructure where a business pays only for the compute and storage it actually uses.
  • Specialized hardware. To perform optimally at scale, AI workloads might require special types of hardware, like GPUs or ASICs. Specialized hardware excels at enabling high levels of parallel computing, an ability that boosts the performance of most types of AI models.
  • Data privacy. If AI workloads process sensitive data, as many do, they must comply with governance policies and regulations designed to restrict access to that data.
  • Data sovereignty. Some AI workloads might also need to ensure the data they ingest and generate resides in a specific country or region.
  • Geographic distribution. The consumers of AI services can be distributed across various geographic locations. Given that, AI often works best when the models also simultaneously reside in multiple locations. This reduces network latency and improves performance when connecting to AI services.

How hybrid cloud benefits AI

Hybrid cloud infrastructure, which combines public cloud resources with those hosted on-premises, is a natural way to meet many of the challenges described above. Specifically, hybrid cloud offers the following key benefits for AI:

1. Scalability

Unlike an environment built on on-premises infrastructure alone, a hybrid cloud can scale easily. If AI workloads need more compute or storage resources, they can be provisioned using public cloud infrastructure, where capacity is virtually unlimited.

2. Privacy and security

The on-premises component of a hybrid cloud delivers greater privacy and security than conventional public clouds. While there's nothing inherently insecure about public cloud infrastructure, organizations typically get more opportunities to implement extensive governance, data separation and network isolation controls on-premises or in a hybrid cloud that includes on-premises resources.
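As a rough illustration of how privacy and data sovereignty requirements can translate into placement rules, the Python sketch below filters candidate hosting sites for a workload. Everything in it is hypothetical: the `Workload` and `Site` types, the site names and the rule that sensitive workloads run only on-premises are assumptions made for the example, not a recommended policy. As noted above, public cloud is not inherently insecure, so a real policy would likely be more nuanced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Workload:
    name: str
    handles_sensitive_data: bool
    required_region: Optional[str] = None  # e.g. "eu" for a residency rule


@dataclass(frozen=True)
class Site:
    name: str
    kind: str    # "on_prem" or "public_cloud"
    region: str


def eligible_sites(workload: Workload, sites: List[Site]) -> List[Site]:
    """Return the sites where the workload may run under the assumed rules."""
    result = []
    for site in sites:
        # Assumed rule: sensitive data stays on infrastructure the business owns.
        if workload.handles_sensitive_data and site.kind != "on_prem":
            continue
        # Assumed rule: a residency requirement restricts the region.
        if workload.required_region and site.region != workload.required_region:
            continue
        result.append(site)
    return result


# Hypothetical inventory of a small hybrid environment.
SITES = [
    Site("dc-frankfurt", "on_prem", "eu"),
    Site("cloud-eu-west", "public_cloud", "eu"),
    Site("cloud-us-east", "public_cloud", "us"),
]

if __name__ == "__main__":
    fraud_model = Workload("fraud-model", handles_sensitive_data=True, required_region="eu")
    print([s.name for s in eligible_sites(fraud_model, SITES)])
```

A workload with no sensitivity or residency constraints remains eligible for every site, which is where the scalability of the public cloud portion comes into play.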

3. Hardware customizability

The on-premises infrastructure of a hybrid cloud enables the use of custom or specialized hardware for AI without limitations. When a business owns the infrastructure, it can deploy any device it wants. It also has full control of device configuration.

Many traditional public clouds also offer servers with specialized hardware options, like GPUs. However, the selection is limited to the device types that the public cloud vendor provides, and customers have limited control over device configuration. As a result, hybrid cloud is a superior approach if hardware customizability is a priority.

4. Latency minimization

Hybrid clouds provide multiple potential locations for hosting workloads -- on-premises as well as in the public cloud regions that are part of the hybrid cloud environment. This feature helps minimize AI workload latency and boost performance.

Here again, it's worth noting that public clouds also offer locational flexibility. However, it's limited because customers don't have as much control over where workloads reside as they would if they used a hybrid cloud. In a hybrid cloud, they can deploy infrastructure in on-premises locations, as well as private data centers or colocation facilities.
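To make the latency point concrete, here is a minimal Python sketch that picks the hosting location with the lowest measured latency for a given group of users. The site names and latency figures are invented for illustration, and in practice the measurements would come from whatever monitoring tooling an organization already uses.

```python
from typing import Dict


def pick_lowest_latency(latency_ms_by_site: Dict[str, float]) -> str:
    """Return the name of the site with the smallest measured latency."""
    if not latency_ms_by_site:
        raise ValueError("no candidate sites")
    return min(latency_ms_by_site, key=latency_ms_by_site.get)


# Hypothetical round-trip latencies, in milliseconds, as seen by one user group.
measured = {
    "on-prem-chicago": 12.0,
    "cloud-us-east": 28.5,
    "cloud-eu-west": 95.0,
}

if __name__ == "__main__":
    print(pick_lowest_latency(measured))
```

The more candidate locations an environment has, including on-premises sites and colocation facilities, the more likely it is that one of them sits close to a given set of users.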

5. Cost savings

A hybrid cloud can reduce AI workload operational costs. It makes it easy to scale infrastructure down during periods of reduced demand, saving money. A standard private cloud or on-premises environment, where a business buys hardware upfront, can't do that.

From a Capex perspective, a hybrid cloud can also reduce the amount of money organizations need to invest upfront in hardware. They can buy some on-premises or private servers, while paying as they go for the infrastructure they consume in the public cloud part of the hybrid environment.
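The cost reasoning above can be sketched with simple arithmetic. The Python example below compares an on-premises-only setup sized for peak demand against a hybrid setup that owns a smaller baseline and bursts to the public cloud. The demand profile and per-unit prices are made up, and the model deliberately ignores real-world factors such as data transfer fees, management software and staffing.

```python
from typing import List


def on_prem_only_cost(hourly_demand: List[int], owned_unit_hour_cost: float) -> float:
    """Owned capacity is sized for peak demand and paid for every hour."""
    peak = max(hourly_demand)
    return peak * owned_unit_hour_cost * len(hourly_demand)


def hybrid_cost(
    hourly_demand: List[int],
    baseline_units: int,
    owned_unit_hour_cost: float,
    cloud_unit_hour_cost: float,
) -> float:
    """Own a fixed baseline; rent only the units that exceed it in each hour."""
    owned = baseline_units * owned_unit_hour_cost * len(hourly_demand)
    burst_unit_hours = sum(max(0, d - baseline_units) for d in hourly_demand)
    return owned + burst_unit_hours * cloud_unit_hour_cost


# Hypothetical figures: GPU units needed per hour, with one short spike.
spiky_demand = [4, 4, 4, 10, 4, 4]
OWNED_COST = 1.0   # assumed amortized cost per owned unit-hour
CLOUD_COST = 3.0   # assumed pay-as-you-go cost per cloud unit-hour

if __name__ == "__main__":
    print(on_prem_only_cost(spiky_demand, OWNED_COST))          # 60.0
    print(hybrid_cost(spiky_demand, 4, OWNED_COST, CLOUD_COST))  # 42.0
```

With these assumed numbers, the hybrid setup is cheaper because the spike is brief. If demand stayed at 10 units every hour, the same function would return 132.0 against 60.0 for owned hardware, which shows that the savings depend on how variable the workload is.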

6. Sustainability

The sustainability of data centers and cloud infrastructure varies depending on many factors, such as the power sources and cooling systems they use. Generally, the larger the data center, the more energy- and water-efficient it is, primarily due to economies of scale. For example, in 2023, IDC found that public cloud data centers were 4.7 times more carbon-efficient during operation than smaller, private data centers. However, the report didn't consider the sustainability effect of data center buildout, which is another major consideration.

From a sustainability perspective, this means that hybrid cloud lets businesses deploy AI on public cloud infrastructure, which is generally more sustainable, while still applying the privacy and security controls associated with private infrastructure. This is an important consideration given growing concerns about the environmental effects of AI.

Hybrid cloud implementation challenges and best practices

While hybrid cloud can be a best-of-both-worlds option for hosting AI, it also presents some notable challenges and drawbacks:

1. Complexity

Managing both public cloud and private infrastructure simultaneously is more complex than using just one. Hybrid cloud management platforms, which include open source options like Kubernetes, as well as cloud vendor offerings such as AWS Outposts and Microsoft Azure Arc, can streamline the task. However, they still demand more technical expertise, and leave more variables to manage, than businesses would face using just the public cloud or just a private cloud.

2. Higher costs

While hybrid clouds can optimize spending by scaling down infrastructure when it's not needed, they might also end up being more costly. This is partly because of the cost of hybrid cloud management software, which is an added expense unless an organization uses a free, open source option. Higher costs can also stem from the risk that complexity might keep an organization from effectively optimizing its infrastructure costs. To address this challenge, it's critical to track infrastructure spending and embrace FinOps best practices.

3. Lock-in

Hybrid cloud management frameworks can present a lock-in risk because it can be difficult to migrate AI workloads to a different hybrid cloud platform. Choosing an infrastructure-agnostic, open source hybrid cloud platform, such as Kubernetes, can reduce this risk. 

4. Data decentralization

AI workloads hosted on a hybrid cloud might end up storing data in multiple, disparate locations. This is a benefit if it reduces latency and improves locational flexibility. But it can also make it challenging to access or manage data. For this reason, it's essential to plan a data architecture with controls that can operate across a distributed environment and to ensure that data sources hosted in disparate locations integrate with one another.

Is hybrid cloud right for your enterprise AI strategy?

So, is hybrid cloud the best way to host your business's AI workloads? Or should you stick with either a traditional public cloud infrastructure or a private cloud?

The answer comes down, in part, to whether the benefits -- like scalability and deeper control over infrastructure -- outweigh the challenges, such as added complexity and lock-in risks. To figure this out, consider the following:

  • AI maturity level. If you're still experimenting with AI, you probably stand to gain less from hosting it in a hybrid cloud than you would if you had gone all-in on AI and needed to maximize the efficiency and security of large-scale AI workloads.
  • AI workload requirements. The types of AI workloads you're deploying are a factor to consider. For example, if you're training custom models, the ability to use custom AI accelerator hardware in a hybrid cloud is probably more important than it would be if you were simply deploying pretrained models.
  • Industry and sector. Organizations in highly regulated verticals, such as finance or government, are likely to benefit most from a hybrid cloud, as it can help them meet specific compliance and security requirements related to AI.
  • Business size. Larger businesses tend to have a greater need to distribute workloads across multiple areas and serve disparate sets of users. This need generally translates to more benefit from a hybrid cloud.
  • Technical expertise. The greater a business's IT talent resources are, the more it can take advantage of a hybrid cloud for AI. Businesses with limited IT staff or expertise might benefit from sticking with a simpler cloud architecture.
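One informal way to work through the considerations above is to record them as yes/no answers and list which ones point toward hybrid cloud. The Python sketch below does only that. The `Profile` fields are a hypothetical simplification of the five factors, and the result is a discussion aid, not a scoring model or a recommendation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Profile:
    ai_in_production_at_scale: bool   # AI maturity level
    trains_custom_models: bool        # AI workload requirements
    highly_regulated: bool            # Industry and sector
    serves_distributed_users: bool    # Business size
    has_strong_it_team: bool          # Technical expertise


def factors_favoring_hybrid(profile: Profile) -> List[str]:
    """List the factors from the checklist that lean toward hybrid cloud."""
    checks = {
        "AI maturity level": profile.ai_in_production_at_scale,
        "AI workload requirements": profile.trains_custom_models,
        "Industry and sector": profile.highly_regulated,
        "Business size": profile.serves_distributed_users,
        "Technical expertise": profile.has_strong_it_team,
    }
    return [name for name, favors_hybrid in checks.items() if favors_hybrid]


if __name__ == "__main__":
    # Hypothetical example: a regulated firm running pretrained models at scale.
    example = Profile(True, False, True, False, True)
    print(factors_favoring_hybrid(example))
```

The factors are left unweighted on purpose, because their relative importance varies by organization.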

Chris Tozzi is a freelance writer, research adviser, and professor of IT and society. He has previously worked as a journalist and Linux systems administrator.
