The business case for a multi-cloud AI strategy

Using a single cloud provider for AI infrastructure can have its limitations. Discover the benefits businesses can realize by opting for a multi-cloud AI strategy.

AI has a computing problem.

The computing capacity and power required to train and operate AI platforms pose an enormous infrastructure challenge for AI adopters. The largest AI models contain as many as trillions of parameters and are trained on increasingly vast data sets, and evolving AI techniques such as post-training scaling and test-time scaling can use 30 to 100 times the compute needed for conventional model training or inference.

There are numerous ways to address AI's computational challenges, such as developing more energy-efficient coprocessors, integrating specialized chips for inference, and shifting toward smaller, more specialized models that require fewer resources. Still, the underlying problem for everyday businesses remains unchanged: Where is a business supposed to find all this AI infrastructure?

While the cloud offers a viable answer to AI's computational challenges through on-demand compute and infrastructure, it's not without its own limitations. To find a more flexible, tailored approach to AI infrastructure, many businesses are flocking to multi-cloud strategies.

AI and the cloud

Most businesses cannot afford to build and maintain an IT infrastructure capable of training machine learning (ML) models and operating AI systems as a direct in-house project. The time and costs are simply too high, and the long-term utilization could be too low.

The most efficient choice is to turn to the public cloud. A public cloud can provide the enormous computational and storage resources needed to train models and deploy AI systems. Cloud providers such as AWS, Azure and GCP have developed expertise in their infrastructure and invested in specialized hardware and low-latency networking. A public cloud also offers effective tools, vast scalability and a pay-as-you-go cost structure that complement a business's AI needs.

Figure: Public cloud main benefits, top purchase drivers and top features. The benefits of using a public cloud have made many businesses flock to the cloud for their AI computing needs.

Although a cloud computing provider can address many of the practical challenges of AI training and deployment, relying on a single provider has limitations.

A single cloud provider might not offer the best resources and services a business needs. For example, if an AI project can benefit from TPUs, and a cloud provider only offers GPUs, the project might take longer or cost more. Likewise, relying on a single cloud provider makes an AI project dependent on the provider's resilience and performance. Cloud outages happen, and downtime can leave a business's vital AI system unavailable for prolonged periods.

Additionally, emerging issues such as data sovereignty, regulatory constraints and legal exposure limit how users can deploy and operate AI systems. A single cloud provider might not have the global footprint needed to operate an AI system in every geopolitical region where the business must operate.

Relying on a single provider also often leads to vendor lock-in, making a business dependent on its chosen provider's APIs, tools and services. Deep familiarity with a provider's offerings can be beneficial, but exclusive reliance on them can make it incredibly difficult to switch providers if the need arises.

The move to multi-cloud AI

Businesses can address the limitations of using a single cloud provider by developing a multi-cloud AI strategy. In simple terms, a multi-cloud AI strategy is an organization's tailored approach to utilizing the resources, services and tools of two or more cloud providers within an AI system. A multi-cloud strategy for AI can deliver several benefits:

  • Performance and cost management. With a multi-cloud strategy, a business can select the most cost-effective services or the most attractive discounts for various elements of the AI platform to achieve optimal AI system performance while cutting cloud costs.
  • Resilience. A business can duplicate AI system deployments across different providers and availability zones, enabling a mix of inference load balancing and resilience that enhances responsiveness and prevents AI system interruptions caused by provider- or location-specific outages.
  • Best-of-breed. A multi-cloud strategy enables businesses to use the best resources, services or tools for various aspects of their AI systems. A business might train an ML model using Google Cloud Vertex AI, then use Azure AI services (formerly Azure Cognitive Services) for image or speech recognition, along with Amazon SageMaker for inference.
  • AI vendor lock-in avoidance. Committing to a single cloud vendor's tools, APIs and roadmap can put cutting-edge AI projects at a potential disadvantage. Using two or more public cloud providers goes beyond best-of-breed to free businesses from lock-in and enhance system performance, localization and cost management.
  • Data storage and analytics. A business can employ a multi-cloud strategy to blend data storage from one provider with advanced analytics from another. An AI system might use massive data lakes built on AWS with Google Cloud for analytical services.
  • AI governance and risk management. AI systems are increasingly subject to expanding governance and compliance requirements. A multi-cloud strategy can help an AI system comply with data and workload sovereignty requirements by keeping data and workloads contained to specific regions where different cloud providers operate. Two or more cloud providers can also help meet the resilience and availability requirements that governance rules impose.
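
The resilience benefit above can be illustrated with a minimal failover sketch. It assumes each provider's inference endpoint is wrapped in a plain Python callable; the names `cloud_a_infer`, `cloud_b_infer` and `infer_with_failover` are hypothetical stand-ins, not any provider's API, and a production system would add health checks, timeouts and load balancing.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when no provider could serve the request."""


def infer_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (name, result)
    from the first provider that answers without raising."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two providers' inference endpoints.
def cloud_a_infer(prompt: str) -> str:
    raise ConnectionError("simulated regional outage")


def cloud_b_infer(prompt: str) -> str:
    return f"echo: {prompt}"


served_by, answer = infer_with_failover(
    "hello", [("cloud-a", cloud_a_infer), ("cloud-b", cloud_b_infer)]
)
print(served_by, answer)
```

In this sketch the simulated outage at the first provider is absorbed and the request is served by the second, which is the behavior a duplicated deployment is meant to provide.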

Multi-cloud vs. hybrid cloud AI strategy

In a multi-cloud environment, multiple public cloud providers are used; in a hybrid cloud, a private cloud is connected to a public cloud. The two paradigms typically serve different goals:

  • A multi-cloud AI strategy combines cloud services at scale while mitigating cloud vendor lock-in. This brings management complexity and can pose greater risks across multiple providers and their differing APIs.
  • A hybrid cloud AI strategy supports security, ensuring that sensitive data remains local for compliance and governance purposes. Hybrid clouds are also used for cost management and data integration, though they require extensive in-house expertise to build and maintain.

It is possible to blend the two strategies. For example, a business might use a hybrid cloud to develop and test models locally, deploy the models to a public cloud for training and validation, and then expand the deployment to a multi-cloud environment for production operation.

Considerations of a multi-cloud AI strategy

Multi-cloud AI strategies demand careful attention to factors such as cloud complexity, interoperability, data portability, security and compliance. Important considerations when building a multi-cloud AI strategy typically include the following:

  • Service selection. This is the best-of-breed issue -- matching each specific AI task to the cloud provider best equipped to deliver that service. Select services based on factors such as ease of integration, performance, adherence to security and compliance requirements, cloud reliability and availability, and total cost.
  • Infrastructure security. Multiple cloud providers require multiple infrastructure implementations using each provider's APIs to select resources and services. Employ a zero-trust security approach to prevent unintended or malicious access. Also, employ a resilient low-latency network connection between cloud environments.
  • Data security and compliance. Apply consistent data security policies, such as encryption and role-based access control, across every cloud provider. Consistent security policies ensure both adequate data protection and that multiple cloud providers meet compliance requirements.
  • Data and AI portability. Cloud providers still use data egress fees to incentivize users to keep their data in the provider's cloud. Other data lock-in mechanisms can include proprietary data formatting. Assess whether data needs to move between providers and determine the time and cost required to make such migrations practical. Use containerization and cloud-agnostic components to enable AI models and other application components to move easily between providers.
  • Centralized monitoring. Monitoring individual cloud providers can be time-consuming. Use third-party multi-cloud tools to monitor performance, manage resources, enforce policies and optimize costs among cloud providers. Seek a single-pane-of-glass monitoring platform with tools such as Datadog, Dynatrace, IBM Instana, New Relic and Splunk. FinOps practices can also help oversee cloud usage and spending.
  • Skills and expertise. Building and operating AI applications across multiple clouds demands significant expertise with each provider, making it more difficult and costly for a business to staff. Consider the teams and skills available for successful multi-cloud ventures. It might be necessary to enhance staffing and develop vital cross-training to support complex multi-cloud projects.
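
The portability item above recommends determining the time and cost of moving data between providers. A rough first-pass estimate might look like the sketch below. The per-gigabyte rate, link speed and efficiency factor are placeholder assumptions, not any provider's published pricing, and the function name `estimate_migration` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationEstimate:
    egress_cost_usd: float
    transfer_hours: float


def estimate_migration(
    data_gb: float,
    egress_usd_per_gb: float,
    link_gbps: float,
    link_efficiency: float = 0.7,
) -> MigrationEstimate:
    """Estimate egress fees and wall-clock transfer time for one bulk move.

    link_efficiency is the assumed fraction of nominal bandwidth that is
    actually sustained; 0.7 is only a placeholder default.
    """
    if data_gb < 0 or egress_usd_per_gb < 0:
        raise ValueError("data size and egress rate must be non-negative")
    if link_gbps <= 0 or not 0 < link_efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    cost = data_gb * egress_usd_per_gb
    gigabits = data_gb * 8  # gigabytes to gigabits
    seconds = gigabits / (link_gbps * link_efficiency)
    return MigrationEstimate(egress_cost_usd=cost, transfer_hours=seconds / 3600)


# Placeholder inputs: 50,000 GB at an assumed $0.05/GB over a 10 Gbps link.
estimate = estimate_migration(50_000, 0.05, link_gbps=10, link_efficiency=0.8)
print(f"${estimate.egress_cost_usd:,.2f} over {estimate.transfer_hours:.1f} hours")
```

With these placeholder inputs, the estimate works out to $2,500 in egress fees and roughly 13.9 hours of transfer time. Real figures depend on each provider's tiered pricing and the actual network path, so the numbers are a starting point for the assessment rather than a quote.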

Managing the complexities of a multi-cloud AI strategy

Using multiple public clouds can benefit AI platforms, but adopters must prepare to mitigate the complexities of a multi-cloud AI strategy. Mitigation approaches include the following:

  • Embrace standardization. Differences between cloud environments can cause costly human errors. Use standardization to create homogeneous AI environments that yield consistent and reproducible baselines. This reduces errors, eases scaling and speeds interactions between cloud providers. Version-controlled infrastructure-as-code approaches are ideal for standardization.
  • Build for portability. Build AI applications using advanced software design paradigms, including microservices, on container platforms such as Docker. Pair container engines with orchestration and automation systems such as Kubernetes. This significantly enhances application portability, improves scalability and supports standardization in deployment and operation.
  • Observe and improve. Automation and orchestration are vital in multi-cloud environments. Track performance and spending, continuously optimize environments and rely on automation for scaling, cost control and resilience. Clear visibility and continuous improvement can help establish effective control and reduce complexities.
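
One way to picture the portability advice above is a provider-neutral interface that application code depends on, with a thin adapter per cloud behind it. The sketch below is illustrative only: `ObjectStore`, `InMemoryStore`, `publish_model` and `copy_between` are hypothetical names, and the in-memory class stands in for real adapters that would wrap each provider's storage SDK.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Provider-neutral storage interface the AI application codes against."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in adapter; a real one would call a provider's storage SDK."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def publish_model(store: ObjectStore, name: str, weights: bytes) -> str:
    """Write model weights through the neutral interface and return the key."""
    key = f"models/{name}/weights.bin"
    store.put(key, weights)
    return key


def copy_between(src: ObjectStore, dst: ObjectStore, key: str) -> None:
    """Move one object between providers without provider-specific code."""
    dst.put(key, src.get(key))


cloud_a = InMemoryStore("cloud-a")
cloud_b = InMemoryStore("cloud-b")
model_key = publish_model(cloud_a, "demo", b"\x00\x01")
copy_between(cloud_a, cloud_b, model_key)
print(model_key, cloud_b.get(model_key) == cloud_a.get(model_key))
```

Because `publish_model` and `copy_between` only know about the interface, switching or adding a provider means writing one new adapter rather than touching application logic. Containers and orchestration handle the same concern at the deployment layer.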

Stephen J. Bigelow, senior technology editor at TechTarget, has more than 30 years of technical writing experience in the PC and technology industry.
