As public cloud vendors continue to roll out new pricing models and lower-cost services, enterprises need to fully understand what they sign up for before they apply these models to production workloads.
Preemptible VMs are the latest service in Google's price-cutting arsenal. These VM types are identical to standard Google Compute Engine (GCE) instances with one significant exception: Google can shut them down anytime it needs the compute capacity for other workloads. Because Google can sell preemptible VMs from capacity that would otherwise be idle, they're available at a discount.
Preemptible pricing is fixed, just like it is for standard GCE instances, with rates starting at $0.01 per hour. Across the range of GCE instance types -- from one to 96 virtual CPUs -- the preemptible discount for on-demand usage runs just under 80%. Given the unpredictable lifetime of preemptible VMs, they don't qualify for additional Google committed use or sustained use discounts.
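To see how that discount works out, consider the smallest standard machine type. The rates below are illustrative, assumed figures for an n1-standard-1 instance; check Google's current pricing page for exact numbers.

```python
# Assumed, illustrative hourly list prices for an n1-standard-1 instance
# in a US region; verify against Google's current pricing page.
on_demand_rate = 0.0475   # USD per hour, standard instance
preemptible_rate = 0.01   # USD per hour, preemptible instance

discount = 1 - preemptible_rate / on_demand_rate
print(f"Preemptible discount: {discount:.1%}")  # just under 80%
```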
The preemption process
Although preemptible VMs can shut down at any time, the service notifies users 30 seconds before termination. Google also terminates any preemptible VM that has run for 24 hours.
According to Google documentation, the preemption process is as follows:
- When GCE needs the capacity, Google sends a preemption notification as an Advanced Configuration and Power Interface (ACPI) G2 Soft Off signal -- a standard motherboard soft shutdown command, which every OS can handle -- that signals the system must shut down.
- Ideally, the Soft Off signal then triggers a shutdown script that users have previously configured to save any system state and application data, terminate processes and stop the VM.
- If the instance is still running after 30 seconds, GCE sends an ACPI G3 Mechanical Off signal to the OS, which is the equivalent of pulling the power on a server.
- The GCE instance then enters a terminated state, which preserves its configuration settings, metadata and attachments to other resources -- such as storage volumes -- but destroys in-memory data and VM state. Users can choose to restart or delete an instance in a terminated state, or leave it terminated indefinitely.
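The essential pattern in that 30-second window is: catch the shutdown, persist whatever state the workload needs, then exit cleanly. The sketch below simulates it in-process by trapping SIGTERM, which is how a Linux service manager typically relays a soft shutdown to running programs; the state contents and checkpoint path are hypothetical, and a real job would copy the checkpoint to Cloud Storage rather than local disk.

```python
import json
import os
import signal
import tempfile

# Hypothetical in-memory state a batch worker would need to preserve.
state = {"processed_items": 1234, "last_offset": 5678}
checkpoint_path = os.path.join(tempfile.gettempdir(), "worker.checkpoint.json")
shutting_down = False

def on_shutdown(signum, frame):
    # Persist state to a checkpoint; this must finish well inside the
    # 30-second grace period before the forced power-off.
    global shutting_down
    with open(checkpoint_path, "w") as f:
        json.dump(state, f)
    shutting_down = True

signal.signal(signal.SIGTERM, on_shutdown)

# Simulate the preemption notice by signaling this process (POSIX only).
os.kill(os.getpid(), signal.SIGTERM)
print("checkpoint written:", shutting_down)
```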
Google documentation includes a sample shutdown script to gracefully terminate processes. Users can manually stop the instance to simulate instance preemption and test a shutdown script.
Since the service doesn't impose a downtime requirement, users can immediately attempt to restart preempted instances. In other words, users shouldn't assume that the Google Cloud Platform (GCP) region is out of capacity just because Google preempted the instance. Google's infrastructure and dynamic mix of workloads mean that capacity might almost instantaneously free up elsewhere.
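That behavior suggests a simple restart policy: try again right away, and back off briefly between failed attempts. The sketch below is self-contained; `start_instance` is a hypothetical stand-in for whatever call actually starts the VM, simulated here with a canned sequence of outcomes.

```python
import time

def restart_with_retries(start_instance, max_attempts=5, base_delay=1.0):
    """Try to restart a preempted instance with exponential backoff.

    start_instance is any zero-argument callable returning True on
    success -- e.g., a wrapper around a Compute Engine start request.
    """
    for attempt in range(max_attempts):
        if start_instance():
            return attempt + 1  # number of attempts used
        time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("instance could not be restarted; retry later")

# Simulated API: no capacity twice in a row, then success.
outcomes = iter([False, False, True])
attempts_used = restart_with_retries(lambda: next(outcomes), base_delay=0.01)
print("restarted after", attempts_used, "attempts")  # restarted after 3 attempts
```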
Still, Google doesn't guarantee availability and does not include preemptible VMs in the GCE service-level agreement.
Preemptible VMs in a cluster
While preemptible VMs would be incredibly disruptive for traditional enterprise applications, such as a database, they are well-suited for distributed systems that run across clusters of machines and are designed to tolerate failures.
There are two primary ways to automate cluster creation on GCP: managed instance groups and container cluster managers. Here's how preemptible VMs work with each:
- Managed instance groups: Instance groups are sets of VMs that scale automatically and that admins can centrally manage and configure using templates. Admins can configure instance groups to use preemptible VMs via the instance template, which also specifies a target size for the number of active instances that the autoscaler should maintain. Instance groups greatly simplify the use of preemptible VMs since, when Google forcibly terminates an instance, the autoscaler will automatically try to recreate it to maintain the target size. This gives users the benefit of cheap preemptible VMs with the convenience of automated capacity management.
- Container cluster managers: Kubernetes Engine and the container clusters it runs are another ideal way to use preemptible VMs. This is because, much like an instance group, the container manager automatically creates VMs. Users can also create mixed clusters of standard and preemptible VMs and use the Kubernetes node taint parameter to ensure that only certain workload pods, such as those that don't need guaranteed availability and performance, are placed on preemptible nodes.
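For the mixed-cluster case, one common arrangement is to taint the preemptible node pool at creation time and give only fault-tolerant pods a matching toleration. The fragment below is an assumed sketch: the taint key and values are illustrative, though GKE does label preemptible nodes with `cloud.google.com/gke-preemptible`; verify the convention against current GKE documentation.

```yaml
# Pod spec fragment for a batch workload allowed on preemptible nodes.
# Assumes the node pool was created with a taint such as
# preemptible=true:NoSchedule (illustrative key/value).
spec:
  tolerations:
  - key: "preemptible"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    cloud.google.com/gke-preemptible: "true"
```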
Aside from the general guidance to use preemptible VMs with distributed systems, there are particular types of applications or usage scenarios that work well under the threat of lost nodes. These include:
- Imaging or rendering workloads that run processes on batches of photos or generated images. For example, Moonbot Studios, a multiplatform storytelling company, runs render nodes both on premises and in GCP and uses preemptible VMs to handle spikes in demand.
- Scientific and statistical simulations with parallelized algorithms that can tolerate lost nodes without disruption, such as a Northeastern University model that simulates the spread of communicable diseases.
- Machine and deep learning model training that similarly runs across a distributed cluster of machines in batch. Preemptible VMs became feasible for AI modeling once Google began to support the attachment of Nvidia Tesla K80 and P100 GPUs.
Other applications can use preemptible VMs as long as they can restart using state saved in a checkpoint file to Google Cloud Storage.
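A minimal checkpoint-and-resume pattern might look like the following sketch. It uses a local file so the example is self-contained; in production the checkpoint would be a Cloud Storage object that a replacement instance fetches after preemption. The progress fields are hypothetical.

```python
import json
import os
import tempfile

checkpoint_path = os.path.join(tempfile.gettempdir(), "job.checkpoint.json")

def load_progress():
    """Resume from a prior checkpoint if one exists, else start fresh."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            return json.load(f)
    return {"next_item": 0}

def save_progress(progress):
    # In production, upload this file to Cloud Storage so a replacement
    # instance can pick it up after the original VM is preempted.
    with open(checkpoint_path, "w") as f:
        json.dump(progress, f)

# Original instance: process some work, checkpoint, then get preempted.
progress = load_progress()
progress["next_item"] = 100
save_progress(progress)

# Replacement instance: resumes where the preempted one left off.
resumed = load_progress()
print("resuming at item", resumed["next_item"])  # resuming at item 100
```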
Preemptible VMs vs. Spot Instances
Preemptible VMs are similar to AWS Spot Instances in that both are susceptible to unanticipated termination. The key advantage to Google's approach is predictable pricing, since Spot Instances are subject to the variable nature of market forces.
AWS, however, provides an element of predictability through Spot blocks, a feature that runs a Spot Instance for a defined duration, from one to six hours, during which AWS won't terminate it due to a price change. This makes AWS better for long-running applications that you can't easily restart from a checkpoint.
In either case, cloud vendors offer significant financial incentives to users that grant them flexibility in system capacity management. Your task is to figure out how and where to take advantage of their offer.