Navigate serverless cloud SLAs for event-driven apps

Each cloud provider defines serverless uptime differently. Consider platform reliability, other guarantees and refund rates before building an event-driven app.

George Lawton

Published: 23 Apr 2019

Enterprises rely on SLAs to ensure software vendors follow through on their obligations, but they might need to reset their expectations when it comes to serverless computing.

Historically, cloud providers' service-level agreements (SLAs) have included uptime assurances, with contractual remedies in place if those assurances aren't met. AWS, Google and Microsoft each offer guarantees for their serverless platforms, but the structures and protection levels of those contracts differ from what enterprises have become accustomed to with IaaS SLAs.

With serverless platforms, agility and flexibility supersede infrastructure reliability, said Jennifer Curry, SVP of global cloud services at managed service provider INAP.

"Traditional enterprises or IT managers who approach serverless SLAs like they would traditional IaaS SLAs will be in for a surprise," she said.

Serverless services are regional and aren't tied to specific availability zones, which makes them less prone to disruption, said Sean Feeney, cloud engineering practice director at Nerdery, an IT consultancy. There are also no maintenance windows or scheduled downtimes for serverless computing, which cannot be said for IaaS.

Still, serverless cloud SLA contracts permit several hours of uncompensated service interruptions each month. And even though downtime happens less frequently than it does with IaaS, these serverless agreements set a higher threshold for compensating enterprises when disruptions occur.

Compare vendors' serverless cloud SLAs

In addition to being different from IaaS guarantees, serverless cloud SLAs also vary from vendor to vendor. AWS provides service credits if Lambda's monthly uptime drops below 99.95%. These credits range from 10% to 100% of a user's bill, with the full refund provided if uptime drops below 95%.

To receive the credits, AWS customers must describe the SLA failure via a support ticket by the end of the second billing cycle after the incident occurred. The ticket has to include relevant, redacted logs to prove the SLA failure -- down to five-minute intervals.

For Azure Functions users with standard consumption and App Service Premium plans, Microsoft provides the same 99.95% monthly uptime guarantee. App Service offers some control over the underlying compute, such as Virtual Network connectivity and function warming, at a premium cost.

Azure's service credit ranges from 10% to 25%, with the highest refund given if service drops below 99% uptime. To qualify for the credit, Azure customers must describe the cloud SLA failure -- also through a support ticket -- within two months of the end of the impacted billing cycle.

Google Cloud Platform offers a 99.5% service-level objective (SLO) -- Google's variant of a serverless cloud SLA -- with service credits from 10% to 50% if its Cloud Functions service drops below that threshold. The 50% refund is given when uptime falls below 95%. The burden of proof falls on the customer, who must upload the relevant logs within 30 days.

Clearly, with these three providers, there are differences between the serverless cloud SLA thresholds and refunds. AWS might sound like the clear choice, as it promises the highest refund percentage, but customers should note that each provider defines uptime differently, said James Bowes, CTO of the developer services management platform Manifold.
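To put the three thresholds side by side, the arithmetic below converts each monthly uptime percentage into a downtime budget. It is only a rough sketch: it assumes a 30-day month and treats uptime as a simple share of wall-clock minutes, which is not how any of the three providers actually measures it. The function name is illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(threshold_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month that still meet the uptime threshold.

    Assumption: uptime is a plain fraction of wall-clock minutes.
    """
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100.0 - threshold_pct) / 100.0


# Thresholds as quoted in the article.
thresholds = {
    "AWS Lambda": 99.95,
    "Azure Functions": 99.95,
    "Google Cloud Functions": 99.5,
}

for service, pct in thresholds.items():
    print(f"{service}: {downtime_budget_minutes(pct):.1f} minutes")
```

Under those assumptions, a 99.95% threshold leaves about 21.6 minutes a month before credits apply, while a 99.5% threshold leaves 216 minutes, or 3.6 hours.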

Uptime calculation differences

Customers need to understand the indicators that make up the SLOs and SLAs for their serverless apps. This is difficult because there is no way to look at server uptime or network connectivity for these specific indicators, Bowes said. As a result, service providers base their SLAs on function invocations.

AWS calculates the percentage of successful requests over five-minute intervals and averages this percentage to calculate the monthly uptime. Google defines downtime as every minute where more than 10% of requests have failed. It then calculates the monthly uptime percentage by dividing the minutes in a month without downtime by the total minutes in a month. Microsoft, on the other hand, simply divides the successful serverless invocations by the total invocations during a month to calculate uptime.
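The three approaches can be compared on the same traffic with a short sketch. This follows the descriptions above rather than the wording of any contract. The input format (one `(total, failed)` pair per minute) and the function names are hypothetical, and the treatment of idle periods (an interval or minute with no requests counts as available) is an assumption.

```python
from typing import List, Tuple

# One entry per minute: (total requests, failed requests). Hypothetical format.
MinuteCounts = Tuple[int, int]


def interval_average_uptime(minutes: List[MinuteCounts], interval: int = 5) -> float:
    """Success rate per five-minute interval, averaged (the AWS-style method)."""
    if not minutes:
        raise ValueError("no data")
    rates = []
    for start in range(0, len(minutes), interval):
        chunk = minutes[start:start + interval]
        total = sum(t for t, _ in chunk)
        failed = sum(f for _, f in chunk)
        # Assumption: an interval with no requests counts as fully available.
        rates.append(1.0 if total == 0 else (total - failed) / total)
    return 100.0 * sum(rates) / len(rates)


def downtime_minute_uptime(minutes: List[MinuteCounts], error_threshold: float = 0.10) -> float:
    """Share of minutes in which no more than 10% of requests failed (Google-style)."""
    if not minutes:
        raise ValueError("no data")
    # Assumption: a minute with no requests is not counted as downtime.
    down = sum(1 for t, f in minutes if t > 0 and f / t > error_threshold)
    return 100.0 * (len(minutes) - down) / len(minutes)


def invocation_ratio_uptime(minutes: List[MinuteCounts]) -> float:
    """Successful invocations divided by all invocations (Microsoft-style)."""
    total = sum(t for t, _ in minutes)
    failed = sum(f for _, f in minutes)
    return 100.0 if total == 0 else 100.0 * (total - failed) / total


# One hour of synthetic traffic: a five-minute launch burst that fails entirely,
# followed by 55 quiet minutes with no failures.
sample = [(1000, 1000)] * 5 + [(10, 0)] * 55

print(f"interval average: {interval_average_uptime(sample):.1f}%")
print(f"downtime minutes: {downtime_minute_uptime(sample):.1f}%")
print(f"invocation ratio: {invocation_ratio_uptime(sample):.1f}%")
```

On this synthetic hour, the two time-based methods both report roughly 91.7%, while the invocation ratio reports roughly 9.9%, because most of the hour's requests arrived during the failed burst. The figures illustrate how much the choice of formula matters. They do not show how any provider would rule on a claim.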

With Google's and AWS' uptime calculation methods, an organization could suffer a complete outage during a 30-minute period when it launches a new product, opt not to use the service for the rest of the month, and neither vendor would have violated its SLA, Bowes said. That's because, again, these calculations are based on how long these services are used rather than on their overall uptime status in a given month.

Plan for problems with serverless SLA best practices

It can be difficult to navigate uptime calculations and agreements. Serverless cloud SLA best practices can help.

Share the click-through SLAs for relevant services with the enterprise teams who deal with SLAs, such as IT ops and system architects. It is up to the company to catch SLA violations and request credits accordingly. Put the appropriate level of tooling in place to monitor uptime and collect logs to hold service providers accountable for their advertised serverless cloud SLAs.
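As a rough illustration of the log collection involved, the sketch below groups invocation records into five-minute windows and keeps only the windows that contain failures, which is the granularity AWS asks for in a credit request. The record format (a timezone-aware timestamp plus a success flag) and the function name are assumptions. Real platform logs would need to be parsed into this shape first.

```python
from collections import defaultdict
from datetime import datetime, timezone


def failing_windows(records, window_seconds=300):
    """Return (window start, total, failed) for each window that had failures.

    `records` is an iterable of (timezone-aware datetime, succeeded) pairs,
    a hypothetical format chosen for this sketch.
    """
    buckets = defaultdict(lambda: [0, 0])  # window start (epoch) -> [total, failed]
    for timestamp, succeeded in records:
        start = int(timestamp.timestamp()) // window_seconds * window_seconds
        buckets[start][0] += 1
        if not succeeded:
            buckets[start][1] += 1
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), total, failed)
        for start, (total, failed) in sorted(buckets.items())
        if failed
    ]


# Made-up records for illustration only.
records = [
    (datetime(2019, 4, 23, 12, 0, 10, tzinfo=timezone.utc), True),
    (datetime(2019, 4, 23, 12, 1, 0, tzinfo=timezone.utc), False),
    (datetime(2019, 4, 23, 12, 7, 30, tzinfo=timezone.utc), True),
    (datetime(2019, 4, 23, 12, 12, 0, tzinfo=timezone.utc), False),
]

for start, total, failed in failing_windows(records):
    print(f"{start.isoformat()}: {failed} of {total} invocations failed")
```

A summary like this could accompany the redacted logs in a support ticket, though each provider sets its own evidence requirements.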

Enterprises are comfortable with IaaS monitoring because it's essentially an extension of their existing practices for tracking on-premises virtualized environments. However, serverless computing requires a different set of tools to analyze logs, identify service-level breaches and prepare a request for remedies, Feeney said.

Another serverless cloud SLA best practice is to review each service provider's execution model for serverless functions. Each platform and programming language has its own cold-start delays, which affect how quickly requests are served after idle periods. Depending on the desired SLA, users may need to choose a language with a fast start time, keep their functions hot by running synthetic requests or use a dedicated, non-native functions-as-a-service platform -- such as Apache OpenWhisk -- on top of dedicated virtual machines, Bowes said.

There are many points of failure in a serverless deployment outside the service itself, so users should incorporate other safeguards beyond serverless cloud SLAs and consider ways to minimize user impact. For example, the network plays a big role in application uptime, and a virtual private cloud can add redundancies and protections that act as additional forms of service guarantee.

Users should architect applications to work well in a decomposed manner to address serverless SLAs, said Alex Higgins, DevOps technical lead for Candid Partners, a cloud consultancy. Developers and IT managers should focus on how serverless enables better scalability. By breaking apps into smaller pieces, IT teams can improve communication between components and replicate key functionality elsewhere when failures happen.

Also, users should understand the fault and failure domains of the components these functions are built on and architect the service accordingly. For example, if a user accepts all of AWS' U.S. East regions as a logical failure domain, they should duplicate the required production services out of U.S. West in case of a failure event.
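One way to picture that duplication is a caller that tries the primary region first and falls back to a copy in another region. The sketch below is deliberately generic: the two functions stand in for real regional invocations, all names are hypothetical, and a production setup would more likely handle failover with DNS or a load balancer than in application code.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(invokers: Sequence[Callable[[], T]]) -> T:
    """Try each regional invoker in order and return the first successful result."""
    last_error = None
    for invoke in invokers:
        try:
            return invoke()
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
    raise RuntimeError("all regions failed") from last_error


# Stand-ins for real calls to functions deployed in two regions.
def primary():
    raise TimeoutError("simulated outage in the primary region")


def secondary():
    return "handled by secondary region"


print(call_with_failover([primary, secondary]))
```

The order of the list defines the failover priority, so the same helper covers more than two regions.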

Teams should also read and discuss Google's handbook on site reliability engineering for guidance on building teams and infrastructure with a serverless mindset.
