Google Cloud outages resolved, but stakes still high for users

Several Google Cloud services suffered some recent hiccups, but analysts gave the service provider high marks for quick resolution and transparency.

Chris Kanaracus

Published: 15 Mar 2019

Although this week's Google Cloud outages and performance degradations were quickly repaired and ultimately had limited impact, they served as a reminder to customers to keep pressure on vendors to improve cloud reliability.

Two Google Cloud services suffered interruptions on March 11:

Google Cloud Console, which customers use to manage their accounts and projects; and
Cloud Dataflow, a service used to process batch and stream data.

Cloud Console was unavailable for four hours because of a code change in the most recent version of Google Cloud's quota system, which rate-limits user requests, according to Google's post-mortem. The bug caused the system to fall back to a lower rate limit, which resulted in denied requests.
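To illustrate how a fallback to a lower rate limit can turn into denied requests, here is a minimal Python sketch. It is not Google's quota system: the names QuotaChecker, load_limit and count_denied, the config key and both limit values are hypothetical, chosen only to show the general pattern.

```python
from dataclasses import dataclass

NORMAL_LIMIT = 100   # hypothetical: requests normally allowed per window
FALLBACK_LIMIT = 10  # hypothetical: conservative default used when lookup fails


@dataclass
class QuotaChecker:
    """Counts requests in a single window against a fixed limit."""
    limit: int
    used: int = 0

    def allow(self) -> bool:
        # Admit the request only while quota remains in this window.
        if self.used < self.limit:
            self.used += 1
            return True
        return False


def load_limit(config: dict) -> int:
    """Return the configured limit, or the fallback if it cannot be read."""
    try:
        return int(config["requests_per_window"])
    except (KeyError, TypeError, ValueError):
        return FALLBACK_LIMIT


def count_denied(config: dict, requests: int) -> int:
    """Send `requests` requests through a fresh checker and count denials."""
    checker = QuotaChecker(limit=load_limit(config))
    return sum(1 for _ in range(requests) if not checker.allow())


# Same traffic, two configurations: one healthy, one where the limit is missing.
healthy = count_denied({"requests_per_window": NORMAL_LIMIT}, 50)
broken = count_denied({}, 50)
print(healthy, broken)
```

In this sketch the traffic never changes. Only the limit the checker ends up with does, which is why a fallback path that looks safe can still deny otherwise legitimate requests.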

The Dataflow problem, which caused system lag that spanned more than 19 hours, was fully resolved on March 12. Google said it continues to investigate the cause.

Meanwhile, Google Cloud Storage experienced elevated error rates across all regions for a four-hour period on March 12. The problem stemmed from actions taken by Google site reliability engineers (SREs), according to the post-mortem report.

On March 11, Google SREs discovered a surge in storage use for metadata connected to Google's internal blob storage service. To reduce this usage, SREs made a configuration change that caused a part of the system that looks up the location of blob data to overload, and that increased load eventually led to a "cascading failure," Google said.
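The general shape of a cascading failure can be sketched in a few lines of Python. This is a generic toy model, not a description of Google's blob storage internals: the function name surviving_servers, the even load split and the capacity figures are all assumptions made for illustration.

```python
def surviving_servers(capacities: list[float], total_load: float) -> int:
    """Spread total_load evenly across servers and drop any that are
    overloaded, repeating until the remaining set is stable.

    Returns the number of servers still up (0 means total failure).
    """
    alive = list(capacities)
    while alive:
        share = total_load / len(alive)
        still_up = [c for c in alive if c >= share]
        if len(still_up) == len(alive):
            break  # every remaining server can carry its share
        alive = still_up  # failed servers shed load onto the survivors
    return len(alive)


# Hypothetical capacities for four lookup servers.
capacities = [100, 110, 130, 150]

stable = surviving_servers(capacities, 400)    # share of 100 each: all cope
cascaded = surviving_servers(capacities, 420)  # share of 105: weakest fails first
print(stable, cascaded)
```

With these made-up numbers, a 5% increase in load is enough to knock out the weakest server, and each failure raises the share on the rest until none are left.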

A related disruption with Google App Engine caused problems with the Blobstore API and App Engine Version Deployment, which also lasted about four hours.

Among other measures, Google plans to improve the way it isolates regions within the storage service so future Google Cloud outages don't go global, according to the report.

Total cloud reliability remains an ambitious goal

All cloud service providers experience downtime. Yet, it is something Google Cloud, in particular, must address, given it lags far behind AWS and Azure in market share. It's a buyer's market, and customers will seek the most reliable option, although Google's speedy remediation efforts and transparency also are valuable to customers, said Holger Mueller, an analyst at Constellation Research in Cupertino, Calif.

However, some customers might be concerned that Google's report on the Cloud Storage outage refers to adding more isolation between regions.

"Region isolation is the key to uptime and resilience in the cloud. And if vendors fail on that, it's a concern," Mueller said. "The only way to find out if Google has addressed this successfully will be the next time it goes down."

Unplanned cloud service outages should happen less frequently over time through the use of advanced management, orchestration and load balancing techniques, said Stephen Elliot, an analyst at IDC. "These are the general parameters enterprise accounts expect," he said.
