Google SREs test the limits of infrastructure automation

Google SRE veterans spoke publicly about infrastructure automation work, its future and its limitations.

BROOKLYN, N.Y. -- Google SREs shared insights about the infrastructure automation work at the heart of the job, its future and some warnings about the IT and business problems it can't solve.

Engineers at Google coined the term 'site reliability engineer' (SRE) in 2003, literally wrote the book on the subject, and have defined best practices for the role ever since. IT ops pros throughout the industry now aspire to become SREs as enterprise DevOps takes hold, and the infrastructure automation that is the hallmark of Google's SREs is popular at mainstream companies.

However, Google reps warned in presentations at SREcon here last week that infrastructure automation is a crucial tool, but not a panacea.

"Instead of thinking about automation as a way to replace people, I like to think of it as a way to write programs to amplify people," said Max Luebbe, a Google SRE with a focus on data center initiation projects for Google Cloud Platform regions. "Human judgment has value -- it's not something you should look to eliminate, but look to how you can get more use and impact out of it."

This philosophy resonated with conference attendees, who discussed human and logistical challenges within SRE work as much as technical obstacles.

"Software engineering is not about coding," said a director of cloud architecture at a financial services firm who requested anonymity. "Identifying common components for reusability and not overengineering should be the focus."

Sisyphus and the 'curse of SRE autonomy'

One infrastructure automation tool shared by most Google SRE teams in multiple divisions of the company has almost none of the characteristics people might expect.

The tool, codenamed Sisyphus, is written in Python, which becomes very difficult to debug as the code base grows. It has no defined roadmap, list of product requirements or service-level objectives, and no product or project manager. Its design documentation was first written six months after its initial use.

"Sisyphus' success is baffling," said Richard Bondi, a Google SRE tech writer, in a presentation. "It can't be [because of] what a lot of people consider traditional best practice, because Sisyphus violates all of them."

Google SRE Nikolaus Rath spoke about infrastructure automation at SREcon.

Still, many tools have been created to replace Sisyphus at Google, and all have failed, Bondi said. They all required SRE teams to standardize on certain infrastructure automation workflows, which flew in the face of strong SRE autonomy and even tribalism within the company. Sisyphus, by contrast, adapts to multiple approaches to infrastructure automation, and that ultimately mattered to Google SREs much more than whether it's the most elegantly written tool, Bondi said.

"I've overheard a lot of conversations here about how to change SRE culture," he concluded. "But I would say, take Sisyphus as an example and a warning -- trying to change culture usually ends in tears. Sisyphus is an example of how to adapt to an existing culture in order to get people to change."

Google SREs tackle progressive automation

Another common mistake among rookie SREs is to focus too much on infrastructure automation as a project goal in and of itself, and to assume that automation is always the best way to improve IT efficiency, Luebbe added in his presentation.

For example, Luebbe's team last year used automation to spin up a new GCP region in Mumbai, India. The focus of that project was not to automate, but to meet the service-level objective to create the new region as quickly as possible for the business, he said. That meant the SRE team used nontechnical tools such as checklists to plan its work, and avoided automation where humans could do the work more quickly, such as with load-balancer configurations in GCP's case.

"You have to balance the effort to automate with the payoff, and how often it will be used," he said. "Automation itself is not the goal."
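One rough way to picture that tradeoff is a break-even estimate: how many times a task has to run before the hours spent building the automation are paid back. The sketch below is only an illustration, not anything Luebbe presented; the function names and the example hour figures are hypothetical.

```python
import math


def break_even_runs(build_hours, manual_hours_per_run, automated_hours_per_run=0.0):
    """Return how many runs it takes for automation to pay back its build cost.

    Returns None when automating saves no time per run, so it never pays off.
    All inputs are hypothetical estimates supplied by the caller.
    """
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return None
    return math.ceil(build_hours / saved_per_run)


def worth_automating(build_hours, manual_hours_per_run, expected_runs,
                     automated_hours_per_run=0.0):
    """True if the task is expected to run at least as often as the break-even point."""
    runs_needed = break_even_runs(build_hours, manual_hours_per_run,
                                  automated_hours_per_run)
    return runs_needed is not None and expected_runs >= runs_needed


if __name__ == "__main__":
    # Illustrative numbers only: 40 hours to build, 2 hours per manual run.
    print(break_even_runs(40, 2))        # runs needed to pay back the build effort
    print(worth_automating(40, 2, 5))    # a task expected to run only five times
    print(worth_automating(40, 2, 50))   # a task expected to run fifty times
```

Under these assumed figures, a task expected to run only a handful of times would be left to a human with a checklist, which is consistent with the choice Luebbe described for the Mumbai region build.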

SRE newbies also tend to take a Waterfall-style approach to infrastructure automation, where they strive to understand everything about the infrastructure and its requirements before they begin, but this is impractical and counterproductive, Luebbe said.

"Everything we've learned about software development, we throw out the window when we think about automation," he said. Instead, infrastructure automation tool development should also take an iterative approach. "It's better to write something that might replace 50% of the toil that's required, than to say [to stakeholders], 'Check back next quarter, we didn't get this done.'"

The 1k SRE project: How ops work at Google will evolve

About 40 Google SREs in the company's advertising division manage 400 services related to that segment of the business, but that will probably grow to 1,000 services in the next three to five years without a corresponding increase in staff, Google SRE Nikolaus Rath said in a presentation.

Automation only gets you so far when you face that kind of service-to-staff ratio, Rath said. Google SREs also must rethink how they approach the role to handle it, and hand off more aspects of application management to developers.

"If we were in charge of that many services, we would spend very little time with onboarding," he said. "There would be almost no overlap in expertise between SREs and developers anymore."

Instead, Google SREs' value will lie purely in expertise about distributed systems and the maintenance of the production platform where services run. Service-specific tasks that SREs handle today, from deploying applications to production to managing incidents, would cease to be part of the job description.

"What that means in practice is to us as the SRE team, all services look the same, just black boxes," Rath said. "We'd apply the same saying about treating servers as cattle and not pets to services -- our services are cattle."
