BROOKLYN, N.Y. -- IT ops pros who want to escape infrastructure toil face common conundrums as they begin the skills shift from sysadmin to site reliability engineer.
Site reliability engineer (SRE), a role for IT ops practitioners that sprang out of DevOps, answers the question of what system administrators do when, theoretically, software developers manage and troubleshoot their own applications in production. The SRE role calls for strategic thought about infrastructure management, proactive infrastructure automation to cut down on repetitive, inefficient work, and measured improvements in the overall reliability of the IT environment.
That was the ideal among attendees at SREcon here this week, but for many, it's much easier said than done: They're still too busy with tactical work, referred to as toil in SRE circles, to create the strategic improvements that might reduce the amount of tactical work they have to do.
"How do I reduce interrupts enough to do my interrupt reduction projects?" was the summary of an SRE at an athletic equipment retailer at the conference who requested anonymity. The SRE has been able to create self-service tools, such as Terraform infrastructure-as-code templates, that standardize infrastructure provisioning for developers, but time for more projects has been elusive, he said.
"We're building tools, but we're not working with error budgets, service-level objectives or chaos engineering," the SRE said, in reference to advanced site reliability engineer techniques under discussion here this week. "The biggest challenge is time."
Site reliability engineer skills require strategic thought in a tactical world
In some ways, the transitional struggle described by the SREcon attendee is unavoidable, according to experienced SREs who presented here this week.
"If you talk to experienced veterans in the field, they might get a faraway look in their eye and say, 'Oh, yes, I remember that,'" said Jaren Glover, infrastructure ghostwriter at Robinhood, a fintech startup in Palo Alto, Calif. "A bit of this pain is par for the course."
There are, unfortunately, no easy solutions to the problem, SREs said, though support from employers to hire new engineers and scale up site reliability engineer teams is crucial.
"It's also a matter of prioritization," said Arnaud Lawson, senior infrastructure software engineer at Squarespace, a website creation company in New York, in an interview after his SREcon presentation on service-level objectives. "Even if 80% of the team is dedicated to firefighting, the rest can tap into automation to get rid of tedious work."
At large enough companies, such as the professional networking site LinkedIn, SREs are sometimes repurposed from other teams to help teams that struggle to meet performance targets or are overwhelmed by pager alerts.
There also may be other IT personnel, such as network operations center or help desk professionals, to whom aspiring SREs can hand off runbooks for repetitive break/fix tasks, which frees SREs to automate those tasks and eliminate them in the future.
"Make sure your executive management knows you have a clearly defined problem and exit criteria for the project," said Todd Palino, senior SRE at LinkedIn, based in Sunnyvale, Calif.
An ounce of automation beats a pound of manual remediation
In some cases, IT ops pros who want to become site reliability engineers must push back against organizational inertia to demonstrate the value of proactive automation and interrupt reduction projects. That was the case for Tony Lykke, an SRE at Hudson River Trading, a financial services firm in New York, where IT operations staff received as many as 2,400 high-urgency pages per month in mid-2015.
"When paging is this bad, the problem isn't technical, it's cultural," Lykke said in an SREcon presentation about his work to reduce alert fatigue at his company. "It's also almost never a resourcing problem, but a prioritization problem."
Lykke introduced deployment templates for the company's Nagios IT monitoring tool that cut down on variation and errors in how it reported on thousands of systems, and added Python scripts that sorted through Nagios alerts before they were forwarded to PagerDuty for on-call IT staff. The scripts cut out unnecessary alerts, such as those that signaled high CPU utilization on the company's machines during routine data consolidation at the end of the U.S. trading day. By mid-2018, high-urgency pages for some teams had been reduced by 75%, and the overall number of pages stood at about 1,000 a month.
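The filtering logic Lykke described can be sketched as a small Python predicate that sits between the monitoring tool and the pager. The alert shape, service names, and batch window below are assumptions for illustration, not details from his talk: the idea is simply that an alert known to coincide with routine end-of-day load should never reach a pager.

```python
from datetime import datetime, time

# Assumed window for routine end-of-day data consolidation,
# after the U.S. trading day closes.
BATCH_WINDOW = (time(16, 0), time(18, 0))

def should_page(alert: dict, now: datetime) -> bool:
    """Return True if this alert should be forwarded to the on-call pager."""
    in_window = BATCH_WINDOW[0] <= now.time() <= BATCH_WINDOW[1]
    if alert["service"] == "cpu_load" and in_window:
        return False  # high CPU is expected during batch consolidation
    return alert["severity"] == "critical"  # drop low-urgency noise

alerts = [
    {"service": "cpu_load", "severity": "critical"},
    {"service": "disk_free", "severity": "critical"},
    {"service": "ntp_drift", "severity": "warning"},
]
now = datetime(2019, 3, 27, 16, 30)
paged = [a for a in alerts if should_page(a, now)]
# Only the disk alert pages: CPU falls in the batch window,
# and the NTP warning is below the urgency threshold.
```

A filter like this is deliberately conservative: it suppresses only alerts whose cause is known and scheduled, so genuine incidents during the same window still page.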
That technical work was the easy part, Lykke said. He also had to wean IT ops pros at his firm off the constant flow of pages, because some felt that a lack of communication meant something was wrong. He gave those who struggled with the reduction in alerts access to an unfiltered alert feed via Slack, and gradually, they stopped compulsively checking the alerts on their own.
Lykke advised others in similar situations to devise a game plan with specific steps, and to communicate early and often to management about what that plan will accomplish.
"Anything is quantifiable, including the cost of time spent responding to outages and performance degradations," he said. "Use the data you've collected to demonstrate the value of your plan, and be annoyingly persistent."