Site reliability engineering (SRE) is the application of scripting and automation to information technology (IT) operations tasks such as maintenance and support. The goal of SRE is to fix bugs quickly and eliminate manual work in rote tasks. In some IT departments that use site reliability engineer as a job title, the development team is split into developers and SREs. A site reliability engineer may work with the developers to design and engineer software, and work with IT operations team members to manage and support the software.
Site reliability engineering seeks to improve the reliability of currently operating software while minimizing the work involved in its upkeep. Automating as many tasks as possible frees operations experts to focus on strategic, higher-level work, such as planning a new deployment or creating a pipeline for faster product feedback.
A service level agreement (SLA) may be established for the SRE team that specifies a level of reliability required of the software -- for example, 99% uptime. This gives the SRE team a 1% allowance for errors, bugs or downtime. While this SLA structure may seem similar at first glance to that of any operations team, the primary difference lies in the role of SRE professionals: If the code written to automate ops tasks allows software services to operate at the agreed-upon level, SREs are free to continue developing more code to further improve the software stack. If, however, services and applications experience outages or lagging performance, the SREs are required to fix those issues immediately before tackling other projects.
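The 99% example can be made concrete with a little arithmetic. The following is a minimal sketch, not a standard tool: the function names, the 30-day measurement window and the rule that new work proceeds only while some allowance remains are all assumptions made for illustration.

```python
# Illustrative sketch: function names, the 30-day window and the
# "allowance remaining" rule are assumptions, not part of any standard.

def error_budget_minutes(uptime_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime target allows."""
    if not 0 < uptime_percent <= 100:
        raise ValueError("uptime_percent must be greater than 0 and at most 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def budget_remaining(uptime_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unused allowance; a negative value means the target was missed."""
    return error_budget_minutes(uptime_percent, window_days) - downtime_minutes


def free_for_new_work(uptime_percent: float, downtime_minutes: float,
                      window_days: int = 30) -> bool:
    """True while the service is still operating within the agreed level."""
    return budget_remaining(uptime_percent, downtime_minutes, window_days) > 0


if __name__ == "__main__":
    # 99% uptime over 30 days leaves 1% of 43,200 minutes.
    print(error_budget_minutes(99.0))      # 432.0
    print(free_for_new_work(99.0, 120.0))  # True: 312 minutes remain
    print(free_for_new_work(99.0, 500.0))  # False: the allowance is exhausted
```

Under these assumptions, a 99% target over 30 days allows 432 minutes of downtime. Once recorded downtime exceeds that figure, the team would turn to fixing the underlying issues before other projects.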
SRE and DevOps share the same core principles -- keep a diversely skilled team involved in each phase of software development from design through operation, automate any repetitive tasks, and use engineering tools in operations. While DevOps is a cultural framework that applies to positions both within and outside of IT, SRE exists specifically to support IT operations during software development and deployment in production. Business leaders are involved in DevOps, but not in SRE.
The history of SRE
Site reliability engineering relies on a management principle more than a century old: the people who create something should be equally responsible for ensuring its continued success. Google is credited with applying this principle to website management in 2003, when it tasked Benjamin Treynor Sloss, current vice president of engineering at Google, with leading a team of software engineers in the creation and maintenance of a production IT environment. The goal was to keep Google's websites as reliable, available and serviceable as possible. Treynor tasked this team with spending half of its time on operations tasks to gain a better understanding of software in production. For Treynor, SRE is the result of allowing a software engineer to structure the subsequent operations functions -- effectively creating a NoOps environment. Companies employing this method include Dropbox, Mozilla, Netflix and LinkedIn.
Site reliability engineering skills
Typical qualifications for SRE positions include a bachelor of science in computer science or a related discipline, or equivalent experience; an understanding of container technology, web services, databases and related infrastructures; expertise in using scripting languages; experience with platforms and operating systems such as VMware ESXi and Linux; and a thorough familiarity with networking. Experience in system administration and cloud computing is equally important to this position, as is the flexibility to work with both operations engineers and software developers, which requires strong interpersonal skills.