SRE software refines DevOps incident response for enterprise

SRE workflows often emerge organically within enterprise IT teams, but at one DevOps shop, software tools add measurability and automation to incident response and review.

A software product for site reliability engineers might seem like a contradiction, as IT pros still mostly rely on tribal knowledge to build the role.

But one enterprise DevOps shop has found that site reliability engineering software, which tracks incident management metrics and automates postmortem incident reviews, is worth the investment to fine-tune its SRE workflows.

Procore Technologies, a construction management software firm in Carpinteria, Calif., had previously used a collection of tools to hack together a process for incident management. Among these tools was Atlassian's Confluence collaboration software, which functioned as a document repository for incident response and postmortem information, said Stephen Westerman, senior director of engineering strategy.

"That was just a repository for those documents. You can't really report on important metrics, time to resolution and stuff like that [in Confluence]," Westerman said.
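To make the metric Westerman mentions concrete, here is a minimal sketch of how mean time to resolution could be computed once incident open and close times are stored as structured data rather than as free-form documents. The function name and the record layout are assumptions made for illustration; they do not represent Procore's data or any vendor's API.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved_at - opened_at) across resolved incidents.

    `incidents` is a list of (opened_at, resolved_at) tuples; this layout is
    hypothetical. Incidents that are still open (resolved_at is None) are skipped.
    """
    durations = [
        resolved - opened
        for opened, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None  # nothing resolved yet, so the metric is undefined
    return sum(durations, timedelta()) / len(durations)


# Made-up sample records for illustration only.
incidents = [
    (datetime(2019, 3, 1, 9, 0), datetime(2019, 3, 1, 10, 30)),   # 90 minutes
    (datetime(2019, 3, 5, 14, 0), datetime(2019, 3, 5, 14, 30)),  # 30 minutes
    (datetime(2019, 3, 9, 8, 0), None),                           # still open
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

A document repository can hold the same timestamps, but only as prose inside each write-up, which is why this kind of report is hard to produce from it.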

Atlassian offers Jira Ops for incident management automation and postmortem analysis, which it integrated with the Slack ChatOps tool and bolstered with the acquisition of Opsgenie in September 2018. A few months before that acquisition, however, Procore met a stealth startup called Blameless at an industry event and began to work with the company, which had an early access product similar to Jira Ops.

Procore did not hear about or consider Jira Ops, Westerman said, but Blameless' Slack integration was a big selling point for the product.

"It allows our engineers and response teams to manage an incident directly from a tool they're already spending their entire day in," he said. "The Blameless integration with Slack allows them to quickly get all the right people into a channel, manage roles, checklists and incident statuses, build timelines and create follow-up actions on the fly, without having to navigate through another system that they only have to use once in a while in a stressful situation."

Blameless' reliability insights dashboards present an overall view of factors that drive reliability, such as incidents, postmortems, action items, log events and change events.

SRE software improves postmortem follow-through

As incidents unfold at Procore, Blameless SRE software defines the tasks required of each SRE team role -- from the communications lead to the incident commander and the initial incident reporter -- and can tie in business stakeholders as needed. As team members enter the Slack channel, Blameless posts a summary, status and list of important events for them, so other team members don't have to catch up colleagues on the incident.

Incident postmortems are familiar territory for Procore SREs, but they previously required a team member to record and reconstruct events. The SRE software tool takes over that role: It generates follow-up to-do lists, records code snippets and graphs from IT monitoring tools, and creates a customized timeline of the incident based on responders' replies to its Slack messages.
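As a rough illustration of the record-keeping described above, the sketch below orders timestamped chat messages into a timeline and pulls out follow-up actions. It is not how Blameless works internally; the `Event` type, the function names and the "TODO:" prefix convention are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    at: datetime   # when the message was posted
    author: str    # who posted it
    text: str      # message body


def build_timeline(events):
    """Return the events as display lines in chronological order."""
    ordered = sorted(events, key=lambda e: e.at)
    return [f"{e.at:%H:%M} {e.author}: {e.text}" for e in ordered]


def follow_ups(events, marker="TODO:"):
    """Collect follow-up actions: messages that start with the assumed marker."""
    ordered = sorted(events, key=lambda e: e.at)
    return [e.text[len(marker):].strip() for e in ordered if e.text.startswith(marker)]


# Made-up messages, deliberately out of order.
events = [
    Event(datetime(2019, 3, 1, 9, 20), "commander", "Rolled back the deploy"),
    Event(datetime(2019, 3, 1, 9, 5), "reporter", "Error rate spiking on checkout"),
    Event(datetime(2019, 3, 1, 9, 40), "commander", "TODO: add alert on error rate"),
]

timeline = build_timeline(events)
actions = follow_ups(events)
for line in timeline:
    print(line)
```

Because the events are captured as they happen, the postmortem starts from an ordered record and a list of action items instead of from memory.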

The ready availability of incident management data makes Procore SREs more scrupulous about postmortems, Westerman said.

"It made the process a lot less painful," he said. "There's a lot less reconstruction to do. We're much more confident that when it comes time for the postmortem, we've actually recorded all the follow-up actions we need to do, and nothing has slipped through the cracks."

Blameless officially launched and made its product generally available on March 20, 2019, with $20 million in venture capital funding and 20 enterprise customers that include DigitalOcean and Home Depot. The company plans to add customizable dashboards to the product in April 2019, which should help small cross-functional DevOps teams at Procore improve mean time to resolution and incident response metrics, Westerman said.
