Atlassian cloud outage postmortem seeks to build back trust

A detailed post-incident analysis of Atlassian's cloud outage last month is prompting both the vendor and its customers to revise and expand their resiliency plans.

Atlassian issued a post-incident review of more than 7,500 words on the causes of its recent lengthy cloud outage, pledging to redouble resiliency measures, as its customers also planned expanded safeguards for their SaaS data.

The post-incident review, published April 29, said the cloud outage -- which affected all of Atlassian's core cloud services for up to two weeks for some customers -- began April 5, after integration of a previously standalone Insight app into the Jira Service Management service. This was consistent with a previously published analysis from the vendor, though the more recent review said the number of affected customers was 775, nearly double the company's original estimate of 400.

The post-incident review also revealed deeper technical details about Atlassian's cloud provisioning process and exactly what went wrong, including an improperly formed script that wasn't caught in peer review, and a resulting data deletion that included the contact info for affected customers, hobbling communications from the vendor as the incident went on. This also meant affected customers couldn't open support tickets to address the incident as they normally would.

The review acknowledged flaws in the vendor's public communication about the incident as well, which drew additional ire from affected and prospective customers as the incident continued for days before any detailed public statements were released by the company.

Finally, the post-incident review detailed the steps Atlassian will take to ensure that the incident isn't repeated. Those plans include a new process of "soft deletes" to avoid errors during similar operations in the future, more extensive disaster recovery exercises and better outreach to customers and the public.

"We will acknowledge incidents early, through multiple channels ... within hours," the post-incident review stated. "To better reach impacted customers, we will improve the backup of key contacts and retrofit support tooling to enable customers ... to make direct contact with our technical support team."

One customer affected by the outage, who requested anonymity, said the deletion of contact data had delayed communications from Atlassian for much of the two-week period before services were restored on April 18. Faster communications, including public statements, would have been preferable, the user said -- even if they were preliminary or incomplete.

Atlassian cloud outage timeline
Atlassian's post-incident review of its cloud outage included a detailed timeline of events.

"Being in the dark for the first few days was very frustrating and concerning for us and our customers," the customer said.

Atlassian declined to disclose details of its plans to compensate affected customers for the violation of its cloud service-level agreements, citing privacy concerns, but this user said his organization's compensation will include a free year of tech support service. Atlassian cloud reps had previously stated that affected customers would not be billed for the month of April.

Atlassian cloud outage renews attention on SaaS backup

Atlassian cloud customers have long had the option of backing up their own data in case of a lengthy outage like the one that occurred last month, but analyst research has shown weak adoption of these practices. In May 2021, an Enterprise Strategy Group survey of 381 IT professionals found that 35% of them rely solely on SaaS vendors to protect their data. This result was similar to a 2019 survey by ESG, which the firm's analysts attributed to confusion within many organizations about the roles and responsibilities related to SaaS data protection. ESG is a division of TechTarget.

The user affected by last month's outage said his organization now plans to make its own backups of Atlassian cloud data.

"We will remain an Atlassian cloud customer for now but will be making our own backups going forward as well," the user said. "There are export features already, and we [now] plan to use them extensively."

Another Atlassian cloud customer said the outage hasn't soured him on increasing use of the company's SaaS in the future, in keeping with its cloud migration push over the last year. The ongoing campaign to shift customers to Atlassian cloud services included announcements in late 2020 that Atlassian will discontinue Server on-premises editions of its products and raise prices on Data Center editions, and last year's licensing price breaks for enterprise customers that switch to Atlassian cloud.

"The opportunity cost -- the actual cost of maintaining two separate codebases with different lifecycle product -- is tremendous," said Chris Riley, senior manager of developer relations at marketing tech firm HubSpot, which uses Jira Software Cloud but was not affected by the outage. "And I don't think it benefits anyone."

Riley also personally uses Atlassian's Trello service, which he backs up himself using its data export features. He said he believes this outage may prompt more Atlassian cloud users to do the same but shouldn't have an effect on its overall cloud migration strategy.

"It certainly makes you step back a little bit and think about data integrity strategies, and I'm sure that that's what a lot of organizations are thinking about," he said. "But it doesn't reduce my confidence in Atlassian -- as a matter of fact, it increases my confidence to some degree, because going through something like this will undoubtedly make you add rigor in operations around incident response and application development, which should in turn [improve] the quality of the products."

Atlassian has further programs in the works to allow customers to back up their own data if they wish, an Atlassian spokesperson said in an email to TechTarget, but "the majority of our R&D investment is to ensure that our own cloud backup and recovery processes are world-class and robust, negating the need for customer-run backup processes."

Analyst: Atlassian should expect cloud migration pushback

One industry expert said Atlassian should still expect increased resistance to its cloud migration plans because of the outage, and potentially adjust the timeline for it accordingly. Atlassian's post began with an open letter from the company's co-founders and co-CEOs that emphasized the reliability of the Atlassian cloud in previous years and gave no indication the vendor will change its cloud plans, but that approach was ill-advised, said Will McKeon-White, an analyst at Forrester Research.

"They probably will point to the previous years without issue as an example, but that is fundamentally beside the point here, because there was a seismic problem," he said. "They were planning to use this year, judging from the Team 22 [conference] materials, to talk about how much more stable the cloud platform is, and how much better performing, [which] would have allowed them to commit to [the original time frame] for end of life for the Atlassian Server offering. Now, I think they should at least postpone it a year."

It's highly unlikely that Atlassian will experience the same magnitude of outage again, McKeon-White acknowledged, and he praised the fine-grained detail it included in the post-incident review. However, the open letter portion that emphasized past cloud reliability seemed addressed more to investors than concerned customers, he said.

Most IT pros will have empathy for the incident responders in this case, McKeon-White said, and Atlassian's cloud will retain the vast majority of its existing customers, who weren't affected by the outage. But at least some customers and prospects will cast their lot with a competitor as a result.

"Even in certain circumstances where the [cloud] failure is fundamentally on [the customer], people still do start looking for the exits," he said. "There's more trust-building that Atlassian will have to undertake -- they did some of that [with the post-incident review], and they need to continue."

Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached at [email protected] or on Twitter @PariseauTT.
