Atlassian cloud outage could take weeks to resolve
As Atlassian's cloud outage neared the one-week mark, the company informed affected customers that it could take up to another two weeks to fully recover.
Updated April 12, 2022
An Atlassian cloud outage that has already persisted for nearly a week for affected customers could take up to another two weeks to fully resolve, according to communications from the company.
Arseny Tseytlin, head of product communications at Atlassian, confirmed reports that a full restoration of services may take up to another two weeks for the hundreds of customers still affected by the outage, which began April 5. Only 35% of some 400 customers that lost access to Atlassian cloud services including Jira, Jira Service Desk, Confluence and Opsgenie have had services restored so far.
"As part of scheduled maintenance on selected cloud products, our team ran a script to delete legacy data," according to the company's updated statement on the outage, sent by Tseytlin in an email to SearchITOperations on Monday night:
This data was from a deprecated service that had been moved into the core datastore of our products. Instead of deleting the legacy data, the script erroneously deleted sites, and all associated products for those sites including connected products, users and third-party applications. We maintain extensive backup and recovery systems, and there has been no data loss for customers that have been restored to date. This incident was not the result of a cyberattack and there has been no unauthorized access to customer data.
We know this outage is unacceptable and we are fully committed to resolving this. Our global engineering teams are working around the clock to achieve full and safe restoration for our approximately 400 impacted customers and they are continuing to make progress on this incident. At this time, we have rebuilt functionality for over 35% of the users who are impacted by the service outage.
We are doing everything in our power to restore service as soon as possible, but due to the complexity of the rebuild process for each customer site, we were unable to confirm a more firm ETA until now. While we are beginning to bring some customers back online, we estimate the rebuilding effort to last for up to 2 more weeks.
An outage this long, with no precise recovery date, is unusual among SaaS providers. While customers whose services have been restored have not lost data from before the incident, some online commentators argued that nearly a week without access to vital work systems constituted its own kind of data loss.
Fascinating to watch this outage continue. Even if Atlassian is able to recover all customer data, I think being unable to access your data for 4+ days constitutes data loss. -- Brent Checketts (@bnchecketts), April 10, 2022
The affected customers represent a small percentage of Atlassian's overall customer base, but the damage to the company's reputation is already done, said Larry Carvalho, independent analyst at RobustCloud.
"A well-architected SaaS solution needs to be able to recover much faster -- human error is excusable, but customers lose faith when a vendor cannot recover quickly," Carvalho said. "Also, the risk of this happening again will always be on the top of the customer's mind."
Outage overshadows product news
Updated April 8, 2022
Atlassian launched three new cloud-based products at its Team '22 conference last week, but the cloud outage hobbled its core services and distracted industry watchers from the new releases.
Atlassian cloud products including Jira Software issue tracking, Jira Service Management ITSM, Jira Work Management, Confluence documentation, Opsgenie incident response and the Access single sign-on tool became inaccessible to some users April 5, and efforts to restore them continued as of midday April 7.
"While conducting a routine maintenance script, a small number of sites were unintentionally disabled, which resulted in them being unable to access their products and data," the company said in an April 7 statement issued through a spokesperson. "We know our customers rely on our products to get their work done, and we are sorry for the disruption this has caused. We are working 24/7 to restore products to full availability."
In a later statement, the company said the incident was not the result of a cyberattack and that there has been no unauthorized access to customer data. The company added that, while hundreds of engineers are working to recover the sites, it is also adding recovery automation so that it can restore sites faster in the future.
"Due to the unique configuration of each site as well as the care we are taking to ensure safe data restoration, we estimate that full resolution could take days, though we expect customers to begin seeing restoration on a product-by-product basis sooner," the second statement said.
The company will publish a post-mortem after the incident is resolved, according to the second statement.
Another update from Atlassian's official support account on Twitter April 7 appeared to indicate that some customers had suffered data loss.
We expect most site recoveries to occur with minimal or no data loss. -- Ask Atlassian (@AskAtlassian), April 7, 2022
In response to an inquiry regarding the tweet, Tseytlin stated on April 8, "At this point in time, we believe that any potential data loss will be minimal to none. We are working hard to resolve the incident and get customers back online."
He added that "a small number of Atlassian customers are impacted by the incident: around 400, which is approximately 0.18% of our total customer base of over 226,000 customers."
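The percentage Tseytlin cited can be checked with a quick calculation. This is a minimal sketch using only the two figures quoted above; the function name is illustrative, and because the customer base is described as "over 226,000," the true share would be slightly lower.

```python
def affected_share_percent(affected: int, total: int) -> float:
    """Return affected customers as a percentage of the total customer base."""
    return affected / total * 100

# Figures as quoted by Atlassian: about 400 affected of over 226,000 customers.
share = affected_share_percent(400, 226_000)
print(f"{share:.2f}%")  # prints "0.18%"
```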
Still, loss of data is among the worst-case scenarios for a cloud outage, concerned customers said.
"This is extremely concerning to us, as our mission-critical institutional knowledge lives in Confluence at this point," wrote one business customer in an email to SearchITOperations. The customer, who requested anonymity, added, "This message runs counter to the 'maintenance script has disabled a small number of sites' message we've been getting over and over again. This would also explain why recovery has taken days with so many engineers 'working 24/7.'"
Atlassian outage impact remains uncertain
It's too soon to tell how the outage will affect Atlassian's business, but industry observers agreed the timing was exceptionally poor given the company's push toward its cloud-based services during the past 18 months. Over the last year, Atlassian has been especially frank in its public statements about this shift to cloud tools and about added incentives for users to migrate away from its on-premises tools, where it has discontinued its midmarket Server editions and raised enterprise licensing prices.
"Many Atlassian products are from acquisitions, and moving to a subscription model while integrating each product is not easy," Carvalho said. "Multi-day downtime does not do well to convince customers to make a move."
Other experts, however, preferred to wait and see how long the outage lasts and how it's resolved before predicting its ultimate impact.
"It depends on how soon they fix it, how major the problem was and what promises they make going forward," said Andy Thurai, vice president and principal analyst at Constellation Research. "Any cloud, including AWS, will go through this. It all depends on how they handle it."
Atlassian had a poor reputation for reliability in its cloud services during an initial self-managed foray into SaaS years ago, but a move to microservices on AWS in 2019 and the introduction of enterprise security features and service-level agreements (SLAs) did much to reassure early skeptics. The company has since had a good overall track record of cloud availability, and announced a 99.95% uptime SLA for its Enterprise cloud edition this week at the Team '22 conference, along with an early access program for scaling its cloud instances to support up to 50,000 users.
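For a sense of scale, a 99.95% uptime target can be converted into a downtime allowance. The sketch below is illustrative only: it assumes the target is measured over a 30-day period, which the article does not specify, and it does not imply that the affected customers were covered by the Enterprise SLA. The function and variable names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60

def downtime_budget_minutes(uptime_percent: float, period_days: int) -> float:
    """Minutes of downtime permitted by an uptime target over a period."""
    return (1 - uptime_percent / 100) * period_days * MINUTES_PER_DAY

# Assumption: a 30-day measurement period (not stated in the article).
monthly_budget = downtime_budget_minutes(99.95, 30)
# A one-week outage, roughly the duration reported so far.
outage_minutes = 7 * MINUTES_PER_DAY

print(f"{monthly_budget:.1f} minutes allowed per 30 days")
print(f"{outage_minutes} minutes in a one-week outage")
```

Under that assumption, the allowance works out to about 21.6 minutes per 30 days, compared with 10,080 minutes in a week-long outage.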
At least initially, the outage did little to change Atlassian users' existing views on its cloud products, whether they were positive or negative.
"You expect outages from time to time," said Chris Riley, senior manager of developer relations at marketing tech firm HubSpot, which uses Jira Software Cloud but was not affected by this week's outage. "But I actually can't recall a single outage [with Atlassian]."
Other IT pros said this week's downtime reinforced their reluctance to use Atlassian cloud for production apps.
"I typically only use Atlassian Cloud products for testing," said Rodney Nissen, senior Atlassian admin at Activision Blizzard, which uses Jira Data Center on premises. "The thing to remember with any cloud offering is that these systems aren't magical; they are just someone else's computer. They are subject to the same errors and problems that could plague any other system."
Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached on Twitter @PariseauTT.