Date: 2025-03-27

SANTA CLARA, Calif. – The BlueSky social media platform and its bare-metal backend had to cope with a mass exodus from X.com, while Squarespace had 10 months to absorb Google Domains -- and with 90 days to go, it was well behind schedule.

Those were the settings and stakes for the incident and project postmortems engineers presented during SREcon Wednesday, along with takeaways for their fellow IT pros. The bottom line: Planning is important, presenters said, but it's crucial to have a collaborative team that's willing to be flexible, adapt quickly and make sometimes risky changes on the fly.

"Oftentimes during an incident, if you make no decision, it can be worse than making the wrong decision," said Jaz Volpert, backend Go engineer at BlueSky Social PBC. "Decisiveness is very important, and you must weigh the costs of doing nothing versus the cost of trying the thing that you can try."

BlueSky survives Election Day and Backhoe Day

BlueSky has seen steep growth since Volpert joined the company about two years ago. Then, the fledgling, decentralized social network had about 100,000 users. Now, it has 33 million.

Much of that growth was spurred by changes to X.com, formerly Twitter, after its purchase by billionaire Elon Musk in 2022 since BlueSky offers a similar microblogging interface. When X announced a change to its terms of service in October 2024, mandating that user data help train its AI models, BlueSky's hyperscale growth shifted into its highest gear, according to Volpert. In less than a month between October 2024 and November 2024, in the immediate wake of the U.S. election, BlueSky went from a daily peak of 5,000 requests per second to 50,000.

"I was at the bar one night, and I was watching a report from a Golden State Warriors game, and I saw a BlueSky post on TV, and I said, 'Huh, that's strange,'" Volpert recalled. "A few days later, we saw BlueSky rocket to the top of the free apps [list] in every app store in the U.S., Canada, the U.K. … lots of major markets."

BlueSky still operates with a small team of 21 full-time employees. Volpert said that in that month of explosive growth, including 11 straight "days of hell" following the U.S. election, about half a dozen people spent more than 16 hours per day in a situation room.

Compounding matters was an event that one of Volpert's slides called "Great American Backhoe Day," in which a fiber cable was cut at one of BlueSky's data centers, affecting 50% of its users. This was the point where the cost of indecision was greatest, according to Volpert. An initial attempt to fail over traffic to another data center thrashed its databases to the point where service was degraded for all of the platform's users and had to be rolled back.

"What have we learned from that? Well, it's best to roll with punches," they said. "We had tried this failover before at smaller scales. It worked fine. We had never tried it at this scale before, but there's a first time for everything, so we learned the hard way."

The company also manages its backend hardware on bare-metal servers, which meant it had to cope with that month's spike in demand with fixed server capacity. BlueSky had previously run in the cloud, but its bandwidth-heavy, compute-heavy service made the cloud dramatically more expensive, according to Volpert. And some of the most significant failures the BlueSky team encountered had to do with systems and software the company didn't control.

Other issues BlueSky engineers encountered during their 11-day situation room marathon arose out of stress and exhaustion, such as misconfigurations in proxy server deployments.

"From that [we learned]: Automate all future proxy deploys with human approval. Make sure that's automated. Make sure at least two people take a look at it," Volpert said. "I'm sure we're going to break this rule at some point in the future when something crazy happens again, but we can at least try to practice it so that it's easy to do under pressure."

The small team also performed admirably when it came to dividing up work into pairs and trios to try multiple solutions in parallel, which allowed for quick responses, they said.
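
For readers who want a concrete picture of the proxy deploy rule Volpert described (automation gated by human approval, with at least two people looking at each change), the following is a minimal sketch in Python. It is not BlueSky's tooling: the `ProxyDeploy` class, the `run_deploy` function and the reading of "two people" as two distinct sign-offs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumption: "at least two people take a look" is modeled as two distinct sign-offs.
REQUIRED_APPROVALS = 2


@dataclass
class ProxyDeploy:
    """A pending proxy configuration change (hypothetical model, not a real API)."""
    author: str
    config_version: str
    approvers: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # A set ignores repeat sign-offs from the same person.
        self.approvers.add(reviewer)

    def can_ship(self) -> bool:
        return len(self.approvers) >= REQUIRED_APPROVALS


def run_deploy(deploy: ProxyDeploy) -> str:
    """Refuse to proceed until enough distinct people have approved."""
    if not deploy.can_ship():
        missing = REQUIRED_APPROVALS - len(deploy.approvers)
        raise PermissionError(f"{missing} more approval(s) needed")
    # A real pipeline would push the configuration here; the sketch only reports it.
    return f"deployed {deploy.config_version}"
```

Putting the check inside the automated path, rather than relying on people to remember it, is what would keep such a rule easy to follow under pressure.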