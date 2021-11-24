Border Gateway Protocol, or BGP, was an early suspect in Facebook's recent global outage, one of the largest and worst in the social media giant's history.

On Oct. 4, 2021, Facebook -- now Meta -- and its subsidiaries, including Messenger, Instagram and WhatsApp, disappeared from the internet and remained unavailable for roughly six hours. Public speculation quickly arose -- much of it on Twitter, where social media users flocked in Facebook's absence -- that the outage might have stemmed from a BGP error.


But, according to Facebook, BGP and DNS issues were just symptoms of the actual problem: a misconfiguration that disconnected the company's backbone routers. In other words, to err is human.

"The root cause of this was fingers," said Terry Slattery, principal architect at consulting firm NetCraftsmen.

How did the 2021 Facebook outage happen?

During routine maintenance on the backbone network, an engineer trying to assess capacity fat-fingered a command and triggered a cascade of technical problems, according to Facebook Vice President of Engineering and Infrastructure Santosh Janardhan. He explained in a blog post that an internal auditing tool should have stopped the misconfiguration, but a software bug caused the control to fail. The faulty command executed across the backbone routers and disconnected Facebook's data centers. That, in turn, triggered the secondary DNS and BGP problems.

When the company's DNS servers couldn't communicate with the data centers, they automatically withdrew their BGP route advertisements, essentially removing themselves from the virtual map of the internet. Suddenly, it was as if Facebook, Instagram and WhatsApp didn't exist.

To make matters worse, Facebook's internal operations tools rely on the company's own infrastructure and DNS to function. Employees, therefore, couldn't access the systems they typically use to work and communicate, and the networking staff couldn't investigate or resolve the outage remotely via their usual methods. The New York Times' Ryan Mac reported that Facebook's internal tools were also unavailable during the outage.

Andrew Lerner, analyst at Gartner, compared this unfortunate sequence of events to sawing off a tree branch while standing on it. Jonathan Zittrain, professor at the Harvard John A. Paulson School of Engineering and Applied Sciences, tweeted that Facebook "basically locked its keys in the car."

Engineers ultimately had to get inside Facebook's data centers to manually debug and reset routers and servers. But an employee told Sheera Frenkel, a reporter at The New York Times, that workers couldn't gain physical access to company facilities because the electronic badge system failed. Even under normal circumstances, Facebook's data centers and network hardware are heavily fortified, according to Janardhan. So, getting the right people on-site took time.
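
The automatic withdrawal described above can be illustrated with a small simulation. This is a minimal sketch and not Facebook's actual implementation: the `DnsNode` class, the `evaluate` method, the node names and the prefixes (taken from the address ranges reserved for documentation) are all assumptions made for illustration. It shows how a sensible per-server safety rule, "stop advertising when the data centers are unreachable," can remove every route at once when the backbone fails everywhere.

```python
from dataclasses import dataclass


@dataclass
class DnsNode:
    """Hypothetical DNS server that advertises its prefix only while healthy."""
    name: str
    prefix: str
    advertised: bool = True

    def evaluate(self, backbone_reachable: bool) -> str:
        """Return the BGP action this node takes after one health check."""
        if backbone_reachable and not self.advertised:
            self.advertised = True
            return "announce"
        if not backbone_reachable and self.advertised:
            self.advertised = False
            return "withdraw"
        return "no-change"


def reachable_prefixes(nodes):
    """Prefixes the rest of the internet can still route to."""
    return sorted({n.prefix for n in nodes if n.advertised})


# Illustrative nodes; the prefixes are documentation ranges, not real ones.
nodes = [
    DnsNode("dns-a", "192.0.2.0/24"),
    DnsNode("dns-b", "198.51.100.0/24"),
]

# The backbone goes down for every node at the same time.
actions = [n.evaluate(backbone_reachable=False) for n in nodes]
print(actions)                    # ['withdraw', 'withdraw']
print(reachable_prefixes(nodes))  # []
```

Each node behaves correctly in isolation. The problem in this toy model, as in the outage, is that the failure is shared, so every node reaches the same conclusion at the same moment and nothing is left on the map.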

But first, network hygiene

Lerner said he has received several calls from network leaders asking what they can learn from the 2021 Facebook outage. But he cautioned that, before worrying about this kind of cascading super-outage, companies should first make sure they are practicing basic network hygiene.

"I'll be honest: Most organizations are not doing the foundational stuff," Lerner said.

He said those foundational tasks should include the following:

tracking and backing up network device configurations in a central database;

instituting a configuration rollback plan;

automating network change validation; and

performing frequent network testing.

Once they've covered the fundamentals, organizations can then turn their attention to headline-grabbing cautionary tales, he said.
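
To make the first three of those tasks concrete, here is a minimal sketch of how configuration backup, pre-change validation and rollback might fit together. Everything in it is an assumption for illustration: the `ConfigStore` class, the `validate_change` rule, the `apply_change` helper and the dictionary-shaped configs are hypothetical and do not come from Lerner or from any particular vendor's tooling.

```python
import copy


class ConfigStore:
    """Hypothetical central store that keeps every saved version of a device config."""

    def __init__(self):
        self._history = {}

    def backup(self, device, config):
        self._history.setdefault(device, []).append(copy.deepcopy(config))

    def last_good(self, device):
        return copy.deepcopy(self._history[device][-1])


def validate_change(proposed):
    """Illustrative pre-change rule: never take down every backbone link at once."""
    errors = []
    links = proposed.get("backbone_links", {})
    if links and not any(links.values()):
        errors.append("change would disable all backbone links")
    return errors


def apply_change(store, device, current, proposed, post_check):
    """Validate, back up, apply, then roll back if the post-change check fails."""
    errors = validate_change(proposed)
    if errors:
        return current, "rejected: " + "; ".join(errors)
    store.backup(device, current)  # save the known-good config first
    if not post_check(proposed):
        return store.last_good(device), "rolled back"
    return proposed, "applied"


store = ConfigStore()
running = {"backbone_links": {"link1": True, "link2": True}}
risky = {"backbone_links": {"link1": False, "link2": False}}

config, status = apply_change(store, "router-1", running, risky, post_check=lambda c: True)
print(status)  # rejected: change would disable all backbone links
```

In this sketch, validation runs before anything is touched, and the backup is taken before the change is applied, so a failed post-change check always has a known-good version to return to. A real deployment would push configs to devices and run reachability tests in place of the `post_check` callback.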