Editor's note: This story was updated Jan. 23, 2023.
Corrupted data in the Federal Aviation Administration's technology infrastructure should prompt enterprise IT teams to examine and test the high availability capabilities and recovery timetables of their own stacks.
Thousands of flights across the U.S. were grounded on Jan. 11 after the Notice to Air Missions (NOTAM) database went down for several hours that morning. The NOTAM system warns pilots of hazards, runway closures or military exercises in progress.
In a prepared statement posted to its website on Jan. 19, the FAA wrote that a preliminary investigation identified user error as a potential cause of the outage: A contract worker "unintentionally deleted files while working to correct synchronization between the live primary database and the backup database."
The FAA also said it found no evidence of malicious intent or a cyber attack, and that it has "taken steps to make the NOTAM system more resilient."
Database failures due to user error, unforeseen corruptions or misconfigurations are not uncommon, according to data storage and recovery experts. To avoid such catastrophes during business operations, companies should implement high availability protocols for systems in production and backup.
High availability is the capability of a system to operate continuously without failing for a set period of time, maintaining agreed-upon operational performance for an enterprise. High availability standards are critical in industries where lives are at risk, such as the medical and transportation industries.
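To make the stakes of those agreed-upon availability targets concrete, here is a minimal illustrative sketch (not tied to any FAA system) that translates an availability percentage into the yearly downtime budget it actually permits:

```python
# Illustrative sketch: convert an availability target ("nines") into the
# downtime budget it allows per year. The targets shown are common SLA
# levels, not figures from the FAA or this article.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {downtime_budget_minutes(pct):.1f} min/year")
```

At "three nines" (99.9%), a system may be down for only about 525 minutes a year; a multi-hour outage like NOTAM's would consume most of that budget in one incident.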
The public-facing nature of this failure and its associated business impact shows how important testing and successful recovery verification are, according to Steve McDowell, senior analyst at Moor Insights & Strategy.
"These kinds of corruptions happen every day for applications," McDowell said. "It's too early to point fingers, [but] it surprises me they'd have corruption within backups."
Outage and 'ground stop'
Pilots must consult NOTAM before takeoff, according to FAA regulations. NOTAM began experiencing issues on the night of Tuesday, Jan. 10, eventually leading to a "ground stop" of all U.S. flights early Wednesday morning before service was restored around 9 a.m. EST.
On Jan. 11, the FAA stated the outage was due to "a damaged database file," but didn't offer specifics other than to say the administration is "working diligently to further pinpoint the causes of this issue and take all needed steps to prevent this kind of disruption from happening again."
Later that day, U.S. Secretary of Transportation Pete Buttigieg provided more information, explaining that issues arose between NOTAM's production database files and their backups, and echoed the FAA's promise to correct the issue going forward.
"When there is a problem with a government system, we're going to own it, we're going to find it, and we're going to fix it," he said during a press briefing.
Both the FAA and Buttigieg said at that time the issue did not appear to be due to a cyber attack, but they are working with the FBI to be sure. In its Jan. 19 statement, the FAA said it still has found no evidence of a cyber attack but that it "continues to investigate the circumstances surrounding the outage."
Buttigieg said the NOTAM outage will serve as a "data point" in upcoming congressional hearings on the need for additional funds to maintain the FAA's IT systems, as well as in a review of system policies.
"We need to understand if this reflects a systemic issue and what would be required so there isn't a single point of failure here," he said. "There need to be redundancies [and] layers and layers of protection here."
Despite calling NOTAM "many years old," Buttigieg said the system is standardized across the globe with other nations' systems.
The downtime of only a few hours indicated the FAA's technology team was able to rebuild under pressure, according to Christophe Bertrand, practice director at TechTarget's Enterprise Strategy Group.
"The [FAA] IT team had a lot of heat on them, but they recovered," Bertrand said. "They had to make a decision given the nature of the system, ... but they recovered pretty quickly, all things considered."
Earlier media reports indicated issues with the backups, which likely led to a slower-than-expected recovery and the need for a full reboot, according to Bertrand. Even if the classic trick of unplugging the system and plugging it back in is a worst-case scenario for the enterprise, testing recovery times is an important part of digital hygiene.
"They know what the RPO [recovery point objective] is on their database," he said. "They should be testing this on a regular basis."
Logical corruption of databases is a common cause of such issues, but enterprise IT teams can build high availability clusters that duplicate the database and maintain one-to-one copies of the IT stack to test failure points, Bertrand said.
Officials investigating the downtime will likely need to look into high availability processes, how testing can occur without interrupting operations and how to verify backup integrity, he added.
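One basic form of the backup integrity verification mentioned above is comparing cryptographic digests of the primary file and its backup copy. This is a minimal sketch of that technique, with hypothetical function names, not a description of how NOTAM is actually verified:

```python
# Illustrative sketch: verify backup integrity by hashing the primary file
# and its backup and comparing digests. A mismatch flags corruption or
# drift between the live copy and the backup.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def backup_matches(primary: Path, backup: Path) -> bool:
    """True if the backup is a byte-for-byte copy of the primary file."""
    return sha256_of(primary) == sha256_of(backup)
```

A checksum pass like this catches silent divergence between primary and backup before a failover depends on it, which is exactly the gap a recovery drill is meant to expose.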
"Once we understand what happened, it will be more evident if they were doing something wrong," Bertrand said. "It could be the nature of the problem is there was no other option."
Even if the FAA is following best practices, technological limitations due to outdated hardware could have also kept the administration from implementing modern standards and recovery times, McDowell said.
"What [the FAA] didn't say is if they recovered from a backup," he said. "This is an area where older architecture may not have these capabilities. We've morphed our [recovery] practices, but it doesn't seem like the FAA is there."
Tim McCarthy is a journalist living on the North Shore of Massachusetts. He covers cloud and data storage news.