Dozens of cancelled flights. Hundreds of delayed trips. Thousands of disgruntled passengers. One faulty router.
In the hours and days after United Airlines' high-profile network outage on July 8, a flurry of news reports, blog posts and tweets suggested that software-defined networking (SDN) could have kept the company online and its customers in the air.
Much of the speculation, unsurprisingly, came from SDN vendor representatives, who pointed to the vast complexity of traditional networks and their dependence on manually configured routers as possible culprits. Key among their arguments: Centralized control and automation could have prevented the failure -- and they warned that the industry hasn't seen the last of these network outages.
"Any complex task that has wide ranging interdependencies among many other complex systems (pretty much any network) is prone to errors," Brandon Williams, CEO of SDN orchestration vendor Cplane Networks Inc., based in Sunnyvale, Calif., wrote in a post on LinkedIn. "Who has the next United Airlines time-bomb in their network? Probably just about everyone."
SDN has benefits, but is not a panacea
Kumar Ramachandran, CEO at SD-WAN start-up CloudGenix Inc. in Santa Clara, Calif. -- while emphasizing that he does not have personal knowledge of the United outage -- said that SDN's model of centralized control allows for more accurate production testing and better change control, which can mitigate the risk of related network outages. He acknowledged, however, that the approach is not a cure-all.
"The underlying challenge with the complexity of networking and the routing models we have today is [that] when you have a fragmented control and management model, it really becomes more difficult to have operational procedures that will allow you to make changes to your network," Ramachandran said. "SDN is certainly not a panacea for all evil. I think what it really offers, though, is a big leap forward in terms of how IT can manage the network."
That's not to say that a network change caused United's failure. The airline did not release specifics, saying only that a router problem degraded connectivity for critical applications. In a brief emailed statement to SearchSDN, Brandon Mangold, enterprise network architect at United Airlines, described as "complete nonsense" the notion that SDN could have prevented the outage.
"Anyone who makes such a claim has no idea what they are talking about," he said.
Automation can only do so much
Some independent experts, such as networking consultant Ivan Pepelnjak, said that much of the networking industry needs a software-defined reality check. He pointed out that an SDN product can only do so much to stop human error -- such as when an administrator assigns a VPN to the wrong port on a router.
"You cannot cure stupidity," he said. "There will always be operator errors. Even if all the products work perfectly, we just cannot automate things to the level where an operator cannot make mistakes."
United Airlines is far from the only major organization to suffer a headline-grabbing technology failure. The Wall Street Journal and the New York Stock Exchange (NYSE) also experienced significant outages on July 8; the newspaper blamed its downtime on technical problems, while the NYSE attributed its failure to a software glitch.
Nick Lippis, co-chairman of the Open Networking User Group, said that router problems and associated network failures will ultimately bubble up more frequently, as traditional networks struggle to keep up with the ever-growing deluge of data from the cloud and Internet of Things. He said he believes SDN's versatility and automation could help mitigate problems of network size and scale.
"[The] network management model … hasn't changed since the mid-'90s," Lippis said. "I think United was a nice wake-up call -- the New York Stock Exchange, another nice wake-up call -- for how do we now include networking in the initial configuration and ongoing management automation schemes that are now being developed for distributed computing.
"I'm not sure if an SDN controller could have prevented what caused the United incident, but [administrators] might have seen other things quicker. They might have seen issues with an application … or a really weird traffic pattern starting to emerge before that router went down. They might have seen storage starting to act up," Lippis said. "So, they would have a much more holistic view."
SDN's impact and potential overly romanticized?
Pepelnjak is more skeptical.
"I think [the potential is] still overly romanticized," he said. "If different systems were integrated [more tightly], then at least there would be no manual entry or copy-paste mistakes … You can prevent a lot of miscommunication. But for the major outages, I don't think that SDN will bring us major improvements."
Centralizing control could also create a single critical point of failure, so that one controller fault results in a network outage on a larger scale.
"The so-called blast radius is a lot larger when using SDN and fabrics," said Daniel Dib, senior network architect at NetSafe. "Imagine that the controller starts sending out malformed updates, leading to that all the devices under the controller may not be able to do forwarding … With a fabric, you essentially have a single failure domain."
Dib said that he is SDN-agnostic, viewing the technology as a useful tool that may indeed prevent some network outages. But what's more important, Dib said, is that enterprises improve organizational processes and human workflow.
"Any SDN solution will fail if the organization fails to change its workflow," Dib said. "Most failures are, after all, related to errors made by humans. SDN and automation [are] not strictly necessary to achieve a more robust network."
The steps necessary to properly implement SDN present another impediment, Pepelnjak said.
"People deploying these products are just too busy to deploy them properly," he said. "So, for example, VMware's vCenter has … dozens of access rules you can set up -- who can configure what. In many environments, there is no access control, because people who are managing the environment are too busy to put proper access controls in place, which means that anyone with fat fingers can do a lot of damage."