With the sudden upheaval to so many lives, online usage has changed a lot. There are reports of Netflix and YouTube downgrading quality in Europe, Microsoft Teams seeing a 775% increase in usage in Italy, and broadband and mobile providers lifting caps, along with lots of anecdotes of services and applications not working as normal. We decided to investigate whether the internet can cope: what’s the capacity, and where might it break?
Whilst writing this, I realized that my own understanding of basic networking and load on the internet was (ahem) flaky. So, I took myself off and talked to some networking folks at some of the biggest vendors in cloud, streamed media, and content to learn about the new reality of many, many people working from home due to COVID-19.
The big picture
Nobody I talked to seemed to have much concern that there would be a significant worldwide problem with the fundamental bandwidth capacity of the network, but they highlighted a few places where minor issues might be seen and bottlenecks that might show themselves.
Fundamentally, the internet was built to work around issues. It evolved so that everyone is essentially connected to everyone else in a web, so you can go around a particular issue via a different path. However, these days that costs money: the routers and switches needed to divert traffic sit in datacenters and use electricity that someone has to pay for. As such, complex business agreements have arisen between organizations regarding whose traffic they'll take and what route is prioritized.
To understand the internet, we need, to some extent, to understand not only the physical structure but also the business relationships between those who own the infrastructure.
Globally, this means physical connections between international networks are made at mega-hubs called Internet Exchange Points (IXPs). These are big data centers where ISPs (Internet Service Providers) and Content Delivery Networks (CDNs, and more on those later) exchange traffic between their independent networks. It's worth looking at a map of these to get an idea of the volume of traffic they handle and where; for example, eight centers handle the traffic from all businesses and a population of 70 million in the UK. Major IXPs include London, Frankfurt, Amsterdam, New York, and U.S. West Coast sites. The hardware in these centers is surprisingly simple, essentially a router plus backhaul (a connection back into the network) on 10GB port exchangers.
The backbone is essentially the set of mega-network cables that connect networks, the motorways of internet traffic. These are huge physical cables, undersea cabling or cross-country fiber, usually laid by telcos such as AT&T or BT (British Telecom). Some companies specialize in owning the use (via leases) of these serious cross-country/sea fiber cables, such as CenturyLink (was Level 3), NTT Com (was Verio), AT&T, and Verizon; these are known as Tier 1.
These large international links have huge capacity and are often beyond the budget of national providers (Hibernia built a straighter fiber across the Atlantic in 2011 from Slough to New York for $300m) and are frequently owned by large international consortiums of telcos; e.g., SEA ME WE 4 (between France and Singapore) is owned by 16 different telcos. To give an idea of the capacity of the international backbone, SEA ME WE 5 has just gone into operation with over 36Tbps available.
Tier 1s sell traffic access to smaller networks known as Tier 2s. Tier 1 companies have a small handful of B2B customers and do not sell to the domestic customer or businesses. It is the Tier 2s that sell to an end customer, often providing services such as Broadband, for example: British Telecom (BT Openreach), Comcast and Easynet (these companies frequently provide mobile services, too). The distinction between Tier 1 and Tier 2 is now slightly blurred as a few of the larger Tier 2s have invested in what was traditional Tier 1 cross-country fiber as well as mobile infrastructure.
The ISP networks connect from the IXPs through a network of exchanges down to your local exchange, and it's from your local exchange that traffic reaches your house, your neighbors, and local businesses. You may want to check out the current and historical usage of the IXPs; looking at the London IXPs' usage, it's actually very difficult to spot any effect from the COVID-19 situation. In the UK, the nonprofit IXP London Network Access Point has cut prices in a move to encourage and help ISPs to expand capacity. The highly competitive UK market, with a large number of nonprofit players, differs significantly from the US market; will we see US vendors follow suit?
Now back to those CDNs. The best known are probably Akamai and Cloudflare. These content distribution vendors partner with the ISPs to install servers and hardware of their own in local exchanges to provide functionality such as load balancing, security, and caching, all with the aim of alleviating bottlenecks and ensuring content is delivered from as close to your house as possible, avoiding the need to go through the uplink and back to the IXP or backbone. For example, if everyone in your street is watching the latest Netflix title, that's likely to be cached in your local exchange and delivered locally. Having their own or partner CDN technology in exchanges hugely benefits content providers, as it reduces their data center and networking costs, especially if they are using expensive public clouds like AWS.
The internet infrastructure within and closest to your home is in many ways the weakest final link. In many areas, there is the potential for genuine contention in your local exchange. If every broadband user on your local exchange has a contract for 2MB down/1MB up, and everyone simultaneously tries to pull down 2MB, the exchange uplink back to the IXPs doesn't necessarily have enough capacity.
If that demand stays within the exchange and requests can be handled there, then contention can be relieved. Similarly, if traffic can be routed within the exchange, avoiding the uplink, this bottleneck is avoided. This means that all that gaming traffic from the local teenagers playing amongst themselves at home at the moment is very unlikely to put pressure on the uplink or main internet structure.
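The contention arithmetic here is easy to sketch. The subscriber count, uplink size, and the share of traffic served locally below are all made-up numbers for illustration; only the 2MB-per-user contract comes from the example above.

```python
def uplink_demand(subscribers: int, per_user_rate: float,
                  local_fraction: float = 0.0) -> float:
    """Traffic that must cross the exchange uplink.

    local_fraction is the share of demand satisfied inside the exchange
    (caches, locally routed traffic), so it never touches the uplink.
    The result is in the same units as per_user_rate.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return subscribers * per_user_rate * (1.0 - local_fraction)


def contention_ratio(demand: float, uplink_capacity: float) -> float:
    """Demand divided by capacity; anything above 1.0 means the uplink is oversubscribed."""
    if uplink_capacity <= 0:
        raise ValueError("uplink_capacity must be positive")
    return demand / uplink_capacity


# Hypothetical exchange: 1,000 subscribers on the 2MB contract, uplink sized for 500.
worst_case = uplink_demand(1000, 2.0)                           # everyone pulls from outside
with_local = uplink_demand(1000, 2.0, local_fraction=0.75)      # 75% served in the exchange

print(contention_ratio(worst_case, 500.0))   # 4.0
print(contention_ratio(with_local, 500.0))   # 1.0
```

In this toy case, keeping three quarters of the demand inside the exchange is the difference between a 4x oversubscribed uplink and one that just copes.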
Some have questioned whether all these folks watching Netflix and gaming at once will exceed the capacities of these local exchanges and cause problems. The experts I talked to seemed to think it unlikely, as these facilities are already sized with surge and peak usage in mind. Indeed here in the UK many Broadband providers have gone on record reassuring users they do not envisage issues, with peak evening viewing already 10x levels normally seen during the day -- there's plenty of capacity for daytime home working and gaming.
An additional factor is that exchange and wider internet infrastructure is planned and implemented in large discrete quanta/units -- it's expensive to add hardware as needed, so the ISPs/CDNs install in big increments to provide for future rather than current capacity. This means there's mostly a lot of spare headroom, which is the reason Broadband providers in the UK have been able to lift data caps. Nevertheless, those local exchange uplinks are a weak point, and in Europe, Netflix and YouTube have taken preemptive steps to protect the uplinks by moderately capping video quality.
The impact on the end user from the above measures is fairly insignificant, but it makes a huge difference to local exchange capacity. In understanding how, though, you also learn an awful lot about streaming.
Media streaming and entertainment
Streaming services, such as Netflix or BBC iPlayer, use something called an adaptive bitrate. It's a concept we in EUC are familiar with from protocols like Citrix HDX. As content is downloaded, the time taken to do so is evaluated and the protocol adjusts based on how well it is faring. TV streaming services, like some EUC protocol configurations, tend to go for the "if bandwidth is there to be had, then let's bump up the quality" approach. Typically, streamed TV and video is downloaded in four- or eight-second chunks and buffered, while the stream is evaluated as to how long it takes to download, say, a four-second chunk. If that chunk takes only one second to download, the adaptive bitrate could be adjusted to a higher resolution that takes 1.3 seconds to download.
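To make the adaptive bitrate idea concrete, here is a minimal sketch of the decision a player might make after each chunk. It is not how Netflix or iPlayer actually do it; the bitrate ladder, the safety margin, and the function names are all my own assumptions.

```python
def pick_bitrate(ladder_kbps, current_kbps, chunk_seconds, download_seconds,
                 safety=0.5):
    """Choose the bitrate for the next chunk from the last chunk's download time.

    ladder_kbps: available encodings (hypothetical values).
    safety: fraction of the measured throughput we are willing to use,
            leaving headroom so the buffer keeps filling.
    """
    if download_seconds <= 0:
        raise ValueError("download_seconds must be positive")
    # Throughput observed while fetching the last chunk.
    throughput_kbps = current_kbps * chunk_seconds / download_seconds
    budget_kbps = throughput_kbps * safety
    affordable = [rate for rate in sorted(ladder_kbps) if rate <= budget_kbps]
    # Fall back to the lowest rung if even that exceeds the budget.
    return affordable[-1] if affordable else min(ladder_kbps)


def predicted_download_seconds(bitrate_kbps, chunk_seconds, throughput_kbps):
    """How long a chunk at bitrate_kbps should take at the measured throughput."""
    return bitrate_kbps * chunk_seconds / throughput_kbps


ladder = [500, 1000, 1500, 2000, 4000]   # hypothetical encodings, in kbps

# A four-second chunk at 1500 kbps arrived in one second: step up.
print(pick_bitrate(ladder, 1500, chunk_seconds=4, download_seconds=1))   # 2000

# The same chunk took eight seconds: drop to the lowest rung.
print(pick_bitrate(ladder, 1500, chunk_seconds=4, download_seconds=8))   # 500
```

With these made-up numbers, the measured throughput in the first case is 6000 kbps, so the next four-second chunk at 2000 kbps should take about 1.3 seconds, in line with the example above.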
It is in fact these chunks that are cached in the local exchanges, so if it's a popular film, viral YouTube K-Pop video, or must-watch live TV show, it's likely these individual chunks are coming from a cache. With a range of pay-for-HD services available and varying network conditions, this means that even if everyone in your street is watching the same film, you might actually all be pulling down different encoded chunks. There's a different copy (a different binary) of each four-second chunk of the original film for every resolution it's encoded in -- the more different bit rates available, the more copies the exchange is handling. Whilst reducing the resolution has the obvious effect of lowering bandwidth usage, it is in fact the effect of vastly increasing what the exchange cache can hold, and so protecting the uplink, that probably led to the decision.
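A rough way to see the cache effect is to count how much space one chunk occupies once every encoding of it is held. The bitrate ladder and the cap are invented numbers, not anything Netflix or YouTube have published.

```python
def bytes_per_chunk_all_encodings(ladder_kbps, chunk_seconds, cap_kbps=None):
    """Cache space needed to hold one chunk at every available bitrate.

    ladder_kbps: hypothetical list of encodings in kilobits per second.
    cap_kbps: if given, encodings above this rate are no longer served,
              so they no longer need to be cached.
    """
    rungs = [rate for rate in ladder_kbps if cap_kbps is None or rate <= cap_kbps]
    total_kbits = sum(rungs) * chunk_seconds
    return total_kbits * 1000 // 8   # kilobits -> bits -> bytes


ladder = [500, 1000, 1500, 2000, 4000]   # hypothetical encodings, in kbps

print(bytes_per_chunk_all_encodings(ladder, 4))                  # 4500000
print(bytes_per_chunk_all_encodings(ladder, 4, cap_kbps=2000))   # 2500000
```

In this toy ladder, dropping the single top encoding frees almost half the space each chunk needs, so the same cache can hold far more content.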
Streamed video and TV can also leverage significant buffering and build in delay to facilitate QoS and continuity. Streamed media is often 30 seconds to one minute behind real time. For example, during New Year's Eve, streaming of Big Ben chiming or the Times Square Ball being dropped will be behind Freeview TV (which is often 20 seconds behind real time) and even further behind analog, which (if it still existed) would be only around 0.5 seconds behind real life.
Of course in some areas (particularly rural), the pipe actually into your house may be rather legacy in nature and resemble little more than a copper wire. In those scenarios, there is little you can do about bandwidth, although DaaS and VDI help, particularly to avoid uploading large data files/videos, as only the screen pixels are streamed and VDI and video streaming protocols are extremely good these days.
Local Wi-Fi is another area where there is potential contention, which isn't all that unusual. Wi-Fi at 2.4GHz has 14 channels (although which are used varies by country), according to the IEEE 802.11 specification, with most users landing on channel 6 or 11 on most routers. This means radio contention between many users, and neighboring channels can also interfere with each other. Contention leads to corrupt and/or dropped packets and jitter. Even things like baby monitors, microwave ovens, or putting your router on the floor can degrade home Wi-Fi (it was an issue the major vendors in EUC long ago acknowledged -- Citrix bought Framehawk to address Wi-Fi issues, and many may remember their excellent explanations of the issues).
With more folks at home, this is an area where you may see a few issues from genuine contention: a neighbor who is usually at work, or your spouse microwaving some soup, may be where the sharp-eyed work-from-home veterans spot a few anomalies. Local mobile networks may see increased congestion and pressure in residential areas during the day in areas where provision wasn't necessarily the best.
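To see why neighboring 2.4GHz channels tread on each other, here is a small sketch using the standard channel plan: channels 1 to 13 sit 5MHz apart starting at 2412MHz, and channel 14 sits at 2484MHz. The 22MHz channel width is an assumption (the classic figure; newer modes are usually quoted as 20MHz), and the overlap test is deliberately simplistic.

```python
def center_mhz(channel: int) -> int:
    """Center frequency of a 2.4GHz Wi-Fi channel in MHz."""
    if channel == 14:
        return 2484                 # channel 14 is offset from the regular 5MHz grid
    if 1 <= channel <= 13:
        return 2407 + 5 * channel   # channel 1 -> 2412, channel 6 -> 2437, ...
    raise ValueError("2.4GHz channels are numbered 1 to 14")


def channels_overlap(a: int, b: int, width_mhz: int = 22) -> bool:
    """True if two channels of the assumed width share any spectrum."""
    return abs(center_mhz(a) - center_mhz(b)) < width_mhz


print(channels_overlap(6, 11))   # False: 25MHz apart
print(channels_overlap(6, 8))    # True: only 10MHz apart
```

Two routers on the same channel at least take turns; two on partially overlapping channels simply look like noise to each other.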
Virtual Private Networks (VPNs) are another area where users may well see issues (more later on signs of this), but it's worth knowing a little about how VPNs work. The VPN client on your machine creates a tunnel: traffic from your computer goes into a virtual network point and then goes to a VPN server within the corporate network, which then routes the traffic. This pipe is encrypted to be private, as we want to prevent malicious third parties from inserting bad stuff into the stream or reading it. This means that, to ensure no injection attacks, VPNs cannot accept data that is even slightly corrupted, and rather than recover, they have to drop and retry data. Since the stream is dependent on previously sent data, this becomes complicated; e.g., we don't want to reveal repeated screens or page reloads by sending them with the same encryption, and so the retry layers become complex, especially when interacting with the TCP/IP layers.
Typically, the TCP/IP in browsers doesn’t know about the VPN retransmits, and so conflicting views of what is new occur. The cumulative effect can mean packet drops are significantly amplified. Many VPNs actually spend a lot of time trying to figure out what their view of the world is and resetting their brain, which is why the experience can be so poor. The lag introduced by a VPN is likely to be more noticeable in real-time use cases where buffering is not an option, namely conference calls. For users used to using Webex, Skype, or Zoom within a corporate network where there was a 10ms latency from user 1 to user 2 and 10ms response, that 20ms round trip could well become 70ms of latency as the round trip becomes 15ms from user 1 to the VPN server, 10ms out to Zoom, 10ms out to user 2, and then back on a reversed path.
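The latency sums above can be checked with a few lines. This is just the arithmetic from the example, assuming the return path mirrors the outbound one.

```python
def round_trip_ms(one_way_hops_ms):
    """Round trip over a list of one-way hop latencies, assuming a reversed return path."""
    return 2 * sum(one_way_hops_ms)


# Inside the corporate network: user 1 -> user 2.
direct = round_trip_ms([10])

# Over a VPN: user 1 -> VPN server -> Zoom -> user 2.
via_vpn = round_trip_ms([15, 10, 10])

print(direct, via_vpn)   # 20 70
```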
Changing usage patterns
It's useful to think about how bandwidth and service usage is likely to change in order to evaluate where people are likely to encounter issues and differentiate between changed usage and increased usage.
Increased use of unified communication software (Zoom, Webex, Skype) by users who previously didn't use it. This includes schools and students for online learning, people having Zoom parties, staying in touch with isolating relatives, and having online meetings they previously had face-to-face.
Internet browsing and service usage changes. Whilst sites like supermarket online delivery portals, online household goods sites, education sites, and news sites are seeing a huge surge in demand, other sites are likely to see significant decreases, such as travel sites, cinema booking sites, OpenTable restaurant booking-type sites, and airline sites. Overall people only have a finite amount of time and money to spend browsing, but the demands have shifted. This suggests that hosting provider provision and cloud usage associated with this use case should balance out, provided individual providers have the ability to scale up and down. Small niche hosting providers specializing in specific sectors may potentially have cash-flow problems, so check your data can be retrieved if they disappear. That so many websites (e.g., major supermarkets and news sites) keep running is owed to good scaling architectures that allow them to auto-scale and maintain service by expanding their cloud resources and networking, often using Kubernetes or containerized architectures.
Corporate traffic moving from within the private network to outside and via the corporate network. The overall demands on the backbone are unlikely to change because of this, but users may well encounter issues as they are utilizing local exchange uplinks more and also frequently using undersized VPN facilities. Many organizations never factored in more than 20% to 30% of their workers being remote and saw it very much as an afterthought, with little thought to mitigation mechanisms (e.g., split view traffic and SD-WAN solutions -- company traffic down VPN, other traffic like web browsing down a normal connection). VPNs bypass ISP content caches, so if you were watching Netflix at home, you would usually get popular content from the nearest Netflix CDN box, probably in your local town for the large ISPs; over a work VPN, though, it would be delivered from the nearest CDN box to the VPN endpoint.
The same is true for Zoom and other video conferencing services. If your VPN endpoint is in a highly connected data center in the same country as the user, this will be fine; however, if it's in your office, probably not so good. Where your organization has thought to place its VPN servers can have a significant impact.
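The "company traffic down the VPN, everything else down a normal connection" idea (often called split tunneling) boils down to a routing decision per destination. Here is a minimal sketch using Python's standard ipaddress module; the corporate address range and the "vpn"/"direct" labels are hypothetical, and real VPN clients do this with routing tables rather than application code.

```python
import ipaddress

# Hypothetical corporate address space; a real deployment would list its own ranges.
CORPORATE_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def route_for(destination: str, corporate_networks=CORPORATE_NETWORKS) -> str:
    """Return "vpn" for corporate destinations and "direct" for everything else."""
    address = ipaddress.ip_address(destination)
    if any(address in network for network in corporate_networks):
        return "vpn"
    return "direct"


print(route_for("10.20.30.40"))    # vpn: an internal server
print(route_for("203.0.113.10"))   # direct: e.g. a streaming or conferencing service
```

Traffic that goes direct can still be served from the ISP's local caches, which is exactly what is lost when everything is forced through the VPN endpoint.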
VPNs by their very design can add a tremendous overhead vs a VDI/DaaS solution. It's very likely many users are going to find "the internet is slow" although their problems will be associated with the limitations of their remote access tools. With reports that networking giant Cisco is having to ration VPN usage for its 100k+ employees and reports that Kaspersky is now enforcing data limits on the free usage thrown in with its AV products, we expect to see a lot of others finding VPN servers strained by volumes of home workers they were not sized for.
The choice between using VDI or a VPN or another technology is determined by many factors, and several vendors who support both have written recent guides around such decisions that are a valuable read. In particular, VMware have written "What's Better in a Pandemic: VDI or VPN?" whilst security vendor Hysolate have released a whitepaper, "Coronavirus Work From Home Whitepaper", which covers VDI, VPN, and three other alternative methods (this one does ask for email details).
Nobody I spoke to felt that an increase in video streaming and online gaming was going to be a significant issue, as the capacity is future sized and designed to deal with huge surges such as the Super Bowl. Surprisingly, many of the networking experts I spoke to recommended the Pornhub IT team's blog, and it is indeed a useful overview of network usage with both sociological and bandwidth data sets.
Corporations cloud bursting and expanding to the cloud. The demand on the big clouds (AWS, Google, and Azure) is likely to soar, with existing customers trying to expand their provision, new customers wading in, and Microsoft themselves releasing many of their services and products free to assist organizations. Additionally, in-demand services that rely on these clouds to burst are likely to be doing so. This is, however, more about the data center than the internet's capacity. Many are turning to DaaS after finding an acute shortage of laptops available and long supply chain delays. Data center hardware is also in high demand, compounded by the fact China is only just ramping up after its own Corona problems hit manufacturing earlier in the year.
Our TechTarget colleagues at ComputerWeekly have a great article that goes into many issues more deeply, including VPN challenges, statistics on networking usage being seen and opinion from network operators.
The fundamental internet networking structure and capacity is in a very good place to handle COVID-19 demands. Poor remote working strategies and technology implementations are likely to impact many organizations though. Individual services, clouds (there have been a few signs of issues with Azure capacity in some regions), and data center capacities however may struggle; whether there is the CPU and storage in the right places in addition to the network is a question for future articles.
I was assisted with some of the technical details of this article by the technical team at a UK specialized security-focused ISP, Mythic Beasts (thanks particularly to Director Pete Stevens), probably best known for supporting the high availability demands of the Raspberry Pi organization and their launches.