A client contacted us regarding slow network performance after a network speed check. The client needed to transfer large computer-aided design files -- 0.5 GB to 1 GB -- from its central storage system to remote workers, some of whom were as far as 15 milliseconds (ms) of network latency away. Before spending a lot of money on 1 Gbps WAN links, the client wisely decided to verify the desired operation using a WAN simulator. The file storage system was connected to the data center switch via a 10 Gbps Ethernet link. The client path was a 1 Gbps Ethernet connection from the switch to the WAN simulator, then another 1 Gbps link to the client. See Figure 1.
The customer wanted to achieve 800 to 900 Mbps of delivered data, with no other significant traffic volume competing for the link bandwidth. The WAN simulator was initially set up for 0 ms of latency. The file transfers proceeded as expected, running at the desired throughput. However, when the anticipated 15 ms of latency was introduced, the throughput was significantly reduced. The best throughput to the client was about 420 Mbps and was frequently as low as 100 Mbps. Why was there such a significant difference in performance in the network speed check?
Looking at the data
We obtained packet captures at the file storage system and at the client, and we imported them into Wireshark. Analysis on a per-packet basis was not going to be useful due to the number of packets in the 731 MB file transfer. Instead, we used the TCP sequence space graphing option of Wireshark -- select a packet in the flow, then choose Statistics > TCP Stream Graphs > tcptrace. The overall sequence space graph looks OK.
But looking closer, we find that there is a significant amount of packet loss a few seconds into the transfer.
The transfer starts off fine but then encounters a burst of packet loss. It takes about three-quarters of a second for the systems to recover and resume the transfer. The rate of transfer is lower after the packet loss, as indicated by the change in slope of the sequence number graph. Our analysis showed no further packet loss. The transfer of 731.5 MB took 17.28 seconds, or 42.33 MBps -- 338 Mbps of user data. We expected to see the transfer complete in about seven seconds, not 17 seconds. This was one of the better throughput tests. Sometimes, the tests showed throughput as low as 70 to 100 Mbps.
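The arithmetic behind these figures is straightforward. A minimal sketch, using the numbers reported above (the 850 Mbps target is our assumption, the midpoint of the desired 800 to 900 Mbps range):

```python
# Sanity check on the observed transfer numbers from the capture.
FILE_MB = 731.5       # file size, megabytes
ELAPSED_S = 17.28     # observed transfer time, seconds
TARGET_MBPS = 850     # assumed midpoint of the 800-900 Mbps goal

throughput_mbps = FILE_MB * 8 / ELAPSED_S   # observed delivered rate
expected_s = FILE_MB * 8 / TARGET_MBPS      # time the transfer should take

print(f"observed throughput: {throughput_mbps:.0f} Mbps")
print(f"expected time at {TARGET_MBPS} Mbps: {expected_s:.1f} s")
```

The observed rate works out to roughly 339 Mbps, and the expected completion time at the target rate is just under seven seconds, matching the figures in the capture analysis.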
Analysis of the network speed check
The storage system started sending data and thought it had a 10 Gbps path to the client, although with 15 ms of latency. That's a bandwidth-delay product of about 18 MB. When the buffers in the network equipment fill, a lot of packets are dropped. It took the storage system about 700 ms to retransmit the lost data. It should then resume transferring data with slow-start and ramp back up. Further analysis was required.
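The bandwidth-delay product quoted above is simply the path bandwidth multiplied by the round-trip time. A quick sketch of that calculation, assuming the storage system sizes its sending window against its local 10 Gbps link and the 15 ms simulated latency:

```python
# Bandwidth-delay product: how much data can be "in flight" on the path.
link_bps = 10e9                 # storage system's local link: 10 Gbps
rtt_s = 0.015                   # 15 ms of simulated WAN latency
bdp_bytes = link_bps * rtt_s / 8

print(f"BDP: {bdp_bytes / 1e6:.1f} MB")   # roughly 18-19 MB
```

That is roughly 18 times more data in flight than the 1 Gbps bottleneck link can carry, which is why the switch buffers fill and drop packets.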
We had the storage vendor look at the system while running another test. It reported the congestion window parameter in the storage system's TCP code was reduced to a value of 1 as a result of the packet loss. It also indicated the TCP stack in the storage system is based on the TCP Reno code, which is a very old implementation. The internal congestion window gets set to 1 whenever significant packet loss occurs. The congestion window is not transmitted as part of a packet, so it required monitoring the storage system internals during a transfer to detect that this was happening. TCP Reno then uses an additive algorithm to ramp the congestion window back up, so the transmit window increased by only one packet for every round-trip time.
With the 731 MB file, the storage system never encountered congestion again, simply because the remaining file-transfer time was not large enough to grow the transmit window to the point where it would cause further packet loss. Looking at Figure 3, there is a difference in the geometric ramp-up before the packet loss and the additive ramp-up after the packet loss.
The storage system thinks it is running over a 10 Gbps path as it ramps up. The switch then drops a bunch of packets because its internal buffers fill. The storage system's TCP stack cuts the congestion window back to one packet. It takes about 700 ms for the storage server to recognize and retransmit the dropped packets. The mechanism to increase the transmit window size is additive: one packet is added to the transmit window for each successful round trip. For 700 MB files, the window never grows to the point of congestive packet loss. Research on the Reno TCP stack found numerous references to problems operating over high-latency networks.
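A simplified model shows just how slow this additive recovery is at 15 ms of latency. The sketch below grows the window by one maximum segment size (MSS) per round trip, starting from one segment; the variable names are ours, and the model ignores Reno's brief slow-start phase below ssthresh, so it slightly understates early growth:

```python
# Simplified model of additive window growth after the loss event.
MSS = 1460                           # typical TCP payload over Ethernet, bytes
RTT = 0.015                          # 15 ms round-trip time
LINK_BPS = 1e9                       # the 1 Gbps bottleneck link
target_window = LINK_BPS * RTT / 8   # window needed to fill the link (~1.9 MB)

cwnd = MSS                           # window was cut to one segment by the loss
sent = 0                             # bytes delivered during the ramp-up
rtts = 0
while cwnd < target_window:
    sent += cwnd                     # one window of data per round trip
    cwnd += MSS                      # additive increase: +1 MSS per RTT
    rtts += 1

print(f"RTTs to fill the 1 Gbps link: {rtts} (~{rtts * RTT:.0f} s)")
print(f"data sent while ramping: {sent / 1e6:.0f} MB")
```

Under this model, refilling even the 1 Gbps link takes on the order of 1,300 round trips, nearly 20 seconds, and the sender delivers more data during the ramp than the entire 731 MB file contains. That is consistent with the observation that the transfer finishes before the window ever grows large enough to cause further loss.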
It's not the network
In this case, the problem revealed in the network speed check was not a network problem; rather, the culprit was the old TCP code in the storage system controller. This problem is a simple congestive overload of a low-speed link -- the 1 Gbps link to the remote client -- by a high-speed source, or the 10 Gbps interface between the storage system and the switch.
Adding buffers to the switch in the path would only delay the time at which the packet loss would occur. The same problem can occur when several sources are congesting a single link. It is best to use quality of service (QoS) to prioritize traffic and discard less important packets. If that's not possible, then use weighted random early detection to begin discarding as soon as congestion starts to build, providing negative feedback to the sources.
An interesting aspect of the problem is that transfers run at 800 Mbps if the 10 Gbps link is replaced with a 1 Gbps link. The storage system doesn't encounter significant packet loss and, therefore, doesn't reduce the congestion window. A little packet loss occurs as the systems reach the link capacity, but not enough to cause the storage system's Reno-based code to shut the congestion window and switch to additive slow-start.
What about workarounds? We postulated several different workarounds to the speed mismatch:
- Use QoS to police at 1 Gbps to emulate a 1 Gbps path;
- Use the 10 Gbps Ethernet pause frames to tell the storage system to delay sending;
- Configure a 1 Gbps interface and use policy routing, domain name system and Network Address Translation to force the remote client traffic over the 1 Gbps path.

One of the vendors assembled its own tests that validated what we were seeing, and it reported the pause frames and QoS didn't work. (This result was surprising, as we were sure that policing at 1 Gbps would work. We want to obtain packet-capture files of the QoS policing tests to try to understand why it didn't work.) Our proposal to use a 1 Gbps interface between the storage system and the switch seems to be the only viable option.
This was an interesting problem to diagnose. The customer is deciding how to proceed. We are continuing to think about the problem and will analyze the QoS packet flows if we can get a packet-capture file.
Interestingly, we became aware of another engineering firm that had a similar problem, shortly after delivering our analysis report. Its solution was to move its data from in-house storage systems to a cloud-based storage vendor, with Panzura caching systems at each remote site. It was able to switch to lower-cost internet connections with higher-bandwidth VPNs, which helped increase the throughput. We thought that was a creative solution, but it required a change to the business processes and infrastructure, which took several months to implement.