Troubleshooting and identifying data storage performance bottlenecks
Enterprise data storage performance bottlenecks that can clog ports, controllers and disk drives require a mix of tools and IT expertise to find and solve. Learn how to troubleshoot the most common storage bottlenecks and how you can avoid them.
The most common locations for data storage performance bottlenecks and problems are the front-end ports into a storage array, the controllers and the disk drives. The challenge lies in figuring out which congestion points are causing an application to perform poorly, or if it's even the storage at all.
Virtual server technology can exacerbate the problem, raising the prospect of overcommitted resources, especially if communication is poor between the application, server and storage administrators.
But there are tools for pinpointing a storage performance bottleneck. These include element managers built into storage arrays, more sophisticated storage resource management (SRM) software or specialized performance monitoring applications such as Akorri Network Inc.'s BalancePoint and Tek-Tools Software Inc.'s Profiler.
But none of those tools can help if you don't know where to look.
"It's like a treasure hunt," said Valdis Filks, a research director at Gartner Inc. "It takes experienced people to do storage performance problem determination."
Five common scenarios that cause data storage performance bottlenecks
The following are the most widespread scenarios for storage bottlenecks:
1. Virtual servers may be all the rage, but they also raise the specter of new and different scenarios for storage bottlenecks. Setting up a virtual server environment with too many virtual machines (VMs) per data store is a common mistake that leads to bottlenecks. Another stems from the movement of VMs from one physical server to another, according to Dave Bartoletti, a senior analyst and consultant at Taneja Group.
Using VMware Inc.'s technology, IT organizations typically move virtual machines manually through the vendor's VMotion utility or automatically through its distributed resource scheduling (DRS) capability. Bartoletti said an administrator might, for instance, tell DRS not to let any individual server exceed 60% CPU utilization or 75% memory utilization, and if either does happen, to rebalance the virtual machine or shift it to a less loaded server. As the VM moves to a different host server, it may access storage through a different LUN, one that another host is already using.
"You could have more virtual machines than you planned for all hitting a particular storage controller, and your storage admin might not know what happened," Bartoletti said.
He advised administrators to think about the number of virtual machines they can support per array port, as well as the disk's read/write performance because virtual machines may make different types of requests depending on the applications they're running. One or two especially busy VMs on any given physical server can consume most of the CPU, memory, network bandwidth and disk I/O.
If one or two VMs are running especially I/O-intensive applications and hammering a disk, other VMs that use the same data store may suffer the effects of disk I/O contention. This problem is more difficult to identify than overloaded CPU or memory, according to Brian Radovich, lead product manager at Tek-Tools.
"Sometimes 20% of your VMs can consume 80% of your resources," he said. "You need something to identify what those are."
VMware's Storage VMotion, released in December 2007, addresses the issue, allowing an administrator to migrate running VM disk files off of the overworked LUNs/arrays. The older VMotion utility provided the ability to migrate a running VM from one physical server to another, shifting its memory and CPU utilization, but the storage remained on the same volume.
"If a VM is not able to get the appropriate amount of physical memory, it has to swap to disk which leads to more IOPS [I/O operations per second] to the storage array," Radovich wrote in an email to SearchStorage.com. He emphasized the importance of understanding the memory and disk I/O generated by VMs in order to optimize and eliminate unwanted IOPS to the storage array.
Radovich said an application running in a VM suffers if another VM in the same data store prevents access to CPU, disk and memory resources in a timely manner.
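Radovich's point about a minority of VMs consuming most of the resources can be checked with very little code once per-VM I/O figures are available. The following is a minimal sketch, not a vendor tool: the function name, the VM names and the IOPS numbers are all hypothetical, and the 80% cutoff simply mirrors the figure in his quote.

```python
def heaviest_vms(iops_by_vm, share=0.8):
    """Return the smallest set of VMs, busiest first, whose combined IOPS
    reach `share` of the data store's total."""
    total = sum(iops_by_vm.values())
    if total <= 0:
        return []
    picked, running = [], 0.0
    ranked = sorted(iops_by_vm.items(), key=lambda kv: kv[1], reverse=True)
    for name, iops in ranked:
        picked.append(name)
        running += iops
        if running >= share * total:
            break
    return picked


# Hypothetical per-VM IOPS samples for a single data store.
sample = {
    "vm-mail": 4200, "vm-db": 3900, "vm-web1": 450, "vm-web2": 400,
    "vm-file": 350, "vm-test1": 200, "vm-test2": 180, "vm-dns": 120,
    "vm-print": 110, "vm-mon": 90,
}
print(heaviest_vms(sample))  # -> ['vm-mail', 'vm-db']
```

In this made-up sample, two of the 10 VMs account for just over 80% of the I/O, which makes them the first candidates to move to a different data store.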
2. When many users share access to a business application, whether an email server, an enterprise resource planning (ERP) system or a database, requests may build up in the queue. The response time for each I/O starts to increase, short delays turn into bothersome waits, and calls to the help desk ensue.
The general profile for such a response-time-sensitive application is many requests, random in nature, more reads than writes, and small I/O size. Optimally, the workload is spread over many drives. If not, a bottleneck may result.
If the application adds more users, or if the application grows to require more IOPS, more drives will likely need to be added to the RAID group, or data may have to be striped at a different level over more drives.
But Brian Garrett, technical director of the Enterprise Strategy Group Lab, pointed out that even though "storage often gets pointed to as a culprit, most of the time, it's not storage. It could be the network. It could be an application or a server that's misbehaving."
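When the problem does turn out to be too few drives behind a busy application, a first-pass estimate of how many are needed is simple division. The sketch below is illustrative only: the function name is hypothetical, and the 180 IOPS per drive figure is an assumed placeholder rather than a number from any vendor or from the analysts quoted here. Substitute a value measured on your own hardware.

```python
import math


def drives_needed(required_iops, iops_per_drive):
    """Round up to the number of drives needed to serve `required_iops`,
    ignoring RAID overhead, cache effects and controller limits."""
    if required_iops < 0 or iops_per_drive <= 0:
        raise ValueError("IOPS figures must be positive")
    return math.ceil(required_iops / iops_per_drive)


# Assumed example: the application needs 2,400 IOPS, and each drive
# is assumed to deliver about 180 random IOPS.
print(drives_needed(2400, 180))  # -> 14
```

The estimate is deliberately rough. It says nothing about whether the controller or the front-end ports can keep up once the drives are added.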
3. Bandwidth-intensive applications -- such as data backup, video streaming/editing or security logging -- that tend to have few simultaneous users accessing large files or data streams encounter their share of bottlenecks.
To isolate the trouble spot, Garrett recommends administrators start at the backup servers and work their way down to the drives, because the problems could occur anywhere along that path.
"It's not always the fault of the storage," he said. "It could be the way the backup application is set up or the way the tape system is working. But [the storage team] will get a call when backups aren't finishing the way they used to."
If the bottleneck is traced to storage, it could be caused by an insufficient number of drives to service the I/Os, contention at the controller, or inadequate bandwidth at the array's front-end ports.
Performance must be tuned for different types of application workloads. Tuning for large files and streaming performance isn't optimal for small files, and vice versa, said Marc Staimer, president of Dragon Slayer Consulting.
"That's why there tends to be a balance in most storage systems, in which you try and find the right medium for your system," Staimer added. "You tend to either optimize for throughput or IOPS, but not necessarily both at the same time."
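The trade-off Staimer describes follows from the arithmetic that links the two measures: throughput is IOPS multiplied by the average I/O size. A minimal sketch, with a hypothetical function name and made-up workload numbers, shows how a small-block workload can post a high IOPS figure while moving little data, and a streaming workload the reverse.

```python
def throughput_mb_per_s(iops, io_size_kb):
    """Throughput in MB/s from IOPS and average I/O size in KB
    (1 MB is taken as 1,024 KB)."""
    return iops * io_size_kb / 1024


# Assumed small-block, database-style workload: 5,000 IOPS at 8 KB.
print(throughput_mb_per_s(5000, 8))     # -> 39.0625

# Assumed streaming or backup-style workload: 200 IOPS at 1,024 KB.
print(throughput_mb_per_s(200, 1024))   # -> 200.0
```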
4. A drive failure in a RAID group. Performance degrades, especially with a RAID 5 scenario, as the system seeks out the parity data to rebuild. The rebuild operation impacts performance, more with writes than reads.
Even though the failed drive was the original cause of the problem, the controller can become the bottleneck as it keeps trying to serve data during the rebuild process. Performance returns to normal when the rebuild is complete.
5. A new application is deployed, and the volumes presented reside on the same drives that handle the busy email system. If the new application gets busy, the performance of the email system will experience the effects. The additional traffic eventually could overwhelm the drives.
Where can bottlenecks occur?
While easing enterprise data storage performance bottlenecks can require a mix of storage tools and IT expertise, identifying the most common areas for storage bottlenecks can be the first step in troubleshooting them.
Key metrics to monitor
Tek-Tools' Radovich remembers a time when array vendors stressed IOPS and throughput, or "speeds and feeds," but now the main metric that everyone wants to talk about is response time. "It's not how fast you can move the data," he said, "but how fast you can respond to the request."
Gartner's Filks said you can expect a response time of 4 ms for 15,000 rpm Fibre Channel disks, 5 ms to 6 ms for SAS disks, about 10 ms for SATA disks and less than 1 ms for SSDs.
"If you have all Fibre Channel disks, and your response time is 12 milliseconds, something's wrong," Filks said. "It's the same thing if you're buying SSDs. If you have SSDs and your response time is five milliseconds, something is wrong. It may be a connection. You may have some faulty chips. Call the supplier. He will help you drill down to where the problem is, hopefully."
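Filks' rule-of-thumb figures lend themselves to a simple sanity check on measured response times. The sketch below is an illustration rather than a monitoring product: the table keys and the function name are hypothetical, and the tolerance multiplier of 2 is an assumption chosen so that both of his "something's wrong" examples are flagged.

```python
# Expected response times in milliseconds, taken from Filks' figures above
# (the upper bound is used where he gave a range).
EXPECTED_MS = {"fc_15k": 4.0, "sas": 6.0, "sata": 10.0, "ssd": 1.0}


def looks_slow(drive_type, observed_ms, tolerance=2.0):
    """Return True when the observed response time exceeds the expected
    figure for the drive type by more than the (assumed) tolerance factor."""
    return observed_ms > EXPECTED_MS[drive_type] * tolerance


print(looks_slow("fc_15k", 12.0))  # Filks' Fibre Channel example -> True
print(looks_slow("ssd", 5.0))      # Filks' SSD example -> True
print(looks_slow("sas", 5.5))      # within the expected range -> False
```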
In addition to response time, other key metrics to monitor include:
- Queue depth, or the number of requests held in queue at one time; average disk queue length.
- Average I/O size in kilobytes.
- IOPS (reads and writes; random and sequential; average of overall IOPS).
- Throughput in megabytes per second.
- Write percentage vs. read percentage.
- Capacity (free, used and reserve).
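Several of these metrics are tied together. By Little's Law, a general queueing result, the average number of requests outstanding equals the arrival rate multiplied by the average time each request spends in the system. Applied to storage, that gives a quick cross-check between IOPS, response time and queue depth. The function name and sample numbers below are hypothetical.

```python
def avg_outstanding_ios(iops, response_time_ms):
    """Little's Law: average requests in flight = arrival rate x time in system.
    Response time is converted from milliseconds to seconds."""
    return iops * response_time_ms / 1000.0


# Assumed example: the same 2,000 IOPS load at two response times.
print(avg_outstanding_ios(2000, 4))   # -> 8.0
print(avg_outstanding_ios(2000, 12))  # -> 24.0
```

If the workload is steady and response time triples, the number of requests waiting triples with it, which is why a growing queue and a rising response time usually show up together.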
Use storage performance monitoring tools to fix bottlenecks
Enterprise data storage performance bottlenecks can be addressed by a mix of storage-related tools, including operating system tools, SRM software, SAN monitoring applications and more. Read this story to learn more about using storage monitoring tools to address performance bottlenecks.
Data storage performance tips and best practices
Here are more tips and best practices for ramping up the performance of your data storage.
- Don't allocate storage based simply on free space. Take into account performance needs. Make sure you have enough drives for the throughput or IOPS you need.
- Distribute application workload evenly across disks to reduce the chance of hotspots.
- Understand your application workload profile, and match your RAID type to the workload.
For instance, Akorri Network Inc.'s CTO Rich Corley recommends RAID 1 over RAID 5 for write-intensive applications "because when you're doing a write on RAID 5, you have to calculate the parity, and that calculation takes time to do. With RAID 1, the write goes to the two drives much faster. There's no calculation. There's no math involved."
- Match the drive type -- Fibre Channel, SAS, SATA -- to the performance you expect. Use higher performing hard disk drives, such as 15,000 rpm Fibre Channel, for mission-critical business applications.
- Consider solid-state drive (SSD) technology for I/O-intensive applications, but not for applications where write performance is important.
"SSDs are a great panacea for read performance, as long as it doesn't hit the bottleneck of the controller," Staimer said. "But it's not at all a panacea for write performance. In fact, it's slower than most enterprise-class drives."
- Seek tools that do end-to-end monitoring, especially for virtual server environments.
"In general, you need software that pierces the veil of the firewall," Staimer advised. "There's a firewall between the virtual side and the physical side. So, you need something that looks at both, end to end."
- Weigh the pros/cons of short-stroking to boost performance.
Formatting a hard disk drive in such a way that data is written only to the outer tracks of the disk's platters can increase performance in high I/O environments because it reduces the distance the drive actuator must travel to locate the data. The downside of short-stroking is that a substantial portion of the disk drive's capacity is unused.
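To put rough numbers on the RAID 1 versus RAID 5 comparison in the tips above, a common planning shortcut is the "write penalty": the number of back-end disk operations each front-end write generates. The sketch below assumes the widely used rule-of-thumb values of 2 for RAID 1 and 4 for RAID 5; these are not figures from Corley, real arrays with write cache may behave differently, and the function name is hypothetical.

```python
# Rule-of-thumb back-end operations per front-end write (assumed values).
WRITE_PENALTY = {"raid1": 2, "raid5": 4}


def backend_iops(frontend_iops, write_fraction, raid_level):
    """Estimate disk-level IOPS for a front-end load, counting each read
    once and each write WRITE_PENALTY[raid_level] times."""
    reads = frontend_iops * (1 - write_fraction)
    writes = frontend_iops * write_fraction
    return reads + writes * WRITE_PENALTY[raid_level]


# Assumed workload: 1,000 front-end IOPS, half of them writes.
print(backend_iops(1000, 0.5, "raid1"))  # -> 1500.0
print(backend_iops(1000, 0.5, "raid5"))  # -> 2500.0
```

Under these assumptions, the same write-heavy workload asks noticeably more of the drives on RAID 5 than on RAID 1, which is consistent with Corley's recommendation.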