Troubleshoot SAN issues to improve performance
Common SAN problems include compatibility issues, hardware failures and sluggish storage response times, but a few proven strategies can easily resolve these issues.
Storage area networks can be complicated and temperamental beasts. This is especially true when they're poorly managed. Troubleshooting is tough because a good design isn't always obvious, and Fibre Channel standards are just loose enough to make interoperability a concern.
Fibre Channel (FC) storage area networks have been largely displaced by iSCSI SANs as the block storage choice for many data centers. But while iSCSI is a lower-cost alternative that is somewhat easier to manage, can use familiar Ethernet networking technology and might share an existing LAN, FC is still the protocol of choice when high-performance block storage is required. As such, it remains an important storage option in most shops, despite the emergence of other alternatives.
It's important to review common FC SAN issues in order to figure out how to diagnose and resolve the issues, or how to prevent problems from occurring in the first place.
A host of things can go wrong in a complex storage network. FC was built from the ground up to support networked storage systems, so while some general networking knowledge can be applied to its management, there is a significant degree of specialization required as well. It should also be noted that over the last few years, FC SAN vendors made array management an easier process by automating some functions and decreasing the number of steps required to do things such as LUN configuration.
That said, maintaining the performance of FC SANs can still be challenging. Most failures fall into one of the following areas, and using the symptoms to narrow a problem down to a probable cause in one of them should speed troubleshooting and resolution:
Compatibility issues
Although FC SANs have been around for nearly three decades, not all devices interoperate well, and many SAN issues result from components that don't work together. All storage vendors publish some form of a support matrix -- typically referred to as a hardware compatibility list (HCL) -- where they document tested and supported configurations of storage array microcode, SAN switch firmware and host hardware/software. A SAN might operate without problems for some time using hardware or software not on the HCL, but the practice is risky and can make troubleshooting performance issues more difficult.
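Checking an environment against a support matrix is easy to automate once the matrix is captured as data. The following is a minimal sketch, not a vendor tool: the component names, version strings and the SUPPORTED table are all hypothetical placeholders for whatever your vendor's HCL actually lists.

```python
# Hypothetical support matrix: component -> versions the vendor has tested.
# Real values would be copied from the storage vendor's published HCL.
SUPPORTED = {
    "array_microcode": {"5.2.1", "5.3.0"},
    "switch_firmware": {"9.1.1b", "9.2.0"},
    "hba_driver": {"14.0.326"},
}


def find_unsupported(installed, supported=SUPPORTED):
    """Return {component: version} for each installed item not on the matrix.

    A component that does not appear in the matrix at all is also reported,
    because "untested" should be treated the same as "unsupported".
    """
    return {
        component: version
        for component, version in installed.items()
        if version not in supported.get(component, set())
    }


# Hypothetical inventory of what is actually running.
installed = {
    "array_microcode": "5.3.0",
    "switch_firmware": "8.2.3",
    "hba_driver": "14.0.326",
}

print(find_unsupported(installed))
```

Running a check like this after every firmware or driver update keeps drift from the supported configuration visible before it becomes a troubleshooting problem.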
Exceeding the capacity limits
It's probably obvious that saturating SAN ports will cause bottlenecks, and those bottlenecks can turn into application problems that might be difficult to diagnose. It's usually pretty easy to look at a host or storage port on the SAN and determine if it's 100% busy, but it's tougher to determine if an overloaded inter-switch link (ISL) is the culprit. Sometimes the I/O itself isn't the bottleneck; instead, limits such as fan-in ratios -- the number of host bus adapters (HBAs) zoned to a storage port -- or the number of switches in a fabric are exceeded, causing connectivity issues.
FC switch vendors typically bundle software that can help detect bottlenecks and possibly even suggest resolutions. There are also third-party applications available, such as the SolarWinds family of products, NetApp's OnCommand apps and IntelliMagic Vision for SAN, that provide insight into SAN operations to track and relieve bottlenecks. These third-party tools generally support several different storage brands and models, so they might be particularly useful in mixed-vendor environments. This class of tools has been around for some time, originally referred to collectively as storage resource monitors; they didn't catch on at first because of their complexity but have slimmed down while adding features and improving usability.
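The count of HBAs zoned to each storage port can be derived from a zoning export even without a monitoring product. Here is a rough sketch under stated assumptions: the zone names, WWNs and the limit of 1 are made up for illustration, and the real limit must come from the array vendor's documentation.

```python
from collections import defaultdict


def fan_in_by_storage_port(zones, storage_ports):
    """Count the distinct HBA WWNs zoned to each storage port.

    zones: mapping of zone name -> iterable of member WWNs.
    storage_ports: set of WWNs known to be array front-end ports.
    Any zone member that is not a storage port is treated as an HBA.
    """
    hbas = defaultdict(set)
    for members in zones.values():
        members = set(members)
        targets = members & storage_ports
        initiators = members - storage_ports
        for target in targets:
            hbas[target] |= initiators
    return {port: len(initiators) for port, initiators in hbas.items()}


def over_limit(fan_in, limit):
    """Return the storage ports whose HBA count exceeds the given limit."""
    return sorted(port for port, count in fan_in.items() if count > limit)


# Hypothetical zoning export.
zones = {
    "z_host1_arrA": ["10:00:00:00:c9:aa:aa:01", "50:06:01:60:00:00:00:01"],
    "z_host2_arrA": ["10:00:00:00:c9:aa:aa:02", "50:06:01:60:00:00:00:01"],
    "z_host3_arrB": ["10:00:00:00:c9:aa:aa:03", "50:06:01:60:00:00:00:02"],
}
storage_ports = {"50:06:01:60:00:00:00:01", "50:06:01:60:00:00:00:02"}

fan_in = fan_in_by_storage_port(zones, storage_ports)
print(fan_in)
print(over_limit(fan_in, limit=1))  # limit of 1 is illustrative only
```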
Incorrect configuration or zoning
Bad or incorrect zoning is one of the most common causes of SAN problems. Maybe that's because zoning is the part of the SAN we change most often. It might also be because zones contain those tricky 16-digit hexadecimal World Wide Names (WWNs), which are easy to mistype.
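Because a WWN is always 16 hexadecimal digits, many typos can be caught before a zone change is applied. The sketch below is one possible check; the function name normalize_wwn and the colon-separated output format are choices made for this example, not a vendor convention.

```python
import re

_SEPARATORS = re.compile(r"[:\-\s]")
_SIXTEEN_HEX = re.compile(r"[0-9a-f]{16}")


def normalize_wwn(text):
    """Return a WWN as eight colon-separated lowercase hex pairs.

    Accepts input with or without colon, hyphen or space separators.
    Raises ValueError if the result is not exactly 16 hex digits, which
    catches dropped digits and look-alike characters such as the letter O.
    """
    digits = _SEPARATORS.sub("", text).lower()
    if not _SIXTEEN_HEX.fullmatch(digits):
        raise ValueError(f"not a 16-hex-digit WWN: {text!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


print(normalize_wwn("10000000C9AAAA01"))

try:
    normalize_wwn("1O:00:00:00:c9:aa:aa:01")  # letter O instead of zero
except ValueError as err:
    print(err)
```

Normalizing every WWN to one format before comparing the documentation with the running zone set also avoids false mismatches caused by case or separator differences.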
Flaky connections and cables
It seems that when fiber cables fail, they rarely fail completely. Instead, they die slowly with intermittent symptoms. On the way to the grave, they often give applications and administrators fits. These issues can be compounded because most SAN environments support several types of cable, so monitoring tools that can return accurate results from a variety of cable media might be helpful.
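One way to spot a slowly dying link is to poll each port's cumulative error counter and look for ports whose count keeps creeping up. The following sketch assumes you already collect such counters (for example, CRC errors) at regular intervals; the port names, sample values and the threshold of two bad intervals are hypothetical.

```python
def intermittent_ports(samples, min_bad_intervals=2):
    """Flag ports whose cumulative error counter grew in several intervals.

    samples: mapping of port name -> list of cumulative error counts,
             one per poll, oldest first.
    Returns {port: number of intervals in which the counter increased}.
    A negative change (counter reset) is not counted as an error interval.
    """
    flagged = {}
    for port, counts in samples.items():
        deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
        bad_intervals = sum(1 for delta in deltas if delta > 0)
        if bad_intervals >= min_bad_intervals:
            flagged[port] = bad_intervals
    return flagged


# Hypothetical counters from six polls.
samples = {
    "switch1/port4": [0, 0, 3, 3, 7, 12],  # errors come and go
    "switch1/port5": [2, 2, 2, 2, 2, 2],   # old errors, nothing new
}

print(intermittent_ports(samples))
```

Looking at the change between polls rather than the raw total matters here, because a port with a large but static counter is usually healthier than one with a small counter that keeps growing.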
Storage array configuration issues
Each brand of storage array is managed a little differently, but all share some basic concepts. LUNs must be created and assigned to an HBA through a front-end SAN port. Problems often arise when the storage administrator makes a typo in configuring the array. Manual creation of LUNs can be an intricate and tedious process, which makes it prone to mistakes.
Host configuration issues
A lot can go wrong on a server. The servers in a networked environment represent a large portion of the SAN component stack, including the volume manager, operating system, multipathing software, HBA driver, HBA firmware and HBA hardware. Each of these components must be configured per the storage vendor's specifications; any deviations from the vendor's prescribed process can cause problems. In most shops, server virtualization has increased the number of operating servers significantly. The sheer number of additional servers compounds configuration issues, and a virtual server will likely require some special setup by server admins as well.
SAN hardware failures
Hardware failures are near the bottom of the list of common SAN problems because, while hardware is usually the first place we look, it's rarely the cause. Today's SAN hardware is very reliable, but hardware does fail occasionally. Common failures that can affect host access are SFP port failures, port card failures and entire switch failures.
Sluggish storage response times
As demonstrated here, a storage network is a complex environment with many components that must be set up properly and monitored carefully, but performance issues might also be caused by the storage devices themselves. The data storage media will have a profound effect on overall SAN performance. Today, most storage arrays include at least some SSDs, so tuning for performance might involve moving data to or from the solid-state storage or perhaps adding more SSDs. If high performance is required across a broad swath of applications, an all-flash array might be warranted. If you're stuck with a hard drive-only array that you need to squeeze some extra performance out of, traditional tweaks such as short stroking a disk drive might provide some extra oomph.
SAN troubleshooting requires an intimate knowledge of the desired configuration and the expected behavior of a particular system. When a problem occurs, it's helpful to home in on the issue by eliminating the properly functioning components in the SAN, hosts and storage.
- SAN. Have any SAN changes occurred recently? Ask around, check the SAN logs and compare the running configuration to the documentation. Is the SAN reporting any related events or errors? Look for failed ports, recent port logouts or fabric rebuilds.
- Host. Can other hosts see the storage in question? Can this host see other storage? Is the HBA logged into the fabric? Have any recent host changes occurred? Are there any SAN-related messages in the host's system message logs?
- Storage. Can other hosts see the storage in question? Is the storage port logged into the fabric? Have any changes occurred on the storage array recently? Are the storage array logs reporting errors?
All the above points of inspection are greatly simplified if change management software is being used. Change management apps can also help alert support staff to any servers or data stores that might be orphaned or might not be included in backup operations.
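The elimination process in the checklist above can be written down as a small decision function. This is only a heuristic sketch of one reasonable reading of those questions; the function name, the order of the checks and the wording of the results are assumptions made for this example, and a real investigation will need the logs as well.

```python
def likely_fault_domain(other_hosts_see_storage, host_sees_other_storage,
                        hba_logged_in, storage_port_logged_in):
    """Map yes/no checklist answers to the area worth inspecting first."""
    if not hba_logged_in:
        return "host: HBA, driver, cable or switch port on the host side"
    if not storage_port_logged_in:
        return "storage: array port, cable or switch port on the array side"
    if other_hosts_see_storage and host_sees_other_storage:
        # Both ends work in general, so suspect what ties this pair together.
        return "configuration: zoning or LUN assignment for this host"
    if other_hosts_see_storage:
        return "host: multipathing, HBA or operating system configuration"
    if host_sees_other_storage:
        return "storage: array configuration or array-side errors"
    return "SAN: fabric-wide event, check switch logs and ISLs"


# Example: everything is logged in, other hosts see the storage, and this
# host sees other storage, yet this host cannot see this particular LUN.
print(likely_fault_domain(True, True, True, True))
```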
Avoid future problems
Check the support matrices
Make a regular practice of reviewing storage vendors' HCLs and other support materials to check your configuration against what's currently supported. Manufacturers are constantly finding new bugs that get fixed in new code. Check for any updates and make it a habit to keep your software versions current and supported -- it will help avoid a lot of problems.
Document the SAN
This one is huge. When troubleshooting a problem, it's extremely important to understand the original design intent of the SAN environment. Make sure the documentation records hosts, HBAs, WWNs and where they connect. It should include the storage, storage ports and their WWNs. Finally, the SAN documentation should describe the fabrics, ISLs, zone sets, zones and zone members.
If the original design document doesn't exist, you should be able to use a SAN management or change management application to discover and inventory all network devices -- and, in many cases, key configuration information such as network addresses can also be included in the inventory.
Baseline the SAN performance
Unless you record what's happening on an average day, it will be tough to determine if a busy port is normal or the culprit during a problem. Minimally, record the average port utilization for every port in the SAN. If you use a SAN monitoring tool, it can probably do this for you -- in fact, once acceptable performance thresholds are established, most monitoring apps will send email or text alerts when those thresholds are breached. SAN monitoring apps also provide dashboards for real-time insights into network status and individual network components.
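If no monitoring tool is available, a basic baseline can be built from recorded utilization figures. The sketch below assumes utilization is stored as a percentage per port per day; the port names, the sample numbers and the three-standard-deviation threshold are illustrative choices, not recommendations from any vendor.

```python
from statistics import mean, pstdev


def build_baseline(history):
    """Return {port: (average, standard deviation)} from recorded samples."""
    return {port: (mean(values), pstdev(values))
            for port, values in history.items()}


def unusual_ports(baseline, current, sigmas=3.0):
    """List ports whose current utilization is well above their own normal.

    A port is flagged when its reading exceeds average + sigmas * deviation.
    Ports without a baseline are skipped rather than guessed at.
    """
    flagged = []
    for port, value in current.items():
        if port not in baseline:
            continue
        average, deviation = baseline[port]
        if value > average + sigmas * deviation:
            flagged.append(port)
    return sorted(flagged)


# Hypothetical daily average utilization (%) for two ports.
history = {
    "port1": [40, 42, 38, 41, 39],
    "port2": [85, 88, 90, 87, 86],
}
current = {"port1": 75, "port2": 89}

baseline = build_baseline(history)
print(unusual_ports(baseline, current))
```

In this made-up data, port2 is the busier port but is behaving normally, while port1 at 75% is far outside its usual range, which is exactly the distinction a baseline is meant to expose.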
Plan your changes
To avoid administrator-induced outages, use the SAN documentation to define changes before they happen. If you're making any decisions about what to do when you're executing the change, you're doing it wrong. Also, it's too easy to forget to document a change after it has occurred. Some change management applications will also let you do "what if" analyses to test the effects of an anticipated change to the SAN environment or the storage systems connected to it.
Back up the configurations
After every day of SAN changes, back up and safely store the switch configuration. This will ensure that you can roll back changes quickly from a backup if a switch fails or gets totally messed up during a change. To be even safer, configure your backup application to regularly back up all key config files during daily data backup operations.
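Saved configurations are also useful for seeing exactly what changed between two points in time. Here is a minimal sketch using Python's standard difflib module; the configuration text shown is an invented format for illustration and does not match any particular switch vendor's export.

```python
import difflib


def config_diff(previous, current):
    """Return unified-diff lines between two saved configuration texts."""
    return list(difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    ))


# Hypothetical configuration snapshots from two consecutive backups.
yesterday = """zone z_host1
zone_member z_host1 10:00:00:00:c9:aa:aa:01
zone_member z_host1 50:06:01:60:00:00:00:01
"""
today = """zone z_host1
zone_member z_host1 10:00:00:00:c9:aa:aa:01
zone_member z_host1 10:00:00:00:c9:aa:aa:02
zone_member z_host1 50:06:01:60:00:00:00:01
"""

for line in config_diff(yesterday, today):
    print(line)
```

An empty diff confirms that nothing changed since the last backup, and a non-empty one gives a ready-made record to check against the planned change.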
Troubleshooting SAN issues can be a relatively easy process when certain things are under control and the networking environment is well mapped. Make these best practices part of your daily SAN health regimen to prevent a bigger problem when something does go wrong.