How to diagnose and troubleshoot database performance problems

Database performance problems can wreak havoc with Web site performance and cost your customers lots of money. Diagnosing problems is much easier when you rely on this systematic approach.

Service provider takeaway: Database service providers should follow an established methodology when troubleshooting database performance problems.

Relational database management systems (RDBMSes), such as SQL Server, Oracle, DB2 and Sybase, are highly scalable and capable of responding to thousands of requests per second. Mission-critical applications depend on highly responsive database systems to deliver sub-second response times to their clients. Unfortunately, performance problems on a database can drive your customer's Web users to another site, causing significant financial losses to the company.

It is essential that service providers use a proven methodology to diagnose and troubleshoot performance problems in customers' databases. Following a methodology enables you to approach the diagnosis of problems in an orderly, logical manner, which increases the chances that you'll find the exact cause of the problem quickly and accurately. Without a methodology, the service provider ends up treating symptoms with no clear understanding of the underlying problem, or of whether the solution offered has actually solved it.

What causes performance problems?
  • Inefficient code: The most common cause of a performance problem is inefficient T-SQL code -- for example, queries that do not use appropriate indexes, cursor-based operations instead of set-based operations, or inefficient joins. Dynamic management views (DMVs) frequently allow you to isolate the stored procedures or queries that are the heaviest disk I/O or CPU consumers for further investigation.
  • Schema changes: Schema changes that are not load-tested can cause performance problems. Ensure that schema changes are not only regression-tested but load-tested in a representative environment under a representative load, so they do not introduce unanticipated performance problems.
  • Hardware failure: Occasionally a hardware component on your database server will fail, degrading performance. For example, a network card may start transmitting bad packets that cause network performance problems. A hard drive failure in your RAID array forces I/O to be redistributed over the remaining drives in the set, increasing the load on each drive and slowing your disk subsystem's response.
  • Misconfigured driver: Check that your network driver settings are correct for your network, that your server is configured to favor network applications rather than file sharing, and that your IDE/disk controller is configured for optimal throughput; these drivers and components are frequently misconfigured. Driver corruption can also cause performance problems localized to one machine or user profile.
  • Updated system components: Vendor patches may cause unexpected performance problems.
  • Disk I/O bottlenecks: These bottlenecks can be caused by index fragmentation, stale statistics or poor index selection.
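As a sketch of the DMV approach mentioned above, the following query (SQL Server 2005 and later) lists the cached statements that have consumed the most CPU. The DMVs and columns used are standard, but treat the query as a starting point rather than a complete diagnostic:

```sql
-- Top 10 cached statements by total CPU time (SQL Server 2005+).
-- Order by total_logical_reads or total_physical_reads instead of
-- total_worker_time to rank by I/O consumption.
SELECT TOP 10
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    SUBSTRING(st.text,
              (qs.statement_start_offset / 2) + 1,
              ((CASE qs.statement_end_offset
                    WHEN -1 THEN DATALENGTH(st.text)
                    ELSE qs.statement_end_offset
                END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

Note that sys.dm_exec_query_stats reflects only plans currently in the cache, so figures are reset when plans are evicted or the server restarts.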

What changed?

The key question in diagnosing problems is: What changed? Service providers must be able to go back to a point in time when performance was acceptable to detect whether a change to the environment (the application using the database, the SQL code or the database schema) has caused a problem. If there have been no changes to the system, the problem could be related to a hardware failure, driver misconfiguration or updated system component. Another possibility is that the load on the database has outgrown hardware capacity and the RDBMS can no longer respond with acceptable performance.

With change management systems in place, it will be easy for service providers to isolate the code or schema changes responsible for the performance problems and to fix them or revert to an earlier version of the code. Several commercial change management tools are available, such as Quest's Change Director and Idera's Change Manager.

Examining baseline and benchmarking metrics

Baselines and benchmarks will also help service providers determine what has changed on the system and where the bottleneck lies. A baseline is performance data gathered under a simulated or actual load. The same metrics can later be gathered and compared with the baseline to determine how they have changed under the current workload; these comparisons provide a quick window into bottlenecks. For example, comparing current performance counters with the baseline might reveal higher-than-normal disk I/O patterns. Memory bottlenecks normally manifest themselves as memory errors in the event log, and disk I/O bottlenecks typically manifest as transitory performance problems. CPU bottlenecks typically manifest as high CPU utilization, but disk I/O bottlenecks can also drive up CPU utilization.
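On SQL Server, one lightweight way to capture a baseline is to snapshot a few key counters from the sys.dm_os_performance_counters DMV into a table for later comparison. The table name and the choice of counters below are illustrative, not prescriptive:

```sql
-- Illustrative baseline capture: snapshot selected SQL Server counters
-- with a timestamp so later runs can be compared against this one.
CREATE TABLE dbo.PerfBaseline (
    capture_time DATETIME      NOT NULL DEFAULT GETDATE(),
    counter_name NVARCHAR(128) NOT NULL,
    cntr_value   BIGINT        NOT NULL
);

INSERT INTO dbo.PerfBaseline (counter_name, cntr_value)
SELECT RTRIM(counter_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec',
                       N'Page life expectancy',
                       N'Buffer cache hit ratio');
```

Keep in mind that some counters (such as Batch Requests/sec) are cumulative since server start, so a meaningful rate requires two snapshots taken a known interval apart. Operating-system counters (disk queue length, CPU utilization) would still be collected with PerfMon alongside this.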

Benchmarks are performance counters collected during a variety of loads. They are used to determine how your server will respond under load and what bottlenecks will exist. For example, you could run a load test to determine how the database system will perform during the holiday rush.

While baselines let you compare performance at various points throughout the lifetime of your system, benchmarks allow you to compare performance under various workloads. For example, you might want to have a benchmark for how the system responds during peak demand cycles. This benchmark can be compared with current performance counters to see whether the system's response is unexpected for that workload.

Let's look at an example of how this works. You have a baseline and benchmarks for various workloads on your RDBMS. You know your baseline for transactions per second (tps) is 1,000 tps. Your monitoring reveals that tps is dropping while your RDBMS is under peak holiday-season load. You compare the current tps with the benchmark you recorded for peak demand and discover that the current value is below even that benchmark. Clearly something is wrong and needs to be addressed. Further examination shows that the CPU counters are running much higher than the values you recorded for that workload. With this information, you can quickly start diagnosing what is consuming the CPU.
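Because the Batch Requests/sec counter is cumulative, computing an actual per-second rate for a comparison like the one above takes two samples. A rough sketch (the 10-second sampling window, and the use of batch requests as a stand-in for transactions, are assumptions):

```sql
-- Sample the cumulative 'Batch Requests/sec' counter twice, 10 seconds
-- apart, and derive the average rate over that window.
DECLARE @first BIGINT, @second BIGINT;

SELECT @first = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Batch Requests/sec';

WAITFOR DELAY '00:00:10';

SELECT @second = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Batch Requests/sec';

SELECT (@second - @first) / 10 AS batch_requests_per_sec;
```

The same two-sample technique applies to any cumulative counter you want to trend against a baseline or benchmark.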

Baselines and benchmarks can also quickly identify changing load patterns, which may dictate the need for more powerful hardware.

In summary, benchmarks and baselines provide a basis to:

  • See how your system response is changing with your current workload.
  • See if this change is expected or unexpected.
  • See if you are maxing out your hardware.
  • Identify the bottleneck.
  • Proactively address performance degradation.

Setting performance tuning goals

Once you have determined the source of the bottleneck, decide whether you need to upgrade hardware, change the architecture or improve indexing. At this point you also need to set performance tuning goals with your customers. Interview them to determine what they feel the most pressing performance problem is, and then tune the system to address those goals. It is important to understand the RDBMS's performance problems before the client interview; otherwise you will be chasing symptoms rather than attacking the problem itself. You will also be in a position to tell the client what you feel is achievable and whether new hardware is required. That said, if you address only what you see as the performance problems, you may or may not solve the problems the clients are seeing. The job is not done until you have solved the performance problem from the client's perspective.

Using performance tuning tools

Once you have determined the bottlenecks and set performance tuning goals, you can begin the process of performance tuning. Tools such as SQL Server Profiler, Performance Monitor (PerfMon) and the SQL Server dynamic management views (DMVs) will enable you to see what is going on in the system. In any RDBMS under load, many processes run simultaneously; some are executing while others wait for resources to become available. DMVs allow you to see which processes are not currently executing and what they are waiting on, and to judge whether those waits are normal. DMVs can also be used to identify problem processes or queries, sometimes even at the statement level, and provide indications of how to solve them. Most frequently, you will be able to solve performance problems by adding or rebuilding indexes, or by rewriting queries.
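The wait analysis described above can be sketched with two standard DMV queries (SQL Server 2005+): one showing what current requests are waiting on right now, and one aggregating waits since the last restart. The column selection is a minimal example:

```sql
-- What is each active request waiting on right now?
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       r.blocking_session_id,
       t.text      AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;  -- exclude this diagnostic session

-- Which wait types have cost the most time since the server restarted?
SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```

When interpreting sys.dm_os_wait_stats, note that some wait types (for example, many of the benign background waits) are always near the top and do not indicate a problem, so the raw ranking needs filtering before drawing conclusions.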

Once this process is completed, capture another baseline to ensure that performance problems have been solved and to provide another performance record that you can measure future workloads against.

By following this methodology, you can be proactive in addressing performance problems as they develop.

About the author
Hilary Cotter has been involved in IT for more than 20 years as a Web and database consultant. Microsoft first awarded Cotter the Microsoft SQL Server MVP award in 2001. He is the author of a book on SQL Server transactional replication and is currently working on books on merge replication and Microsoft search technologies.
