Monitor, Troubleshoot, and Improve
NetApp Storage Performance
By Akshay Bhargava and Mukesh Nigam
In January 2007, Tech OnTap members received early access to a new 30-page technical report on storage performance management and were invited to submit questions and comments to the authors.
Of all the elements in the IT infrastructure, storage is one of the least understood, often resulting in storage systems that are either underconfigured or overconfigured for their dynamic workloads.
This article provides an overview of the three core elements of storage performance management (monitoring, troubleshooting, and solving performance issues) and how to approach each area in a NetApp environment. It also addresses some of the most common questions asked by NetApp customers, including:
- What metrics should I use to track NetApp system performance?
- How do I troubleshoot storage performance problems?
- And, most importantly, how can I correct performance issues?
Performance Monitoring:
Latency Is an Early Indicator of Performance Problems
A little proactive effort can yield big rewards, especially in performance management. The easiest way to know when you’ve got a performance problem is to understand acceptable performance metrics for your applications, establish baselines for key performance metrics, and monitor your environment regularly.
The full set of metrics you choose to monitor may depend on your application, but NetApp recommends monitoring latency as the primary performance indicator.
Latency is a highly reliable indicator of changes in performance and is often one of the first indicators of a resource overload. Other parameters such as throughput and transactions per second are important, but regularly monitoring latency can help you easily detect potential storage bottlenecks before they become a problem. Latency increases that may not yet be large enough to noticeably affect end users or applications often indicate that a storage system is approaching a resource bottleneck.
Command-line tools such as stats show can be used to determine latency, as can applications like Windows® Perfmon or other standard third-party tools that integrate with the Manage ONTAP™ API.
NetApp recommends the Performance Advisor Client provided with NetApp Operations Manager (formerly known as DataFabric® Manager, or DFM). The Performance Advisor Client can provide a convenient, graphical view of latency—or virtually any available performance metric—over time. If you use Operations Manager, we suggest creating a graphical view of latency for each critical volume on your storage system and saving a baseline for each volume with acceptable application performance. (See Figure 1 for an example.) In addition to latencies, you should also establish baselines for CPU utilization and throughput on each network interface (including NICs and/or HBAs).
Figure 1) Volume latency.
For example, if a storage system is servicing a critical database plus other noncritical workloads, you should establish a baseline and monitor latency regularly on every database volume. If you notice latency gradually increasing over time and approaching the threshold you deemed acceptable, you should troubleshoot and take corrective action before the problem becomes critical.
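The comparison against a baseline and an acceptable threshold can be sketched in a few lines of Python. This is only an illustration: the function name, the warn_fraction parameter, and the sample values are assumptions, and the latency samples would have to be collected separately with whatever monitoring tool you use.

```python
from statistics import mean

def latency_status(samples_ms, baseline_ms, threshold_ms, warn_fraction=0.8):
    """Classify recent volume latency against a saved baseline and a threshold.

    warn_fraction is a hypothetical tuning knob: warn once the average has
    covered that fraction of the gap between baseline and threshold.
    """
    recent = mean(samples_ms)
    if recent >= threshold_ms:
        return "critical"
    if recent >= baseline_ms + warn_fraction * (threshold_ms - baseline_ms):
        return "warning"
    return "ok"

# Illustrative values only: a 5 ms baseline and a 20 ms acceptable threshold.
print(latency_status([4.8, 5.1, 5.0], baseline_ms=5.0, threshold_ms=20.0))
print(latency_status([16.5, 17.2, 18.0], baseline_ms=5.0, threshold_ms=20.0))
```

A "warning" result corresponds to the situation described above: latency is still acceptable but is approaching the threshold, so it is time to troubleshoot.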
Even if an end user reports a problem that was not detected through regular storage system monitoring, having good baselines still provides a basis for comparison that can simplify troubleshooting as you try to determine where the bottleneck is in your IT infrastructure.
Troubleshooting:
Rule Out Transient Activities, Then Focus on Utilization
When latency increases, the first step to isolating the cause is to look for transient system activities that might be causing a resource overload. Transient activities such as RAID reconstructions, SnapMirror® transfers, NDMP, and others may affect performance.
If one or more such activities are present:
- See if turning the activities off or throttling them improves performance. (Be sure that turning off or throttling the activities won’t adversely affect critical operations.)
- For one-time or infrequent activities such as RAID reconstruction, consider the tradeoffs. It might be better to live with the performance impact.
- For regular activities such as Snapshot™, SnapMirror, NDMP, and SnapVault®, consider altering the schedule so that the activity occurs when the system is less loaded.
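One way to choose a less loaded time slot for a regular activity is to look at hourly load averages and pick the quietest window. The following Python sketch assumes you already have 24 hourly utilization figures from your own monitoring; the function name and the data are hypothetical.

```python
def quietest_window(hourly_load, duration_hours):
    """Return the start hour (0-23) of the window with the lowest total load.

    hourly_load holds 24 values, one per hour of the day. Windows may wrap
    around midnight. Ties go to the earliest start hour.
    """
    if len(hourly_load) != 24:
        raise ValueError("expected 24 hourly values")
    best_start, best_total = 0, float("inf")
    for start in range(24):
        total = sum(hourly_load[(start + i) % 24] for i in range(duration_hours))
        if total < best_total:
            best_start, best_total = start, total
    return best_start

# Illustrative data: the system is busy all day except between 02:00 and 04:00.
load = [60] * 24
load[2] = 10
load[3] = 15
print(quietest_window(load, duration_hours=2))  # 2
```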
Once you rule out transient activities, the next step is to drill into each system resource, comparing current utilization levels against your established baselines. We suggest checking resources in this order: CPU, disk, then networking. Even if a potential source of a problem has been found, it’s a good idea to continue through the other resources to make sure nothing is overlooked.
Keep in mind, of course, that the storage subsystem is only one component in the IT infrastructure that can impact user/application latency. Other factors, such as network, host OS, or application performance, can also affect latency and should be examined when isolating a performance issue.
Let’s take a look at an example to see how this methodology works.
Case Study: Troubleshooting a CPU Bottleneck
Here’s a troubleshooting example based on a workload that generates sequential, read-intensive I/O. The following baseline data was created using the Performance Advisor Client:
Figure 2) Baseline volume latency and CPU utilization.
Increasing the number of clients created an overload. Note that both read latency and write latency more than doubled relative to the baseline.
Figure 3) Volume latency and CPU utilization with an increased number of clients.
Following the methodology just discussed, the first step is to check for transient activities. None were found, so we moved on to CPU and quickly discovered that the CPU was pegged. Typically you want to keep CPU utilization under 90%. This example consistently exceeds that threshold. Notice the substantial change from the CPU baseline as well.
Don’t forget that CPU overload may be a symptom, not a cause. Some other overloaded system resource could trigger high CPU utilization. For this reason, in this example we went on to check disk and network utilization. The sysstat command showed that total disk I/O utilization was 30%, indicating that disk I/O was unlikely to be a problem. Similarly, network utilization on all NICs and HBAs was well below the limits of the installed hardware, and there were no errors. With no other resource overloaded, we concluded that the CPU was the source of the problem in this case.
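The checks in this case study can be sketched as a short Python routine that walks the resources in the suggested order and keeps going after the first finding. The 90% CPU limit and the 30% disk figure come from the example above; the observed CPU and network values and the disk and network limits are illustrative assumptions, as are all names in the code.

```python
# Order suggested in the article: CPU first, then disk, then networking.
CHECK_ORDER = ("cpu", "disk", "network")

def check_resources(utilization_pct, limits_pct):
    """Check every resource in order and return those at or above their limit.

    All resources are examined even after the first finding, so that a
    second overloaded resource is not overlooked.
    """
    overloaded = []
    for resource in CHECK_ORDER:
        if utilization_pct[resource] >= limits_pct[resource]:
            overloaded.append(resource)
    return overloaded

# CPU limit (90%) and disk utilization (30%) are from the case study.
# The remaining numbers are assumed for illustration.
limits = {"cpu": 90, "disk": 70, "network": 80}
observed = {"cpu": 97, "disk": 30, "network": 40}
print(check_resources(observed, limits))  # ['cpu']
```

Only the CPU is flagged here, which matches the conclusion of the case study. If the returned list also contained another resource, the high CPU utilization might be a symptom rather than the cause.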
Correcting Performance Problems: Upgrading Isn’t the Only Option
Customers often think that if the problem involves CPU overload, their only option is to upgrade to a more capable storage system. There are a number of less drastic corrective actions to consider first:
- Utilize NetApp FlexShare (Data ONTAP 7.2 and higher) to give higher priority to critical workloads and low priority to nonessential workloads
- Spread the workload across multiple storage systems to better balance the load
- Stagger jobs during the course of the day
Upgrading should only be necessary if these options are not feasible at your site. Similar corrective actions exist for other common types of overload.
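As a rough illustration of staggering jobs, the Python sketch below spreads start times evenly across a window instead of launching everything at once. The job names, the window, and the function itself are hypothetical; in practice the start times would also need to respect each job's duration and any dependencies.

```python
from datetime import datetime, timedelta

def stagger(jobs, window_start, window_hours):
    """Assign evenly spaced start times to jobs across the given window."""
    if not jobs:
        return {}
    step = timedelta(hours=window_hours) / len(jobs)
    return {job: window_start + i * step for i, job in enumerate(jobs)}

# Illustrative schedule: three jobs spread over a six-hour window.
schedule = stagger(["backup", "replication", "report"], datetime(2007, 1, 1, 0, 0), 6)
for job, start in schedule.items():
    print(job, start.strftime("%H:%M"))
```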
The Goal: Reliable, Predictable Operations
When it comes to storage system performance, achieving reliable, predictable operation should be your primary goal. Establishing baselines and thresholds for each application and instituting a regular monitoring program are the best things you can do to achieve predictability. With those in place, you can detect performance issues before they become critical, quickly determine whether or not storage is the source of a user-reported performance problem, and troubleshoot any problems that might arise.
Sneak Preview
For detailed information on storage performance monitoring and troubleshooting in NetApp environments, check out Storage Performance Management. This report was originally available only to Tech OnTap members.