TECH ONTAP ARCHIVE - JANUARY 2007 (PDF)
Akshay Bhargava and Mukesh Nigam
Product Engineers, Network Appliance
Akshay Bhargava (left) and Mukesh Nigam (right) are members of the NetApp core system team. Akshay lives and breathes Data ONTAP® and the WAFL® file system. He previously served in a variety of engineering management and software development roles working on distributed computing, clustering, storage virtualization, and network security. Mukesh, a former member of the Data ONTAP kernel and networking engineering team, spends his days (and occasional nights) focused on platforms, performance, and networking. Prior to NetApp, Mukesh held a variety of engineering roles working on storage security, networking, UNIX® internals, and fault tolerance.
Monitor, Troubleshoot, and Improve
NetApp Storage Performance
By Akshay Bhargava and Mukesh Nigam

In January 2007, Tech OnTap members received early access to a new 30-page technical report on storage performance management and were invited to submit questions and comments to the authors.

Of all the elements in the IT infrastructure, storage is one of the least understood, often resulting in storage systems that are either underconfigured or overconfigured for their dynamic workloads.

This article provides an overview of the three core elements of storage performance management—monitoring, troubleshooting, and solving performance issues—and how to approach each area in a NetApp environment. It also addresses some of the most common questions asked by NetApp customers, including:

  • What metrics should I use to track NetApp system performance?
  • How do I troubleshoot storage performance problems?
  • And, most importantly, how can I correct performance issues?

Performance Monitoring:
Latency Is an Early Indicator of Performance Problems

A little proactive effort can yield big rewards, especially in performance management. The easiest way to know when you’ve got a performance problem is to understand acceptable performance metrics for your applications, establish baselines for key performance metrics, and monitor your environment regularly.

The full set of metrics you choose to monitor may depend on your application, but NetApp recommends monitoring latency as the primary performance indicator.

Latency is a highly reliable indicator of changes in performance and is often one of the first indicators of a resource overload. Other parameters such as throughput and transactions per second are important, but regularly monitoring latency can help you easily detect potential storage bottlenecks before they become a problem. Latency increases that may not yet be large enough to noticeably affect end users or applications often indicate that a storage system is approaching a resource bottleneck.

Command-line tools such as stats show can be used to determine latency, as can applications like Windows® Perfmon or other standard third-party tools that integrate with the Manage ONTAP™ API.

NetApp recommends the Performance Advisor Client provided with NetApp Operations Manager (formerly known as DataFabric® Manager, or DFM). The Performance Advisor Client can provide a convenient, graphical view of latency—or virtually any available performance metric—over time. If you use Operations Manager, we suggest creating a graphical view of latency for each critical volume on your storage system and saving a baseline for each volume with acceptable application performance. (See Figure 1 for an example.) In addition to latencies, you should also establish baselines for CPU utilization and throughput on each network interface (including NICs and/or HBAs).

Figure 1) Volume latency.

For example, if a storage system is servicing a critical database plus other noncritical workloads, you should establish a baseline and monitor latency regularly on every database volume. If you notice latency gradually increasing over time and approaching the threshold you deemed acceptable, you should troubleshoot and take corrective action before the problem becomes critical.
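The baseline-and-threshold check described above can be sketched in a few lines of Python. This is only an illustration, not a NetApp tool: the function name `latency_status`, the `warn_fraction` parameter, and the sample values are assumptions, and the latency samples are presumed to have been collected already by whatever monitoring tool you use.

```python
from statistics import mean


def latency_status(samples_ms, baseline_ms, threshold_ms, warn_fraction=0.8):
    """Classify recent volume latency against a baseline and an acceptable threshold.

    samples_ms    -- recent latency samples for one volume, in milliseconds
    baseline_ms   -- latency recorded when application performance was acceptable
    threshold_ms  -- highest latency you consider acceptable for the application
    warn_fraction -- hypothetical tuning knob: how far from baseline toward the
                     threshold latency may drift before a warning is raised
    """
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    current = mean(samples_ms)
    # Warning level sits part of the way between the baseline and the threshold.
    warn_level = baseline_ms + warn_fraction * (threshold_ms - baseline_ms)
    if current >= threshold_ms:
        return "critical"
    if current >= warn_level:
        return "warning"
    return "ok"
```

With a 5 ms baseline and a 20 ms threshold, samples averaging 18 ms would be reported as "warning", which is the point at which the article suggests you begin troubleshooting.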

Even if an end user reports a problem that was not detected through regular storage system monitoring, having good baselines still provides a basis for comparison that can simplify troubleshooting as you try to determine where the bottleneck is in your IT infrastructure.

Troubleshooting:
Rule Out Transient Activities, Then Focus on Utilization

When latency increases, the first step to isolating the cause is to look for transient system activities that might be causing a resource overload. Transient activities such as RAID reconstructions, SnapMirror® transfers, NDMP, and others may affect performance.

If one or more such activities are present:

  • See if turning the activities off or throttling them improves performance. (Be sure that turning off or throttling the activities won’t adversely affect critical operations.)
  • For one-time or infrequent activities such as RAID reconstruction, consider the tradeoffs. It might be better to live with the performance impact.
  • For regular activities such as Snapshot™, SnapMirror, NDMP, and SnapVault®, consider altering the schedule so that the activity occurs when the system is less loaded.
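One way to choose a new schedule for regular activities is to look at historical load by hour and pick the quietest periods. The sketch below assumes you already have an average utilization figure for each hour of the day; the function name `quietest_hours` and the sample data are hypothetical.

```python
def quietest_hours(hourly_load, count=1):
    """Return the `count` hours with the lowest average load.

    hourly_load -- dict mapping hour of day (0-23) to average utilization (%)
    Ties keep the order in which the hours appear in the dict.
    """
    return sorted(hourly_load, key=hourly_load.get)[:count]


# Illustrative data only: average CPU utilization observed at selected hours.
sample_load = {0: 20, 1: 15, 12: 80, 13: 85, 23: 25}
candidate_hours = quietest_hours(sample_load, count=2)
```

Here the two candidate hours are 1 and 0, so a nightly SnapMirror or NDMP job would be moved toward that window, provided the change does not conflict with backup or replication requirements.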

Once you rule out transient activities, the next step is to drill into each system resource, comparing current utilization levels against your established baselines. We suggest checking resources in this order: CPU, disk, then networking. Even if you find a potential source of the problem, it’s a good idea to continue through the remaining resources to make sure nothing is overlooked.
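The ordered comparison against baselines might look like the following sketch. The 25% tolerance, the dictionary layout, and the name `resources_over_baseline` are assumptions made for illustration; the only parts taken from the methodology are the CPU, disk, network order and the rule that every resource is checked even after one looks suspicious.

```python
# Order recommended in the article for drilling into resources.
CHECK_ORDER = ("cpu", "disk", "network")


def resources_over_baseline(current, baseline, tolerance=0.25):
    """Return every resource whose utilization exceeds its baseline by more
    than `tolerance` (a fraction of the baseline), in the recommended order.

    current, baseline -- dicts mapping resource name to utilization (%)
    """
    findings = []
    for name in CHECK_ORDER:
        limit = baseline[name] * (1 + tolerance)
        if current[name] > limit:
            findings.append(name)
        # No early exit: the remaining resources are still checked.
    return findings
```

For example, with baselines of 50% CPU, 28% disk, and 38% network, current readings of 95%, 30%, and 40% would flag only `"cpu"`.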

Keep in mind, of course, that the storage subsystem is only one component in the IT infrastructure that can impact user/application latency. Other factors, such as network, host OS, or application performance, can also affect latency and should be examined when isolating a performance issue.

Let’s take a look at an example to see how this methodology works.

Case Study: Troubleshooting a CPU Bottleneck

Here’s a troubleshooting example based on a workload that generates sequential, read-intensive I/O. The following baseline data was created using the Performance Advisor Client:

Figure 2) Baseline volume latency and CPU utilization.

Increasing the number of clients created an overload. Note that both read latency and write latency more than doubled relative to the baseline.

Figure 3) Volume latency and CPU utilization with an increased number of clients.

Following the methodology just discussed, the first step is to check for transient activities. None were found, so we moved on to CPU and quickly discovered that the CPU was pegged. Typically you want to keep CPU utilization under 90%. This example consistently exceeds that threshold. Notice the substantial change from the CPU baseline as well.

Don’t forget that CPU overload may be a symptom, not a cause. Some other overloaded system resource could trigger high CPU utilization. For this reason, in this example we went on to check disk and network utilization. The sysstat command showed that total disk I/O utilization was 30%, indicating that disk I/O was unlikely to be a problem. Similarly, network utilization on all NICs and HBAs was well below the limits of the installed hardware, and there were no errors. With no other resource overloaded, we concluded that the CPU was the source of the problem in this case.
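The "consistently exceeds 90%" test used in this case study can be expressed as a simple check over a series of utilization samples. This is a sketch under stated assumptions: the 90% limit comes from the article, while the function name `consistently_above` and the `min_fraction` parameter (how many samples must be over the limit to count as "consistent") are hypothetical.

```python
def consistently_above(samples, limit=90.0, min_fraction=0.9):
    """Return True if at least `min_fraction` of the samples exceed `limit`.

    samples -- utilization readings in percent (for example, CPU busy %)
    """
    if not samples:
        raise ValueError("at least one sample is required")
    over = sum(1 for s in samples if s > limit)
    return over / len(samples) >= min_fraction
```

A short spike would not satisfy this check, but a CPU that stays pegged across the sampling window would, which matches the pattern shown in Figure 3.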

Correcting Performance Problems: Upgrading Isn’t the Only Option

Customers often think that if the problem involves CPU overload, their only option is to upgrade to a more capable storage system. There are a number of less drastic corrective actions to consider first:

  • Utilize NetApp FlexShare (Data ONTAP 7.2 and higher) to give higher priority to critical workloads and low priority to nonessential workloads
  • Spread the workload across multiple storage systems to better balance the load
  • Stagger jobs during the course of the day

Upgrading should only be necessary if these options are not feasible at your site. Similar corrective actions exist for other common types of overload.

The Goal: Reliable, Predictable Operations

When it comes to storage system performance, achieving reliable, predictable operation should be your primary goal. Establishing baselines and thresholds for each application and instituting a regular monitoring program are the best things you can do to achieve predictability. With those in place, you can detect performance issues before they become critical, quickly determine whether or not storage is the source of a user-reported performance problem, and troubleshoot any problems that might arise.

Sneak Preview

Report Excerpt: Storage Performance Management Best Practices Table of Contents. To read more visit http://www.netapp.com/news/techontap/Storage_Perf_Mgmt.html

For detailed information on storage performance monitoring and troubleshooting in NetApp environments, check out Storage Performance Management. This report was originally available only to Tech OnTap members.

RELATED INFORMATION

Performance Metrics in Data ONTAP

Data ONTAP has always maintained a wide variety of performance counters for monitoring and troubleshooting. To make these counters more accessible, NetApp provides a Counter Manager layer within Data ONTAP. Counter Manager can be easily queried by Manage ONTAP.

NetApp Counter Manager


Manage ONTAP is a collection of application programming interfaces (APIs) for the Data ONTAP operating system. These APIs are used by Operations Manager and AutoSupport and also provide open access to NetApp performance metrics for integration between NetApp solutions and partner solutions, as well as simplified integration with in-house applications.

Manage ONTAP is exposed within Data ONTAP through a variety of interfaces, including SNMP, CLI, RPC, NDMP, and Data ONTAP APIs, so it is relatively easy for your in-house application to monitor important storage metrics.

The Windows Perfmon tool built into Microsoft® Windows can also be used to monitor NetApp performance metrics through calls to the Windows Perfmon support module of Data ONTAP.

FlexShare in Action

The following graph shows how a high-priority application can benefit from FlexShare:

Priority On vs. Priority Off


To perform this test:

  • A single NetApp system was configured with two identical volumes.
  • SIO (see related article) was used to simulate an OLTP workload against each volume (R/W 40:60, random: 70%, block size: 8KB).

With FlexShare off, each volume delivered a workload of about 6,000 disk ops per second for a total of 12,000. Priority on one volume was then set to "VeryHigh," while the other was set to "VeryLow."

  • The high-priority volume climbed to an average of around 9,000 ops, 50% more than without FlexShare and 3X the low-priority volume.
  • The low-priority volume dropped to an average of about 3,000 ops.

Total operations remained at 12,000, indicating that FlexShare has no significant system overhead.
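The figures quoted for this test can be checked with a little arithmetic. The variable names below are illustrative; the numbers are the approximate values reported above.

```python
# Approximate disk ops per second reported for the FlexShare test.
baseline_ops = 6000   # each volume with FlexShare off
high_ops = 9000       # "VeryHigh" volume with FlexShare on
low_ops = 3000        # "VeryLow" volume with FlexShare on

# Relative gain for the high-priority volume: (9000 - 6000) / 6000 = 0.5 (50%).
gain = (high_ops - baseline_ops) / baseline_ops

# High-priority throughput relative to low-priority: 9000 / 3000 = 3.0 (3X).
ratio = high_ops / low_ops

# Total throughput before and after enabling FlexShare: 12,000 in both cases.
total_off = 2 * baseline_ops
total_on = high_ops + low_ops
```

Because the total is unchanged, the result is consistent with FlexShare redistributing the available throughput rather than consuming a noticeable share of it.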

Read the FlexShare Design and Implementation Guide. (pdf)

Storage Monitoring Tools

Having the right tools can dramatically simplify the process of monitoring your storage systems. NetApp Operations Manager is perfectly suited to aid capacity monitoring and planning. A guided tour is available.

There are also a variety of free applications and scripts that many customers find helpful. Tools recommended by Tech OnTap members include:
