A-SIS: Deduplication Comes of Age
By Blake Lewis
Everyone knows that the capacity of storage systems is going up at a breathtaking pace. In the last 10 years, NetApp has gone from shipping storage systems with tens of gigabytes to hundreds of terabytes, an astonishing 10,000-fold increase. Most businesses, however, find that their appetite for storage has grown even faster. On top of the cost of disk or tape to store all this data, data center space and power are increasingly expensive. Using storage as efficiently as possible is therefore a critical objective.
NetApp has long been an industry leader in efficient storage utilization, from its unique incremental-only Snapshot™ technology, which requires minimal disk space to store hundreds of Snapshot copies, to FlexVol® technology, which enables sysadmins to expand and contract volumes on the fly.
In May, NetApp announced a new deduplication technology that can significantly increase the amount of data stored in a set amount of disk space: Advanced Single Instance Storage (A-SIS) deduplication. This technology is available (at no charge!) for NetApp NearStore® R200 and NearStore on FAS systems.
Deduplication improves efficiency by finding identical blocks of data and replacing them with references to a single shared block. The same block of data can belong to several different files or LUNs, or it can appear repeatedly within the same file. A-SIS deduplication is an integral part of the NetApp WAFL file system, which manages all storage on NetApp FAS systems. As a result, deduplication works "behind the scenes," regardless of what applications you run or how you access the data, and its overhead is low.
How much space can you save? It depends on the data set and the amount of duplication it contains. Here are a few examples of the savings that NetApp customers have seen:
- A global oil and gas company achieved a 35% space savings for its home directory storage.
- An investment management company reduced the space used by backup copies of its VMware images by 90%.
- A test and measurement manufacturer realized a 98% space savings on daily database backups.
How A-SIS Deduplication Works
At its heart, A-SIS deduplication relies on the time-honored computer science technique of reference counting. Previously, WAFL kept track only of whether a block was free or in use. With A-SIS deduplication, it also keeps track of how many uses there are. In the current implementation, a single WAFL block can be referenced up to 256 times in different files or within the same file. Files don't "know" that they are sharing their data – bookkeeping within WAFL takes care of the details invisibly.
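As a rough illustration of reference counting, here is a minimal Python sketch of a block store that tracks how many users each block has. It is not WAFL code: the `BlockStore` class, its method names, and the 4,096-byte block size are assumptions made for the example, and only the 256-reference limit comes from the text above.

```python
BLOCK_SIZE = 4096  # illustrative block size (assumption for this sketch)


class BlockStore:
    """Toy block allocator that keeps a reference count per block."""

    MAX_REFS = 256  # sharing limit quoted in the article

    def __init__(self):
        self.data = {}       # block number -> bytes
        self.refs = {}       # block number -> number of users
        self.next_block = 0

    def allocate(self, payload):
        """Store new data in a fresh block with one reference."""
        blk = self.next_block
        self.next_block += 1
        self.data[blk] = payload
        self.refs[blk] = 1
        return blk

    def share(self, blk):
        """Add one more user of an existing block."""
        if self.refs[blk] >= self.MAX_REFS:
            raise ValueError("block has reached its sharing limit")
        self.refs[blk] += 1
        return blk

    def release(self, blk):
        """Drop one user; free the block when nobody uses it any more."""
        self.refs[blk] -= 1
        if self.refs[blk] == 0:
            del self.refs[blk]
            del self.data[blk]


store = BlockStore()
blk = store.allocate(b"x" * BLOCK_SIZE)
file_a = [blk]               # file A points at the block
file_b = [store.share(blk)]  # file B points at the same block
print(store.refs[blk])       # two references, one stored copy
```

Neither "file" holds anything except a block number, which is the sense in which files do not know they are sharing data. The block is freed only when the last reference is released.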
How does WAFL decide that two blocks can be shared? The answer is that for each block, it computes a "fingerprint," which is a hash of the block's data. Two blocks that have the same fingerprint are candidates for sharing.
When A-SIS deduplication is enabled on a volume, it computes a database of fingerprints for all of the in-use blocks in the volume (a process known as "gathering"). Once this initial setup is finished, the volume is ready for deduplication.
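The gathering step can be pictured with a short Python sketch. The `fingerprint` and `gather` functions are hypothetical names, and `zlib.crc32` is only a stand-in hash, since the article does not say which function WAFL uses. The volume is modeled as a plain dictionary of block numbers to data.

```python
import zlib
from collections import defaultdict


def fingerprint(block):
    # Stand-in hash for illustration only; not the real WAFL checksum.
    return zlib.crc32(block)


def gather(volume):
    """Build fingerprint -> [block numbers] for every in-use block."""
    db = defaultdict(list)
    for blk_no in sorted(volume):
        db[fingerprint(volume[blk_no])].append(blk_no)
    return dict(db)


# Three in-use blocks; blocks 0 and 2 hold the same data.
volume = {0: b"A" * 4096, 1: b"B" * 4096, 2: b"A" * 4096}
fingerprint_db = gather(volume)

# Fingerprints seen more than once mark candidates for sharing.
candidates = [blks for blks in fingerprint_db.values() if len(blks) > 1]
print(candidates)
```

A matching fingerprint only makes blocks candidates. Whether they really hold the same data still has to be checked before anything is shared.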
To avoid slowing down ordinary file operations, the search for duplicates is done as a separate batch process. As the file system gets updated during normal use, WAFL creates a log describing the changes to its data blocks. This log accumulates until one of the following occurs:
- The administrator issues a sis start command
- The next time specified in the sis config schedule occurs
- The changes to the log exceed a predetermined threshold
Any of these events will trigger the deduplication process. Once the deduplication process is started, A-SIS sorts the log using the fingerprints of the changed blocks as a key, and then merges the sorted list with the fingerprint database file. Whenever the same fingerprint appears in both lists, there are possibly identical blocks that can be collapsed into one. In this case, WAFL can discard one of the blocks and replace it with a reference to the other block. Since the file system is changing all the time, we of course can take this step only if both blocks are really still in use and contain the same data.
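The sort-and-merge pass described above might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `dedup_pass` function, the tuple layout of the log and database, and the use of `zlib.crc32` as the fingerprint. The sketch also simplifies by comparing each block only with the most recent surviving block that has the same fingerprint.

```python
import heapq
import zlib


def fingerprint(block):
    # Stand-in checksum; the article does not name the real one.
    return zlib.crc32(block)


def dedup_pass(volume, fingerprint_db, change_log):
    """One batch pass over the change log.

    volume         -- block number -> bytes, in-use blocks only
    fingerprint_db -- sorted list of (fingerprint, block number)
    change_log     -- unsorted list of (fingerprint, block number)
    Returns (remap, merged_db); remap maps each duplicate block to
    the block that survives.
    """
    sorted_log = sorted(change_log)      # sort the log by fingerprint
    remap, merged_db = {}, []
    keeper = None                        # last surviving (fingerprint, block)
    for fp, blk in heapq.merge(fingerprint_db, sorted_log):
        if blk not in volume:            # freed since it was logged
            continue
        if (keeper is not None and keeper[0] == fp
                and volume[keeper[1]] == volume[blk]):
            remap[blk] = keeper[1]       # same data: share the keeper
            continue
        keeper = (fp, blk)
        merged_db.append(keeper)
    return remap, merged_db


volume = {0: b"A" * 4096, 1: b"B" * 4096, 2: b"A" * 4096, 3: b"C" * 4096}
db = sorted((fingerprint(volume[b]), b) for b in (0, 1))
# Blocks 2 and 3 are new; block 9 was logged but has since been freed.
log = [(fingerprint(volume[b]), b) for b in (3, 2)] + [(12345, 9)]
remap, db = dedup_pass(volume, db, log)
print(remap)
```

Because both inputs are sorted by fingerprint, a single linear merge finds every matching pair. The "still in use" check and the data comparison are the two safeguards the paragraph above calls for.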
The implementation of A-SIS deduplication takes advantage of some special features of WAFL to minimize the cost of deduplication. NetApp discovered a long time ago that to ensure the integrity of data stored on disk, a belt-and-suspenders approach is warranted. (In fact, several pairs of suspenders is best.) Accordingly, every block of data on disk is protected with a checksum.
A-SIS uses this checksum as its fingerprint. Since we were going to compute it anyway, we get it "for free" – there is no additional load on the system. And since WAFL never overwrites a block of data that is in use, fingerprints remain valid until the block gets freed. The tight integration of A-SIS deduplication with WAFL also means that change logging is an efficient operation. The upshot is that A-SIS deduplication can be used with a wide range of workloads, not just for backups, as has been the case with other deduplication implementations.
What Sorts of Environments Are Good Candidates for A-SIS?
In the first place, your data should be fairly long-lived. There isn't much point in working hard to find duplicates if you are going to be changing the data soon. The system should also have some CPU headroom. Change logging and fingerprint matching are designed for efficiency, but nothing is free. If your system spends long periods at high CPU utilization, the extra load that deduplication brings could be the last straw.
Other Approaches for Saving Disk Space
NetApp offers several other ways to use disk space more efficiently, each with its pluses and minuses. It isn't necessary to pick just one; for the most part, they can all be used in conjunction.
Snapshot Copies
From the beginning, WAFL has allowed block sharing through Snapshot technology. As a file changes over time, you can capture several versions of it using Snapshot copies, and the storage cost is just equal to the amount of change between versions.
Snapshot copies have proven their value both as a feature in their own right and as the basis for applications such as SnapVault® and SnapMirror®. In WAFL, they come for free as far as performance is concerned. Their main limitation is that they can provide block sharing only between different versions of the same file, unlike A-SIS, which shares duplicate blocks between different files.
Incidentally, for readers who haven't used NetApp storage before: the NetApp "incremental-only" approach to Snapshot copies is unique among all major storage vendors. It is the fundamental technology behind our SnapVault and SnapMirror products, and the main reason for their success.
Compression
Compressing data before it is written to disk is a good way to save space. Algorithms such as gzip can cut the size of a file in half or more, and they work even if there is no duplicated data to share. The drawback of compression is that it is CPU-intensive. Also, some types of data, such as images, are already compressed and get no benefit. Because A-SIS deduplication can collapse hundreds of copies of the data into one, it has the potential for much greater savings than compression in environments with lots of duplication.
NetApp currently offers compression in its Decru® and VTL products.
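To make the comparison between compression and block sharing concrete, here is a small Python sketch of the arithmetic. The numbers are hypothetical: 100 identical copies of one 4,096-byte block of synthetic data, each copy compressed on its own with `zlib`. It ignores the bookkeeping overhead of references, so treat it as a back-of-the-envelope model rather than a measurement.

```python
import random
import zlib

BLOCK_SIZE = 4096  # illustrative size (assumption)
COPIES = 100       # hypothetical number of identical copies

rng = random.Random(0)
# Compressible but not trivial: each byte takes one of 16 values.
payload = bytes(rng.randrange(16) for _ in range(BLOCK_SIZE))

# No space saving at all: every copy is stored in full.
raw_total = COPIES * len(payload)

# Compressing each copy on its own still stores every copy.
compressed_total = COPIES * len(zlib.compress(payload))

# Sharing identical data stores a single copy (reference overhead ignored).
shared_total = len(payload)

print(raw_total, compressed_total, shared_total)
```

Compression shrinks each copy by a fixed ratio, while sharing removes whole copies, so the gap widens as the number of duplicates grows.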
Content-Addressable Storage (CAS)
Although the implementation is usually quite different, content-addressable storage is conceptually similar to A-SIS deduplication. A "blob" of data gets hashed, and the hash value is used to identify it. Only one copy of data with a given hash value is stored. A file can consist of a number of blobs.
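The idea can be sketched in a few lines of Python. The `ContentStore` class and its `put` and `get` methods are hypothetical, and SHA-256 is chosen only for the example; real CAS products differ in their hash functions and in how they split files into blobs.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: blobs are found by their hash."""

    def __init__(self):
        self.blobs = {}  # hex digest -> blob bytes

    def put(self, blob):
        """Store a blob and return the key that identifies it."""
        key = hashlib.sha256(blob).hexdigest()
        self.blobs.setdefault(key, blob)  # keep one copy per hash value
        return key

    def get(self, key):
        return self.blobs[key]


store = ContentStore()
# A "file" is just a list of blob keys.
report_v1 = [store.put(b"header"), store.put(b"body text")]
report_v2 = [store.put(b"header"), store.put(b"revised body")]

print(len(store.blobs))  # the shared header blob is stored once
print(b"".join(store.get(k) for k in report_v2))
```

Note that `put` trusts the hash completely: it never compares the new blob with the one already stored under the same key.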
In one way, CAS is more flexible than A-SIS deduplication, since CAS blobs do not need to be whole file system blocks. However, in a very important way, CAS is less flexible. With A-SIS deduplication, WAFL can share blocks using fingerprints as keys, but its basic data structures remain unchanged and the sharing is invisible (and of course, you can always turn A-SIS deduplication off). By contrast, in most CAS implementations, blobs are always found through their hash keys. This makes it hard to get good performance, with the result that CAS is generally used for write-mostly archival applications and not for applications that require a quick response to bursts of reads, such as e-discovery and data recovery.
One aspect of CAS that sometimes sparks controversy is that it considers two blobs to be identical if they have the same hash key. If two different blobs happen to hash to the same value, data is lost. This is known as a "hash collision" or a "false positive." There are good statistical arguments for why such an event is highly unlikely, but many people still feel uneasy. A-SIS deduplication takes a conservative approach in this regard, and shares blocks only if their contents (and not just their fingerprints) are identical. Before deleting a block as a duplicate, A-SIS does a byte-by-byte comparison to make sure that the data is indeed the same.
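The difference between trusting a hash and verifying the data can be shown with a deliberately weak fingerprint. In this Python sketch, `weak_fingerprint` and `can_share` are hypothetical names, and the byte-sum checksum is chosen only because it makes collisions easy to produce; it is not the checksum WAFL uses.

```python
def weak_fingerprint(block):
    # Deliberately weak checksum so that a collision is easy to show.
    return sum(block) % 65536


def can_share(a, b):
    """Share only if the fingerprints match AND the bytes are identical."""
    return weak_fingerprint(a) == weak_fingerprint(b) and a == b


block_1 = b"ab" * 2048
block_2 = b"ba" * 2048  # same bytes in a different order: same sum
block_3 = b"ab" * 2048  # a true duplicate of block_1

print(weak_fingerprint(block_1) == weak_fingerprint(block_2))  # collision
print(can_share(block_1, block_2))  # rejected by the byte comparison
print(can_share(block_1, block_3))  # accepted
```

With the comparison in place, a fingerprint collision costs only a wasted check. It can never cause two different blocks to be merged.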
Conclusion
A-SIS deduplication leverages the unique characteristics of WAFL to conserve disk space while keeping system overhead low. In many environments, the space savings can be substantial. Even in primary storage applications, such as a home directory environment, A-SIS deduplication can often produce significant savings.
Just as with NetApp Snapshot technology, the A-SIS deduplication machinery will almost certainly provide the basis for interesting new applications in the future (cloning a file, for instance). It's an exciting development in the ongoing evolution of WAFL.