Six Tips for Archive and Compliance Planning
by Mike Riley
What would you do if someone slid a 5¼" disk across the table and asked you to open the WordStar document on it? I realize I'm addressing the Tech OnTap audience, but let's face it: those of us who even remember WordStar are an ever-shrinking minority. The irony is that WordStar breathed its last as a commercial technology only in 1998 or so, barely eight years ago.
The point is that there's a lot more to long-term data storage than just squirreling away data. You've got to think about how the data is stored, what format the data is stored in, and what it will take to read it back. Time works against you when you're talking about storing data for three, five, seven, 30, or more years.
I've spoken with my fair share of NetApp customers about archive and compliance over the past seven years, and the main advice I give them is to stay nimble. "Be nimble" ranks right up there with "don't worry; be happy" when it comes to technical advice, but the point is that you don't want to make decisions now that lock you into proprietary technologies or limit your future options.
Here are six things to keep in mind as you design an infrastructure to cope with long-term data storage for archive and compliance.
Tip 1: Avoid Storage Silos
Long-term archival is largely driven by rapidly growing volumes of e-mail and unstructured files, while compliance must satisfy external regulatory requirements and enable litigation support and e-discovery. Whether you're dealing with archive or compliance planning, many of the storage infrastructure requirements are the same.
In fact, from a storage perspective there are really only two critical differentiators:
- Archive is driven by a need for mailbox management. This typically involves leveraging tiered storage to move e-mail from the primary e-mail server storage to lower-cost secondary storage.
- Compliance adds an absolute requirement for data permanence that can cast every misstep, every wrong decision, and every virus in stone for years to come.
Because the underlying storage requirements are similar, wherever possible you should try to meet long-term storage requirements with the same basic infrastructure. Deploying separate silos of storage for archive and compliance distinct from your enterprise storage adds management complexity and limits your ability to adapt flexibly to future needs.
Tip 2: Avoid Proprietary Data Formats and Technologies
Think about that WordStar document again for a minute. It's bad enough that it was written by an application that hasn't been active for at least eight years, but what about that 5¼" floppy? What format is that in? Where can you find a drive that can read it? How can you get the file onto some medium where you can even begin to work with it?
I'm not suggesting that the 5¼" floppy was ever considered a good medium for archival storage, but it's far from the only storage device that ever passed quickly from the mainstream into obsolescence. When you evaluate storage solutions for archive and compliance you need to focus on open, standards-based solutions.
Here are a few critical questions to ask:
- What format is used to store the data? You want your file data to be stored in its native format as much as possible. The more software layers that separate you from your data, the more trouble you could have in the future.
- How difficult will it be to migrate to new storage? Let's face it: if you're going to store data for years, you're eventually going to have to migrate to new storage. Data stored in a proprietary format may make that future migration more complicated.
- What access protocols are supported? If you can only access your data using one protocol, you're locked in. You want the option of using multiple standard protocols.
- What archive and compliance applications are supported, and how are they integrated? A good long-term solution must support the major applications for e-mail archival, compliance, and so on. Beyond that, how are the applications integrated? Does the storage system have proprietary APIs that the application must follow, or does it use standard access protocols? Proprietary APIs make it harder for application vendors to support a particular platform.
In an enterprise setting where by law you must lock down data for years, the best way to preserve flexibility is to keep the data in its original format as much as possible and make it accessible using standard protocols. Given how far security architectures such as Active Directory have evolved over the past five years, I also suggest that the compliance solution integrate natively with that security model. This gives you file-level control at the individual user level. The more rigid your solution, the more time becomes your enemy.
At this point you may be thinking, "Of course NetApp would say this is the best solution because it's what NetApp offers." In any evaluation, you want to pull the vendor tag off of the solution and look at the data. We're not reading tea leaves, nor are we splitting the atom. With the vendor religion eliminated from the discussion, the idea in archive and compliance planning is to meet (or exceed) regulatory and corporate governance demands and yet prepare yourself for the fact that laws can change, software evolves (or becomes extinct), and sometimes you change your mind. You want to be able to adapt without radically changing your architecture. If you work from there, the solutions start to fall into place.
Tip 3: Don't Consciously Limit Your Options
There's one thing I can tell you about any technology forecast: it's wrong. How wrong it is will determine the price you pay for following it. When you're dealing with data that must be kept for years, limiting your options from the start is the wrong strategy. Based on your environment and requirements, you might decide not to implement certain features, but that's different from closing doors or leaving off features entirely.
You know what you expect of your storage solutions today: how to back them up, how to replicate them, how to secure and protect them, and how they should perform in different business-critical situations. How about next year? What about 10 years from now?
Another consideration involves the resiliency options you use today and what you might need in the future. If a solution includes ATA disks (which are considered less reliable than Fibre Channel disks), will standard RAID be enough, or is double-parity RAID a necessity? If you need double-parity RAID, can the solution you choose deliver the performance you need now and in the future?
It's easy to rationalize options away, but it's extremely difficult to bolt on features after the fact. Search tools and SRM reporting can be added on quickly with minimal pain, but you may not be so lucky with other features. Don't be too ready to give up the features you've gotten used to in your existing storage environment. A solution that can't be backed up using your current enterprise backup strategy may not make much sense. The same goes for ACLs and directory services.
Tip 4: Plan for Performance
I think it's fair to say that no one in the storage industry thought that performance would be a big factor in disk-based archive and compliance storage. IT departments were mainly replacing tape and optical solutions, so any disk-based solution would be a marked improvement, right? The main consideration was providing inexpensive but reliable storage.
Then the market changed. There was a geometric increase in the number of audit and discovery requests (facilitated by all this data being on a random access device), and the scope expanded beyond e-mail to include home directories, instant messages, digital images, phone records, database records—you name it. Ingestion rates and queries skyrocketed. In effect, we had changed our tape and optical replacement strategy into a data warehouse. Who knew?!
For e-mail archival you have to guarantee performance for a number of simultaneous activities. End users need adequate performance when browsing archived e-mail even when a discovery search is under way. Discovery searches or other activities can't get in the way of ongoing archive processes, and when it comes time to expire and delete messages, the storage system has to support rapid deletion (potentially hundreds of thousands of messages a day) along with everything else.
You can put this under the "I'd rather be lucky than good" category, but having the ability to mix drive types and move up to more powerful controllers for greater horsepower worked in the favor of NetApp customers. Today, we have a number of customers turning to us as an alternative simply because the market changed, performance became an issue, and they needed a new strategy.
Tip 5: Utilize Storage Virtualization
A technology that has gained a lot of interest recently is storage virtualization. The appeal is clear—what's not to like about the idea of having a storage pool of infinite bigness?
When it comes to archive and compliance, look for virtualization technologies that offer you the best of both worlds: a pool of infinite bigness and the granular control of file systems.
Having a large storage pool certainly simplifies an archival architecture. This type of storage virtualization makes it easy to add additional storage where it's needed and adjust the size of volumes as necessary to adapt to changing policies. For instance, you may decide that you need to retain archived e-mail for a longer period and thus increase the size of volumes dedicated for that purpose.
But—as the saying goes—"archival is worthless; discovery is priceless." Having granular access and control of the data is critical because the meter starts running when a discovery request is issued. If external auditors and lawyers want to see results by the end of the week, your lawyers will want to see that information tomorrow. Would you rather run your initial search through 100TB or 10PB?
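To put rough numbers on that question, here is a back-of-the-envelope sketch. The sustained read rate of 1 GB/s is purely an assumption for illustration (real discovery searches usually rely on indexes, and real throughput varies widely), and the function name is hypothetical. The point is only that a full pass scales linearly with the size of the pool you have to search.

```python
# Back-of-the-envelope scan times. The throughput figure is an assumption,
# not a measurement of any particular storage system.
TB = 10**12
PB = 10**15

def scan_hours(size_bytes: int, throughput_bytes_per_sec: float) -> float:
    """Hours needed to read size_bytes once at a sustained throughput."""
    return size_bytes / throughput_bytes_per_sec / 3600

ASSUMED_THROUGHPUT = 1 * 10**9  # hypothetical: 1 GB/s sustained read rate

small = scan_hours(100 * TB, ASSUMED_THROUGHPUT)
large = scan_hours(10 * PB, ASSUMED_THROUGHPUT)

print(f"100 TB: {small:.1f} hours")      # 100 TB: 27.8 hours
print(f"10 PB: {large / 24:.1f} days")   # 10 PB: 115.7 days
```

Under that assumption, the smaller pool can be read end to end in a little over a day, while the larger one takes months. Whatever the real throughput is, the 100x ratio between the two stays the same.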
Having a granular look at the data can also provide flexibility if you need to streamline backup operations or balance replication schedules. Rather than attempting to cope with one massive "junk drawer" of information, you can focus on levels of granularity that make sense.
Tip 6: Think Carefully about Encryption
Recent compromises of credit card and other personal data reported in the news media have made companies keenly aware of the need to protect data from prying eyes. FBI surveys suggest that 50–80% of attacks come from inside the firewall, making encryption the best option to secure valuable trade secrets, financial data, and customer records. Even where compliance is not mandated, companies are discovering the importance of encryption. Most industry experts agree that encryption is likely to become a long-term trend for e-mail and e-mail archives because of the intellectual property and confidential information they contain.
Because security is particularly important for compliance data, many customers want to leverage encryption there. One potential problem with encryption is that, depending on the encryption used, identical blocks of input data (plaintext) may encrypt to different blocks of output data (ciphertext). If you have two files that are exactly the same going in, they'll be different coming out. This can significantly impact single instance storage strategies and potentially increase storage requirements.
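The effect on single instance storage can be seen in a small sketch. It assumes the duplicate detection works by comparing content hashes, which is a common approach but not necessarily how any given product does it. The `toy_encrypt` function is a hypothetical, deliberately simplified stand-in for a randomized cipher and must not be used to protect real data.

```python
import hashlib
import os

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy randomized stream cipher, for illustration only. NOT secure."""
    nonce = os.urandom(16)  # fresh random nonce on every call
    stream = b""
    counter = 0
    # Build a keystream at least as long as the plaintext.
    while len(stream) < len(plaintext):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        stream += block
        counter += 1
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    return nonce + body

def fingerprint(data: bytes) -> str:
    """Content hash of the kind a single-instance store might use (assumption)."""
    return hashlib.sha256(data).hexdigest()

key = os.urandom(32)
message = b"Quarterly results attached." * 100

# Two identical plaintext copies collapse to a single fingerprint.
plain_ids = {fingerprint(message), fingerprint(message)}

# The same message encrypted twice yields two different fingerprints,
# because each encryption uses its own random nonce.
cipher_ids = {
    fingerprint(toy_encrypt(key, message)),
    fingerprint(toy_encrypt(key, message)),
}

print(len(plain_ids), len(cipher_ids))
```

With the plaintext, a hash-based store would keep one copy. With the ciphertext, it sees two unrelated objects and keeps both.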
Also, if you're going to encrypt data, it's better to do it before it enters your system. Encryption is typically run on a third-party server or appliance such as a Decru® system, so you'd think it would be pretty easy to add after the fact, but you have to think twice before you encrypt data that has already entered into compliance.
Here's the issue: encryption changes the data. If you have unencrypted data already in compliant storage, you can't change it. If you encrypt it, you will need twice the storage, because the original must stay in place alongside the encrypted copy, and a certified audit to confirm that the data itself did not change during the encryption process (in effect, establishing a chain of custody). Some archival and compliance software packages do include an encryption option, but the tradeoff tends to be an impact on the archival system performance or increased server requirements.
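One small technical piece of such a check can be sketched as follows: record a digest of the content before it is encrypted, then confirm that the decrypted copy produces the same digest. This is only an illustration of the idea, not a description of what a certified audit involves. The `stand_in_encrypt` and `stand_in_decrypt` functions are hypothetical placeholders, not real encryption.

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 digest of the content, as a hex string."""
    return hashlib.sha256(data).hexdigest()

def stand_in_encrypt(data: bytes) -> bytes:
    # Placeholder transform standing in for a real encryption step.
    # NOT real encryption; it exists only to make the sketch runnable.
    return bytes(b ^ 0x5A for b in data)

def stand_in_decrypt(data: bytes) -> bytes:
    # Inverse of the placeholder transform above.
    return bytes(b ^ 0x5A for b in data)

original = b"Archived message body"
recorded = digest(original)  # recorded before the data is touched

encrypted = stand_in_encrypt(original)
recovered = stand_in_decrypt(encrypted)

# The stored bytes differ, but the recovered content matches the record.
unchanged = digest(recovered) == recorded
print("stored bytes changed:", encrypted != original)
print("content unchanged after round trip:", unchanged)
```

Both the original and the encrypted copy exist during this process, which is where the doubled storage requirement comes from.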
In a Nutshell: Be Nimble
The only thing certain in archive and compliance right now is that things are going to change further. Although you can't predict every possibility or circumstance, if you utilize open data formats and architectures and avoid limiting your options unnecessarily, you'll be ahead of the game when it comes to deploying a nimble compliance solution that can adapt readily to change.