How Exchange Server 2007 Impacts
Backup, Recovery and DR Strategies
By Robert Quimbey
The introduction of Microsoft Exchange Server 2007 is raising a lot of questions in the minds of Tech OnTap readers. This was made clear in a recent TechTalk Exchange live chat session, where questions about 2007 were prominent. From a storage perspective, there are several new features in Exchange Server 2007 that will affect some storage environments. The most significant ones are:
- Local Continuous Replication (LCR)
- Cluster Continuous Replication (CCR)
- Changes in backend I/O traffic to storage
The introduction of these features raises a number of questions:
- How will Exchange Server 2007 affect my backup strategy?
- How will Exchange Server 2007 affect my disaster recovery (DR) strategy?
- Which storage networking protocol should I use? FCP or iSCSI?
This article answers the questions listed above. To learn more about the other new features of Exchange Server 2007, check out the Microsoft Exchange Server 2007 Web site.
Question #1: Does LCR Replace Backup?
Answer: Definitely Not!
Local Continuous Replication replicates Exchange databases to another set of disks on the same physical server. The question that this immediately raises is, "Does that mean I don't need to do backups any more?" The resounding answer to that question is "No!" Regular backups must still be performed.
The purpose of LCR is not backup; it is high availability (HA). LCR creates a copy of the Exchange database, which provides two sets of the same data. The LCR copy is just slightly behind the primary. Data is written to the primary Exchange log file first, and then, slightly later, that log file is replicated to the LCR copy or target. The trigger for replication is the closing of the log file. The log file is 1MB in size, so after 1MB is written to the primary, it is replicated to the LCR target, and then played into the target database.
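The log-close trigger described above can be illustrated with a toy model. This is only a sketch, not Exchange's actual replication code: the class and method names are hypothetical, and copy and replay are assumed to complete instantly once a log file closes. It shows why the copy always trails the primary by up to one open log file.

```python
LOG_SIZE_BYTES = 1024 * 1024  # 1MB log file size, as stated in the article


class LogShippingSketch:
    """Toy model of log-close-triggered replication (hypothetical names)."""

    def __init__(self):
        self.open_log_bytes = 0    # bytes in the log file still being written
        self.primary_closed = 0    # closed log files on the primary
        self.replica_replayed = 0  # log files copied and replayed on the copy

    def write(self, nbytes):
        """Write to the primary log; ship each log file as soon as it closes."""
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        self.open_log_bytes += nbytes
        while self.open_log_bytes >= LOG_SIZE_BYTES:
            self.open_log_bytes -= LOG_SIZE_BYTES
            self.primary_closed += 1
            self._ship_and_replay()

    def _ship_and_replay(self):
        # Assumption: copy and replay finish immediately after a log closes.
        self.replica_replayed = self.primary_closed

    def unreplicated_bytes(self):
        """Data written to the primary that the copy has not yet received."""
        return self.open_log_bytes


lcr = LogShippingSketch()
lcr.write(2_500_000)  # two full log files plus a partially filled third
print(lcr.replica_replayed, lcr.unreplicated_bytes())
```

In this model the bytes in the open log file are exactly the data that would be missing from the copy if the primary storage failed at that moment.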
If something goes wrong with the primary data store, there is another copy available for use (although slightly behind), but that copy isn't a replacement for backups! Suppose there is a logical corruption in the primary database. As soon as the log fills, the secondary database copy becomes corrupted as well. Similarly, if something is deleted from the primary database, a short time later it is deleted from the secondary. The deleted item is stored in the database (dumpster) by default for 14 days before being expunged during online maintenance. To recover deleted data more than 2 weeks old, a backup is required.
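The retention rule for deleted items can be expressed as a small decision function. This is a simplified sketch: the function name is hypothetical, the 14-day default comes from the paragraph above, and the exact moment of expunging (the next online maintenance pass) is not modeled.

```python
from datetime import date

DUMPSTER_RETENTION_DAYS = 14  # default retention stated in the article


def recovery_source(deleted_on, today, retention_days=DUMPSTER_RETENTION_DAYS):
    """Return where a deleted item would have to be recovered from."""
    if today < deleted_on:
        raise ValueError("today must not be earlier than deleted_on")
    age_days = (today - deleted_on).days
    # The LCR copy mirrors the deletion, so it is never a recovery source here.
    return "dumpster" if age_days <= retention_days else "backup"


print(recovery_source(date(2007, 3, 1), date(2007, 3, 10)))
print(recovery_source(date(2007, 3, 1), date(2007, 4, 1)))
```

Note that neither outcome involves the LCR copy, which is the point of the paragraph: replication propagates deletions rather than preserving them.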
Be aware that LCR does not address database verification. Only incoming log files are inspected. Performing regular backups with verification, such as those performed with NetApp SnapManager® for Exchange, can verify that the production database is healthy. This ensures the success of the backup and the ability to use a consistent database (and log file) image for restoration when required.
Question #2: Does CCR Replace My DR Strategy?
Answer: It Might, But There are Tradeoffs.
As previously discussed, since LCR is focused on storage resiliency, there is no benefit if the server goes down. In addition, if the primary storage fails, the failover is not automated: the administrator has to manually fail over to the target copy. That brings us to Cluster Continuous Replication, which provides Exchange server resiliency. Instead of keeping a copy of the Exchange database on the same machine, the copy is kept on another machine.
The second machine that stores the Exchange database copy is deployed as part of a two-node Microsoft Cluster (MSCS) that includes the original machine. (Right now, only a two-node cluster is supported.) To be in an MSCS cluster, network latency must be below 500ms (heartbeat) to ensure that the cluster nodes can communicate with each other. Next, with CCR, latency and throughput must keep up with the log generation. This is bandwidth-dependent, so the infrastructure may support a distance of anywhere from 1 to 200 miles. Typically, this translates to 60 to 100 miles.
If the primary cluster node fails, the system automatically uses the secondary node running against the replica of the Exchange database. In a controlled failover, where the primary node is still available, all log files are copied to the target, and no data is at risk. In the event of a catastrophic failure of the primary node, CCR (not LCR) attempts to recover from the hub transport server any mail sent through transport that had not yet been replicated at the time of failure.
Messages that had been applied to the primary Exchange database, whose log files had not been replicated, are still potentially available from the transport dumpster on the hub transport servers in the infrastructure. Assuming that the hub transport server was configured properly and didn't also fail when the primary CCR node went down (think site disaster), this data can be recovered. (Note: More I/O per message will be required if transport dumpster is enabled on the Hub Transport server; see Capacity and Transaction IO Requirements for Exchange 2007 Edge Transport and Hub Transport servers).
The storage infrastructure question raised by CCR is, "Does CCR eliminate the need for disaster recovery mechanisms such as NetApp SnapMirror®?" The answer is: maybe. It depends on a number of things:
- Recovery objectives. How much loss of e-mail data is acceptable? Remember that SnapMirror replications are triggered as a result of a SnapManager for Exchange backup, while CCR replication occurs as a result of a filled 1MB log file.
- Ability to support the extra I/O from CCR. CCR imposes additional I/O overhead on top of normal Exchange 2007 I/O. Microsoft recommends that CCR clusters use storage isolated from other servers; see Continuous Replication LUNs. In an isolated environment, the additional I/O overhead of the target LUN would keep up with the source and would not require additional disk performance, although at higher latencies. CCR clusters that use shared storage (shared between clusters) require that the storage (both source and target) be over-provisioned to handle 100% more I/O than the source LUN for CCR deployments. For example, if the source LUN requires 1,000 I/Os, in a shared storage environment the source and target LUNs would need to be provisioned with storage to handle 2,000 I/Os each.
- Distance to the disaster recovery site. Remember that Microsoft Geographically Dispersed Clusters require network latency below 500ms, and Exchange 2007 requires a disk latency below 20ms, which typically restricts distances to less than 100 miles between systems in the cluster. If the DR site is more than 100 miles away, then a solution other than or in addition to CCR is required.
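The provisioning and latency rules in the list above reduce to simple arithmetic. The following is a minimal sketch with hypothetical function names; it reads "I/Os" as I/Os per second and uses only the thresholds quoted in the article (100% over-provisioning on shared storage, 500ms network latency, 20ms disk latency).

```python
def ccr_provisioned_iops(source_iops, shared_storage):
    """Return (source_lun, target_lun) I/O rates to provision for a CCR pair."""
    if source_iops < 0:
        raise ValueError("source_iops must be non-negative")
    if shared_storage:
        # Article's rule: shared storage needs 100% more I/O on both LUNs.
        return (2 * source_iops, 2 * source_iops)
    # Isolated storage: the target keeps up without extra provisioning.
    return (source_iops, source_iops)


def ccr_latency_ok(network_latency_ms, disk_latency_ms):
    """Check the two latency limits the article quotes for a CCR cluster."""
    return network_latency_ms < 500 and disk_latency_ms < 20


print(ccr_provisioned_iops(1000, shared_storage=True))   # (2000, 2000)
print(ccr_provisioned_iops(1000, shared_storage=False))  # (1000, 1000)
print(ccr_latency_ok(network_latency_ms=40, disk_latency_ms=12))
```

The first call reproduces the article's own example: a 1,000 I/O source LUN on shared storage leads to 2,000 I/Os provisioned on each side.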
Some NetApp customers are considering running both CCR and SnapMirror. These IT teams plan to use CCR in the local area for high availability as well as offloading the backup and verification activities, and SnapMirror to replicate Exchange data to a remote disaster recovery site. (An unfortunate lesson of Hurricane Katrina is that it's probably a good idea to have the DR site more than 100 miles away if at all possible.)
The key takeaway is to deploy those infrastructure components that make the most sense to meet the availability needs. NetApp software products, including SnapManager for Exchange 4.0 and Single Mailbox Recovery 4.2, include support for Exchange Server 2007 today and integrate with both LCR and CCR to make this support easier. For example, running CCR to replicate Exchange data to a secondary node and running SnapManager for Exchange on the secondary node to do the backup and verification tasks ensure that these tasks don't affect the primary node.
Question #3: Should I Use FC SAN or iSCSI for Exchange Server 2007?
Answer: It Depends on the Infrastructure, but iSCSI is Usually Appropriate.
The final question that a lot of IT teams are asking involves which protocol is best suited to supporting an Exchange infrastructure. This isn't strictly limited to Exchange Server 2007, but some of the changes in Exchange Server 2007 should increase the comfort level with iSCSI.
The first thing that concerns some people about iSCSI is that it typically runs at only 1Gb/sec, while Fibre Channel runs at 2 or 4Gb/sec. Further, Fibre Channel networks are predictable with regard to bandwidth consumption and network access. The typical Ethernet network deployed for iSCSI is not predictable in this way due to collision avoidance. There's less chance to send data on a busy network, but the way people normally implement these networks (with a dedicated connection), it's not really an issue.
Both Microsoft and NetApp best practices for deploying iSCSI involve using dedicated connections.
The second thing about iSCSI versus Fibre Channel is that on FCP a lot of the protocol handling is done on specialized hardware, while most people running iSCSI tend to use the Microsoft iSCSI software initiator. This processing consumes CPU cycles on the server to process the protocol, and those cycles may be important for other activities on the server.
There are, however, iSCSI HBAs that offload iSCSI protocol processing from server CPUs in a manner similar to Fibre Channel HBAs. People usually choose to use these for one of two reasons:
- Booting over iSCSI typically requires an iSCSI HBA. (The network protocols need to be resident so the server can talk to storage to get the server boot image.) The iSCSI HBA looks to the server just like any other disk controller.
- The server CPU cycles that iSCSI protocol processing consumes are needed for something else. An iSCSI HBA doesn't necessarily provide better network performance, but it does give back CPU cycles for running more Exchange users. With newer, faster servers with 64-bit CPUs, multiple cores, and back-end buses, there is typically more than enough CPU power to do both jobs, so this reason is becoming rarer and rarer. Keep in mind, an iSCSI HBA is needed, not a TOE (TCP/IP Offload Engine) card, which offloads only TCP/IP processing and still leaves iSCSI processing to the host.
Still, some people voice concerns about using iSCSI because it does not provide the raw performance of Fibre Channel. For small-block I/O applications, however, the critical factor is IOPS with low latency, not bandwidth. Nick Triantos explored this topic in detail in Choosing between iSCSI and Fibre Channel SANs.
Testing has shown that bandwidth doesn't really matter with transactional I/O in Exchange because a single machine can't generate enough load to flood a 1GbE pipe, and that pipe is dedicated between the server and storage. It's possible that if there are a lot of servers sharing a single 1GbE connection to storage (possibly a bad idea), the connection could become saturated and turn into a bottleneck. However, even in this case it is possible to add more 1GbE connections to increase the bandwidth to the storage device.
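A rough calculation shows why small-block transactional I/O rarely fills a 1GbE link. The sketch below is an estimate under stated assumptions: an 8KB I/O size, a nominal line rate with protocol overhead ignored, and a purely illustrative figure of 3,000 I/Os per second for one server. The function name is hypothetical.

```python
GBE_BITS_PER_SEC = 1_000_000_000  # nominal 1GbE line rate, overhead ignored
IO_BYTES = 8 * 1024               # assumption: 8KB per transactional I/O


def link_utilization(iops, io_bytes=IO_BYTES, link_bps=GBE_BITS_PER_SEC):
    """Fraction of the link consumed by a given rate of fixed-size I/Os."""
    if iops < 0 or io_bytes <= 0 or link_bps <= 0:
        raise ValueError("iops must be >= 0; io_bytes and link_bps must be > 0")
    return iops * io_bytes * 8 / link_bps


# Illustrative load only: 3,000 I/Os per second from one server.
print(f"{link_utilization(3000):.1%} of a 1GbE link")
```

Under these assumptions the link would not saturate until roughly 15,000 such I/Os per second, which is why latency and IOPS capacity of the storage matter more than raw bandwidth for this workload.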
Additionally, with Exchange Server 2007, the actual amount of disk I/O is significantly decreased over prior versions of Exchange. This change in I/O is due to a couple of factors: on 64-bit hardware, additional memory is available to use for database caching, thus reducing I/O; and changes to the Exchange database and the internal I/O activities of Exchange further reduce I/O.
A 1GbE pipe can run into trouble with non-transactional I/O. Streaming online backups, VSS integrity checks, and offline repair or defrags all depend on disk throughput. For these activities, if 1GbE does not meet required SLAs (for backup, restore, or repair), more HBAs or network cards should be used with MPIO to increase throughput. Version 2 of the Microsoft iSCSI Software Initiator supports MPIO.
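For throughput-bound work, the question is whether the transfer fits the SLA window. The following sketch estimates transfer time over one or more 1GbE links. The function name, the 80% efficiency factor, the 500GB data size, and the assumption that MPIO scales throughput linearly with link count are all illustrative, not measured values.

```python
def transfer_hours(data_gb, links=1, link_gbps=1.0, efficiency=0.8):
    """Estimate hours needed to move data_gb gigabytes over the given links."""
    if data_gb < 0 or links < 1 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    # Assumption: MPIO spreads load evenly, so usable bandwidth adds up.
    usable_gbps = links * link_gbps * efficiency  # gigabits per second
    seconds = data_gb * 8 / usable_gbps           # gigabytes -> gigabits
    return seconds / 3600


# Illustrative case: 500GB of Exchange data, one link versus two.
for n in (1, 2):
    print(f"{n} link(s): {transfer_hours(500, links=n):.2f} hours")
```

Comparing the result against the backup or restore SLA gives a first indication of how many paths are needed; real throughput also depends on the disks and the backup method.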
NetApp, of course, supports both protocols so feel free to choose whichever option makes the most sense in your storage infrastructure.
Exchange Server 2007 and Your Storage Infrastructure
The main thing to remember about Exchange Server 2007 is that although LCR and CCR are valuable additions to the overall toolset and can greatly increase Exchange availability, there is still an absolute requirement for Exchange backups. In addition to basic data protection, many companies have internal requirements to maintain off-site backups, and some government agencies and legal regulations require it as well. Exchange database verification is another critical component to a healthy messaging environment that occurs during both VSS and streaming online backup.
Additionally, although the new features associated with Exchange Server 2007 help provide high availability, HA alone can't ensure that Exchange is always up. A complete disaster recovery plan is still required and, depending on requirements, a mirroring solution may be required. Ultimately, these decisions, which affect the length of downtime, will be influenced by the level of acceptable risk and the amount of money available for the solution.