Zone-level deduplication is a first in the market for disk-based backup with deduplication. It is the first truly scalable deduplication architecture that includes both generic and content-aware methods of data deduplication.
Prior to ExaGrid's entry into the market, first-generation disk-based backup appliances used generic block-level algorithms to perform deduplication. Because blocks must exactly match other blocks to be deduplicated, these implementations cut data into very small objects, such as 8 KB or 16 KB. As a result, they suffer from significant scaling limitations caused by the size of their tracking tables. For example, a product using an 8 KB object size would generate 1.25 billion objects to track when storing as little as 10 TB of data. Tables of this size are impractical to distribute across multiple servers, so the block-level method limits vendors to systems that employ single servers (called controllers) plus disk shelves, or to small fixed-size appliances with no expansion.
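The object-count figure above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes), which is what reproduces the 1.25 billion figure. The 32-byte tracking entry (`ENTRY_BYTES`) is a hypothetical value chosen only to illustrate how quickly a tracking table grows; it is not a figure from any vendor.

```python
# Rough sizing of a block-level tracking table.
# Assumptions (not from the source): decimal units and a hypothetical
# 32-byte entry (hash plus location) per tracked object.
KB = 10**3
TB = 10**12
ENTRY_BYTES = 32  # hypothetical size of one tracking-table entry


def object_count(data_bytes: int, object_bytes: int) -> int:
    """Number of fixed-size objects needed to cover data_bytes (rounded up)."""
    return -(-data_bytes // object_bytes)


def tracking_table_bytes(data_bytes: int, object_bytes: int) -> int:
    """Approximate table size if every object needs one entry."""
    return object_count(data_bytes, object_bytes) * ENTRY_BYTES


for size_kb in (8, 16):
    count = object_count(10 * TB, size_kb * KB)
    table_gb = tracking_table_bytes(10 * TB, size_kb * KB) / 10**9
    print(f"{size_kb} KB objects: {count:,} objects, ~{table_gb:.0f} GB table")
```

Under these assumptions, halving the object size doubles both the object count and the table size, which is the scaling pressure the paragraph describes.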
In some instances, vendors, especially those that embed deduplication in backup applications, chose larger object sizes to combat this problem. However, the result was a minimal improvement in hash table management and a far lower deduplication ratio (typically 10 to 1 versus the 20 to 1 found with smaller objects).
Truly, there was a fork in the road with deduplicating products, and customers had to choose one or the other path.
Byte-Level vs. Block-Level Deduplication Technology
ExaGrid entered the market with its highly scalable byte-level deduplication algorithm. Because byte-level delta can deduplicate by finding objects that are similar to other objects, rather than requiring exact matches, the algorithm can use much larger object sizes, typically 8 MB to 50 MB. At those sizes, 10 TB of primary data divides into roughly 200,000 to 1.25 million objects, versus more than a billion at 8 KB. This significant architectural difference is what allows ExaGrid to provide a highly scalable, GRID-based architecture composed of multiple server nodes, versus the single controller and disk shelves found in the block-level products.
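The difference between exact matching and similarity-based delta can be illustrated with a small sketch. This is not ExaGrid's algorithm, which the text does not disclose; it uses Python's standard `difflib.SequenceMatcher` as a stand-in delta encoder, and the helper names (`make_delta`, `apply_delta`, `literal_bytes`) are hypothetical. The objects here are about a kilobyte rather than 8 MB to 50 MB, purely to keep the example fast.

```python
import difflib
import hashlib


def make_delta(base: bytes, new: bytes) -> list:
    """Describe `new` as copies from `base` plus literal inserted bytes."""
    matcher = difflib.SequenceMatcher(None, base, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reuse bytes already stored
        elif j2 > j1:
            delta.append(("insert", new[j1:j2]))  # store only the new bytes
    return delta


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the new object from the base object and its delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(delta: list) -> int:
    """Bytes that must actually be stored for the new object."""
    return sum(len(op[1]) for op in delta if op[0] == "insert")


# Two objects that are similar but not identical.
base = b"".join(b"record %04d: payload\n" % i for i in range(50))
new = base.replace(b"record 0025", b"RECORD 0025")

# Exact matching fails: the hashes differ, so the whole object would be stored again.
exact_match = hashlib.sha256(base).digest() == hashlib.sha256(new).digest()

delta = make_delta(base, new)
print("exact match:", exact_match)
print("object size:", len(new), "bytes; literal bytes in delta:", literal_bytes(delta))
```

With exact matching, a one-word change forces the whole object to be stored again, which is why block-level products shrink the object size. With a delta against a similar object, only the changed bytes are stored, so the object can stay large.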
One other benefit (and limitation) of early byte-level deduplication implementations is that they used a content-aware approach to deduplicating data. As a result, ExaGrid's disk backup appliance can provide unique reports to customers that include elements of the data itself. However, the content-aware approach requires more knowledge about the data than generic methods do, which has limited the number of applications and use cases that ExaGrid has historically supported.
While the block-level and byte-level methods both deliver on the promise of disk space and bandwidth savings, each required a trade-off: flexibility of block versus scalability of byte.
The obvious right answer would be to create a generic method based on a byte-level foundation which preserves the inherent scalability of byte-level deduplication. That is exactly what ExaGrid has done.
The Introduction of Zone-Level Deduplication
With the addition of an advanced, generic mode of byte-level delta, ExaGrid has produced the first truly generic method for using the scalable byte-level delta algorithm. With the advent of zone-level deduplication, ExaGrid is the first and only vendor to:
- Produce a generic deduplication algorithm that is scalable.
- Provide customers with the option to choose between a generic and a content-aware deduplication mode in the same appliance.
- Extend use of the byte-level algorithm more broadly into backup, archive, and near-line applications.
Zone-level deduplication employs an advanced technique to identify objects within a body of data that are similar to objects it has previously seen, which creates targeted opportunities to deduplicate the new objects against the earlier ones. A key aspect of the generic mode within zone-level deduplication is that it does this without any knowledge of the format or makeup of the data on which it is operating. Zone-level deduplication therefore includes a truly generic way of identifying deduplication opportunities.
But perhaps most important is that zone-level deduplication combines generic and content-aware deduplication into a highly scalable GRID-based architecture. This avoids the significant scalability problems found in all of the products that employ the block-level methods. The scalability challenges with block-level architectures include:
- Costly controller components that require customers to purchase more power than they need up front, at great expense, versus buying just the power that they need.
- Increase in backup window, restore time, and processing time due to data growth as systems are expanded by just adding capacity, versus adding network ports, CPU, and memory as well.
- Expensive forklift upgrade points when a controller is outgrown. When maxed out, even the smallest increase in capacity requires the purchase of a more expensive controller.
- In the case of small, fixed appliances, the customer is forced to manage appliance sprawl. Each appliance is its own isolated island of storage that has to be managed separately.
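The backup-window point in the list above comes down to simple arithmetic: window = data / ingest throughput. The sketch below compares a scale-up system, where throughput is fixed by the controller, with a scale-out system, where each added node brings its own throughput. All three constants are hypothetical values chosen for illustration, not measurements of any product.

```python
import math

# Hypothetical figures, for illustration only.
CONTROLLER_TB_PER_HOUR = 2.0  # fixed ingest rate of a single controller
NODE_TB_PER_HOUR = 1.0        # ingest rate contributed by each grid node
NODE_CAPACITY_TB = 10         # full-backup size one grid node is sized for


def scale_up_window_hours(data_tb: float) -> float:
    """Disk shelves add capacity only, so throughput stays constant."""
    return data_tb / CONTROLLER_TB_PER_HOUR


def scale_out_window_hours(data_tb: float) -> float:
    """Each node adds capacity and throughput together."""
    nodes = math.ceil(data_tb / NODE_CAPACITY_TB)
    return data_tb / (nodes * NODE_TB_PER_HOUR)


for data_tb in (10, 20, 40):
    print(
        f"{data_tb} TB: scale-up {scale_up_window_hours(data_tb):.0f} h, "
        f"scale-out {scale_out_window_hours(data_tb):.0f} h"
    )
```

With these assumed numbers, the scale-up window grows in proportion to the data, while the scale-out window stays flat because compute and network grow along with capacity.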
Zone-level deduplication finally delivers both the scalability found in byte-level deduplication and the flexibility offered by block-level methods.
Scalable GRID Architecture
Any of the appliance models can be mixed and matched in a single GRID configuration of up to 480 TB of raw capacity, which accommodates full backups of up to 210 TB. These core ExaGrid disk-based backup appliances include GRID computing software that allows them to virtualize into one another when plugged into a switch. Once virtualized, they appear as a single pool of long-term capacity. Capacity load balancing of all data across servers is automatic, and multiple GRID systems can be combined for additional capacity. Even though data is load balanced, deduplication occurs across the systems, so data migration does not reduce the effectiveness of deduplication.
ExaGrid's unique approach to scalability provides the following benefits:
- Performance is maintained as your data grows - each additional ExaGrid server added to a system provides disk, processor, memory and bandwidth
- Plug-and-play expansion - adding an additional ExaGrid server is as simple as plugging it in and letting ExaGrid's automatic virtualized GRID software do the rest
- Cost-effective and flexible solution with no "forklift" upgrades - modular systems are easily combined in a virtualized GRID to scale up smoothly to larger capacities as needed, and there is no need to over-buy storage capacity up front
- Capacity utilization is load-balanced across servers - as a single server reaches full utilization, it can leverage space available on other servers in the GRID
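The capacity load balancing described above can be sketched as a placement policy over a pool of nodes. This is a minimal illustration, not ExaGrid's implementation: the `Node` class, the `place` function, and the "most free space wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical grid node with a fixed capacity in TB."""
    name: str
    capacity_tb: float
    used_tb: float = 0.0

    @property
    def free_tb(self) -> float:
        return self.capacity_tb - self.used_tb


def place(nodes: list, size_tb: float) -> Node:
    """Store data on the node with the most free space (assumed policy)."""
    target = max(nodes, key=lambda n: n.free_tb)
    if target.free_tb < size_tb:
        raise RuntimeError("grid is full; add another node")
    target.used_tb += size_tb
    return target


# One node is nearly full, so new data lands on the emptier one.
grid = [Node("node-a", 10, used_tb=9.5), Node("node-b", 10, used_tb=2.0)]
first = place(grid, 1.0)

# Adding a node makes its space available to the whole pool immediately.
grid.append(Node("node-c", 20))
second = place(grid, 1.0)

print(first.name, second.name)
```

The point of the sketch is that the backup job never has to know which node holds the data; the pool decides, so a full node does not become an isolated island of storage.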