I've said many times that nowhere in the field of information technology are the words "your mileage may vary" truer than when discussing data deduplication. How much your data shrinks when run through a given vendor's deduplication engine can vary significantly depending on the data you're trying to dedupe and how well that particular engine handles that kind of data. One critical factor is how the deduping engine breaks your data down into chunks.
Most data deduplication engines work by breaking the data into chunks and using a hash function to identify which chunks contain the same data. Once the system has identified duplicate data, it stores one copy and uses pointers in its internal file (or chunk) management system to keep track of where each chunk belonged in the original data set.
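To make that concrete, here's a minimal sketch in Python of the hash-and-pointer part of the process. It assumes SHA-256 as the hash function and an in-memory dictionary standing in for the appliance's chunk store; real engines use on-disk indexes, handle hash collisions, and usually compress the chunks as well.

```python
import hashlib

chunk_store = {}   # hash -> unique chunk data (one stored copy per chunk)

def dedupe(chunks):
    """Store each unique chunk once; return the list of hashes ("pointers")
    needed to reconstruct the original stream."""
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:      # first time we've seen this data
            chunk_store[digest] = chunk
        recipe.append(digest)              # duplicate or not, record the pointer
    return recipe

def rehydrate(recipe):
    """Rebuild the original data from its list of chunk pointers."""
    return b"".join(chunk_store[digest] for digest in recipe)
```

The chunks themselves can come from either of the chunking strategies described next.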
While most deduplication systems use this basic technique, the details of how they decide what data goes into what chunk vary significantly. Some systems just take your data and break it into fixed-size chunks. The system may, for example, decide that a chunk is 8KBytes or 64KBytes and then break your data into 8KByte chunks, regardless of the content of the data.
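A fixed-size chunker is the simpler of the two; a sketch of it might look like this, with the 8KByte size chosen arbitrarily for illustration.

```python
def fixed_chunks(data: bytes, size: int = 8 * 1024):
    """Yield fixed-size chunks (8KBytes here), ignoring the data's content."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]
```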
Other systems analyze the data mathematically, using the spots where their secret chunk-making function generates particular values as the boundaries between data chunks. On these systems, chunk sizes vary with the magic formula but within some limits, so chunks may run from 8KBytes to 64KBytes, depending on the data.
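Each vendor's magic formula is its own; the sketch below uses a gear-style rolling hash purely as a stand-in, not any particular vendor's method. It cuts a chunk wherever the low bits of the rolling hash hit a chosen pattern, with a floor of 8KBytes and a ceiling of 64KBytes.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random byte table

MIN_CHUNK = 8 * 1024      # 8KBytes
MAX_CHUNK = 64 * 1024     # 64KBytes
BOUNDARY_MASK = 0x1FFF    # boundary fires roughly once per 8KBytes of data

def variable_chunks(data: bytes):
    """Yield content-defined chunks: cut wherever the rolling hash matches the
    boundary pattern, but never below MIN_CHUNK or above MAX_CHUNK."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]                # whatever is left at the end
```

Either chunker can feed the dedupe() sketch above; the advantage of the variable-size approach is that chunk boundaries follow the content, so they tend to line up again after data shifts by a few bytes.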
If we implement these two techniques on backup appliances and back up a set of servers with a conventional backup application like NetBackup or Arcserve, the backup app will walk the file system and concatenate the data into a tape-format file on the backup appliance.