2

What is the exact difference between data deduplication and data compression?

As far as I know, data deduplication means that when we have exact copies of the same data, either the same block (block-level deduplication) or the same file (file-level deduplication), only one copy is preserved in storage, and the reference count on that copy is incremented each time the block or file is used by a different user.

But how does compression work internally?

Please help me out with this. Thanks in advance.

s.patra
  • 139
  • 1
  • 8
  • There's so much information on this out on the Internet. Google e.g. `data compression and data deduplication`. Read the [Wikipedia article on data compression.](https://en.wikipedia.org/wiki/Data_compression) The possibilities are endless. – Pekka Feb 14 '16 at 10:07
  • Thanks for the link, but I want an answer about how it works internally. In deduplication, only one block is preserved and the ref count keeps increasing as the number of users grows. How does it work in the case of compression? – s.patra Feb 14 '16 at 10:42
  • Deduplication removes redundant data blocks, whereas compression removes additional redundant data within each data block. These techniques work together to reduce the amount of space required to store the data. -https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.virtualsan.doc/GUID-3D2D80CC-444E-454E-9B8B-25C3F620EFED.html – Shiwangini May 24 '20 at 18:10

2 Answers

11

The short answer is that deduplication can be considered a highly specialized form of compression, targeting a particular context. The long answer comes next.

Before contrasting these techniques, let's talk a bit about how typical compression works.

Compression

Compression itself is extremely varied. You have lossy compression algorithms, such as JPEG and MP3, which use a model of how we see or hear to throw away information that contributes less to the perceived image or sound, at the cost of some quality. Based on your question, these techniques are mostly out of scope here.

You are probably mostly concerned with what we'd call general-purpose lossless algorithms, such as zip, LZMA, LZ4, etc., which compress arbitrary files in a reversible way. Usually these compress files using several of the techniques in the non-exhaustive list below:

  1. Match Finding. Finding redundancies in the input (strings of repeated bytes) and replacing the repetition with shorter sequences. For example, such algorithms might encounter the string:

    developers developers developers developers

and then replace that with something like:

    developers (0,11)(0,21)

Where (0,11) means "re-use the 11 characters starting at position 0". That's known as "match finding" or LZ77-style compression and is straightforward.
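As a toy sketch of the idea (not any real format such as DEFLATE - the token layout here is invented purely for illustration), here is a decoder for a stream of literals and (position, length) copy tokens like the one above:

    def lz77_decode(tokens):
        # A token is either a literal string, or a (pos, length) tuple meaning
        # "copy `length` characters starting at absolute position `pos` of the
        # output produced so far".
        out = []                              # output characters so far
        for tok in tokens:
            if isinstance(tok, tuple):        # a (pos, length) copy token
                pos, length = tok
                for i in range(length):       # copy one character at a time,
                    out.append(out[pos + i])  # so overlapping copies also work
            else:                             # a literal string
                out.extend(tok)
        return "".join(out)

    print(lz77_decode(["developers ", (0, 11), (0, 21)]))
    # -> developers developers developers developers

A real compressor also has to do the harder half: finding the matches in the first place, typically using hash tables over recently seen input.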

  2. Entropy Coding. You might start with a string like:

    AABCABBCABACBAAACBCCAABAAACBAA

This looks pretty random, right? You might notice, however, that some letters appear more often than others - A appears about twice as often as B or C, and the other letters don't appear at all!

Using that information, you can choose an encoding that represents the characters in the string with fewer bits; e.g., A may be encoded as the single bit 0, while B and C are assigned 10 and 11 respectively. If you were originally using 8 bits per character, that is a big saving.
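A minimal sketch of that idea in Python, using the code table from the example above (a real entropy coder would derive the table from the measured frequencies, e.g., with Huffman's algorithm, and would emit actual bits rather than a string of '0'/'1' characters):

    from collections import Counter

    s = "AABCABBCABACBAAACBCCAABAAACBAA"
    print(Counter(s))                    # A appears 15 times, B 8 times, C 7 times

    # Frequent symbols get short codes; the table below is prefix-free,
    # so the bit stream can be decoded unambiguously.
    code = {"A": "0", "B": "10", "C": "11"}
    bits = "".join(code[ch] for ch in s)

    print(len(s) * 8, "bits as plain 8-bit characters")   # 240
    print(len(bits), "bits after entropy coding")         # 45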

  3. Modeling

Most data has complex relationships that aren't necessarily well compressed by the simple techniques above, but rather need some type of model. For example, you may have a model that predicts the value of a pixel in an image based on its neighboring pixels. You may have a model that predicts the most likely next word in a sentence based on the sentence so far. For example, if I said: Who let the dogs ___, you would probably be able to fill in the blank with high accuracy.
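As a tiny illustration of the "model plus residual" idea (the numbers and the predictor are made up for illustration; real codecs use far better models), here is a predictor that simply guesses each sample equals the previous one and stores only the prediction error:

    # Predict each sample as equal to the previous one (a crude model for
    # smooth data such as an image scanline) and keep only the error.
    samples = [100, 101, 103, 103, 104, 106, 109, 110]
    residuals = [samples[0]] + [cur - prev for prev, cur in zip(samples, samples[1:])]
    print(residuals)                     # [100, 1, 2, 0, 1, 2, 3, 1]

    # The residuals cluster near zero, so a later entropy-coding stage can
    # store them in far fewer bits than the raw samples.

    # Decoding reverses the prediction: add each residual to the running value.
    decoded = []
    for r in residuals:
        decoded.append(r if not decoded else decoded[-1] + r)
    assert decoded == samples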

None of these are mutually exclusive - they are often used in complementary fashion and there are additional techniques not mentioned above.

Now, before we discuss what deduplication is, exactly, it's worth noting typical characteristics of compression algorithms. These are not absolute rules, but are common characteristics of many compression algorithms, unless they have been specifically designed to avoid them:

No simple relationship between the input bytes and output bytes.

The input and the output are related in a complex way (unlike, say, Base-64 encoding, where every 3 contiguous input bytes correspond, in order, to 4 contiguous output bytes). The implications are as follows:

  • You often cannot simply take compressed data and decompress an arbitrary portion of it, such as "decompress the last 500 bytes of this file". You may need to read the entire compressed file from the beginning or at least start from some well-known point in the stream.

  • Modification of the uncompressed input may have arbitrarily large impacts on the compressed output. For example, changing a single byte in the input may change every subsequent byte in the output. This often means it is difficult to update a large compressed stream incrementally (i.e., based on modifications to the input); the sketch below shows this ripple effect.
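You can see the second point for yourself with any general-purpose compressor; here is a quick illustration using Python's zlib (the exact byte counts will vary with the library and compression level):

    import zlib

    a = b"the quick brown fox jumps over the lazy dog " * 200
    b = bytearray(a)
    b[10] ^= 0xFF                         # flip a single byte near the start

    ca, cb = zlib.compress(a), zlib.compress(bytes(b))
    diverge = next((i for i, (x, y) in enumerate(zip(ca, cb)) if x != y),
                   min(len(ca), len(cb)))
    print(len(ca), "and", len(cb), "compressed bytes")
    print("the two outputs agree only for the first", diverge, "bytes")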

Deduplication

So given the above definition and discussion of compression, what is usually meant by deduplication?

Today, you usually hear about deduplication in the context of storage devices or architectures. It is a way of, for example, saving disk space when large amounts of duplicate data are present (imagine, for example, having 100 VM images on a SAN - there is likely to be a lot of duplication among the operating system files and other common files on each VM).

Deduplication is a way of storing this redundant data only once. Essentially, it implements technique (1) above, on a large scale, without some of the limitations discussed above. So it is simply a form of compression that operates on large blocks, across an entire drive, an entire storage host, or even a cluster of networked machines.

Now you can't just "gzip" the whole drive, however, because deduplication should be transparent, both functionally and performance-wise. The APIs offered by the file system (e.g., POSIX or Win32) allow users to write to arbitrary parts of a file. If a user modifies 1 byte in a 1 GB file, they would be surprised if it took a minute or more because the entire file had to be decompressed and then recompressed.

So deduplication works in a way that preserves random access into a file, e.g., by keeping an index such that the location of any byte can still be found. This usually means that deduplication only works with large match (block) sizes, or else the cost of tracking the blocks becomes prohibitive. Some systems only detect duplicates that meet other criteria, such as having the same alignment within files.

Deduplication generally happens transparently (the user of the file system is not aware of it), and it may also happen asynchronously: i.e., when new data is written, it is initially treated as unique, and only later will it be checked for duplication, and possibly merged with existing data.
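Here is a toy sketch of block-level deduplication with reference counts, along the lines described in the question. The BlockStore class, the fixed 4 KiB block size, and the SHA-256 content hash are assumptions made purely for illustration; a real system would also persist its index, handle deletes and refcount decrements, and worry a lot more about performance.

    import hashlib
    from collections import defaultdict

    BLOCK_SIZE = 4096                          # fixed-size blocks, e.g. 4 KiB

    class BlockStore:
        """Toy dedup store: identical blocks are kept once, with a reference
        count, and a "file" is just the list of its block hashes."""

        def __init__(self):
            self.blocks = {}                   # hash -> block bytes (stored once)
            self.refcount = defaultdict(int)   # hash -> number of references

        def write_file(self, data):
            recipe = []
            for off in range(0, len(data), BLOCK_SIZE):
                block = data[off:off + BLOCK_SIZE]
                h = hashlib.sha256(block).hexdigest()
                if h not in self.blocks:
                    self.blocks[h] = block     # first time this content is seen
                self.refcount[h] += 1          # every reference bumps the count
                recipe.append(h)
            return recipe

        def read_file(self, recipe):
            return b"".join(self.blocks[h] for h in recipe)

    store = BlockStore()
    image = b"A" * 8192 + b"B" * 4096          # two "VM images" sharing blocks
    clone = b"A" * 8192 + b"C" * 4096
    r1, r2 = store.write_file(image), store.write_file(clone)
    print(len(store.blocks), "unique blocks stored for",
          len(r1) + len(r2), "block references")   # 3 unique blocks, 6 references
    assert store.read_file(r1) == image and store.read_file(r2) == clone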

In short, deduplication can be thought of as a specific application of one type of compression, tuned to the domain it is used in: it avoids some limitations of typical compression algorithms and keeps acceptable performance, but at the cost of only removing large duplicated regions and generally eschewing other compression opportunities such as (2) entropy coding or (3) modelling.

BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
1

For NetApp compression & dedupe specifics, look at "NetApp Data Compression and Deduplication Deployment and Implementation Guide". The short answer, for TL;DR types, is that dedupe works on the 4k WAFL block level, on a per-volume basis, while compression works on up to 32k compression groups, on a per-file basis (but can only be enabled/disabled per entire volume). Both compression and dedupe can be run either inline and/or post-process as of the latest CDOT release.

Richard Erickson
  • 2,438
  • 8
  • 24
  • 36