The code monkey's guide to cryptographic hashes for content-based addressing

By Valerie Aurora

Originally appeared in LinuxWorld, 11/12/07, revised 6/25/09


It was 1996, the bandwidth between Australia and the rest of the world was miserable, and Andrew Tridgell had a problem. He wanted to synchronize source code located in Australia with source code on machines around the world, but sending patches was annoying and error-prone, and just sending all the files was painfully slow. Most people would have just waited a few years for trans-Pacific bandwidth to improve; instead Tridgell wrote rsync, the first known instance of content-based addressing (also known as content-addressed storage, or compare-by-hash), an innovation which eventually spread to software like BitTorrent, git, and many document storage products on the market.

Used properly, content-based addressing dramatically reduces bandwidth use, simplifies code, and improves security. But used when it doesn't make sense, it can reduce performance, increase bandwidth usage, and create new security risks. Figuring out when content-based addressing is good and when it is very, very bad requires an understanding of both systems programming and mathematics. What's a programmer to do? In this article, we'll lay out the considerations in language that any programmer (and even some managers) can understand.

Content-based addressing: The basics

Let's start with an example problem. Say you have an MP3 music collection, and your friend sends you a link to a new MP3 but doesn't give you the title. If it's already in your collection, you don't want to bother downloading and storing a second copy, so you want to quickly find out if it is identical to one of your other MP3s. If you only have a few MP3s, you'll just directly compare the new MP3 to each of your existing MP3s. This takes an amount of time proportional to the number of MP3s you already have, and if you have a lot of MP3s (and we know you do), it will run too slowly and you'll look for a smarter algorithm. A method for quickly finding a particular item that is nearly as old as computer science is called the hash table. There are a thousand variations on the hash table, but we'll describe a very simple example, and then use this to understand the tradeoffs of content-based addressing.

Instead of directly comparing the entire MP3 with every other MP3, we could run a function that takes the MP3 data as its input, and outputs a fixed-size signature based on the input. The goal is to generate something like a thumbnail version of a digital photo - the signature doesn't have all the information in the original, but it's a useful hint as to what the whole file contains. For example, we could just XOR together every 4 bytes together - the core of the CRC32 hash function. Hash functions are designed to generate these signatures, or hash values. The same input will always have the same hash value, and conversely, different hash values will always have different inputs. So, instead of comparing all the file data, you'll keep a table of hash values of the MP3s you already have. Then you can ask your friend to send you the CRC32 of the new MP3 and look it up in your hash table. If you don't have an MP3 with that hash value, then you know you don't have your friend's MP3, and it's worthwhile downloading.

Hash collisions

What happens if the hash value of the new MP3 matches one already in your database? While the same input always has the same hash value, different inputs may have identical hash values - a hash collision. For example, the 4-byte XOR of an input consisting of the two 32-bit numbers 0x12340000 and 0x00005678 will be 0x12345678, and the 4-byte XOR of 0x12345678 and 0x00000000 will also be 0x12345678. Because the number of inputs is infinite, and the number of hash values are finite, collisions are inevitable. In fact, some of your own MP3s may already have colliding hash values. The easiest way to deal with this is to make each entry in the hash table a list of MP3s that have that same hash value, along with pointers to their entire data. When you find two MP3s with the same hash value, you compare the whole files to see if they are really the same, or just have the same hash value. Now, instead of comparing with every file, you only compare the MP3 with the few files that have the same hash value - much faster than comparing directly with every single file.

Now comes the interesting part. Sometimes, comparing the file against another file even once is prohibitively expensive, as when the files are located on different machines separated by low-bandwidth networks - say you're at an Antarctic research station and your friend's MP3 has to be trickled through the shared satellite link at a few kilobits per second. (Or maybe you're just still on dial-up.) If you had her send you the hash of the MP3, and you didn't have any file with that same hash, you'd know it was worth downloading the MP3. The problem is when the hash matches up to an existing MP3 in your database - how do you know whether or not they are the same MP3 without slowly and painfully downloading the new MP3 and comparing it? Hash functions have collisions, it's a fact of life. But if you could find a hash function that almost never had collisions, then you could, with very high probability, be sure that if the hashes are the same, the files they represent are also the same. If that probability is high enough, you would feel safe assuming that a collision will never occur in practice (bravely accepting the risk of never hearing that hi-lar-i-ous Weird Al Yankovic song).

Enter cryptographic hash functions

Fortunately, hash functions that are collision-resistant (simply put, very difficult to find inputs with same output) have already been developed for another purpose - they are called cryptographic hash functions and are used for message authentication, digital signatures, obscuring passwords, and more. As programmers, we can "borrow" these functions and use them for other purposes. So now your algorithm for detecting whether two MP3s are the same is very simple: Calculate and store the cryptographic hash of all your MP3s. When you want to know if your friend's MP3 is new, you ask for its cryptographic hash and then compare it to the hashes of all your other MP3s. If it is the same as any of the existing MP3s, you assume it's the same MP3 and don't bother downloading or storing it. Otherwise, you start the download and kill time photographing the penguins outside the research station.

This, in a nutshell, is what is variously called content-based addressing (CBA), compare-by-hash (CBH), or content-addressed storage (CAS). Just calculate the cryptographic hash of a piece of data (often at a granularity smaller than an entire file) and use the resulting hash value as the address of the data. Besides saving on network bandwidth, CBA eliminates the need for a top-down address assignment framework - everyone calculates the same addresses without having to talk to anyone else - which is useful for peer-to-peer networks like BitTorrent and distributed hash tables. It also simplifies implementation - saving state is optional, since all the addresses can be regenerated from the content itself.

Content-based addressing also can have serious downsides, depending on the application and the state of cryptographical research. First, cryptographic hashes are expensive to compute, so in many applications where the cost to transfer or compare data is low, CBA will only slow down the application. If you wanted to compare MP3s on your local hard drive, more mundane techniques, such as traditional hash tables, would be much faster than CBA in most cases. Second, the collision-resistance of a cryptographic hash function degrades over time - the cryptographic hash function that was collision-resistant when the program was first written will no longer be collision-resistant in a few years, after other cryptographers figure out how to find hash collisions quickly (hence, the term "collision-resistant" rather than "collision-free").

In the rest of this article, we'll first review cryptographic hash functions, examining their origin, computational cost, mathematical properties, and "life cycle" as they transition from newly proposed to reliably collision-resistant to broken. Then we'll explore in detail how to judge when content-based addressing will improve an application, and what effect the breaking of the chosen cryptographic hash function will have on the application. With this information under your belt, you can confidently judge when to use content-based addressing.

Cryptographic hash functions for programmers

In this section, we'll give a quick 'n' dirty overview of the aspects of cryptographic hash functions most relevant to programmers using content-based addressing. For those interested in a deeper dive, see the Further Reading section. A wealth of detail is available on the web these days, including most publications and pre-prints, and of course, there's always Applied Cryptography (although it's woefully out of date when it comes to cryptographic hash functions and many pine for a third edition).

The origin of cryptographic hash functions

Most programmers are familiar with the idea of cryptographically signing a document - you use an encryption algorithm to generate a signature using the document as input. Other people can verify your signature given the proper key. The problem is that signing is often a very slow and expensive calculation, proportional to the length of the input. Instead of signing the entire document, it would be nicer to sign a short, fixed-size signature of the document. This signature would have to be easy for everyone to calculate given the original document, but at the same time be very hard for anyone to find another document that has the same signature (in technical parlance, second pre-image resistance). It should also be hard to find a document that has a given signature (pre-image resistance), or any pair of documents with the same signature, ever (collision resistance).

Cryptographic hash functions were designed to fill this and similar needs. The major tradeoffs in their design, as visible to a programmer, are:

Often, the better the cryptograhic hash function, the larger the hash value, and the longer it takes to calculate. Hash functions designed to detecting random corruption are usually much faster and have shorter outputs.

Computational cost

To get a rough idea how much CPU time calculating cryptographic hashes requires, let's take a high level tour of MD5. (MD5 has been fairly thoroughly broken but makes a good example.) Dan Kaminsky describes the MD5 algorithm at a high-level in MD5 To Be Considered Harmful Someday:

In what's referred to as a Merkle-Damgard construction, MD5 starts with an arbitrary initial state 128 bits in length. 512 bits of input data are "stirred" into this 128 bit state, with a new, massively shuffled 128 bit value as the result. 512 more bits are constantly stirred in, over and over, until there's no further data. 64 bits more are appended to the data stream to explicitly reflect the amount of data being hashed with another round of MD5 being done if need be (if there wasn't enough room in a previous round to hash in that 64 bits), and the final 128 bit value after all the stirring is complete is christened the MD5 hash.
The stirring operation involves 64 rounds of operation on each 512 bits of the message (technically, 4 rounds of 16 operations each). The output of each round feeds into the next round - that is, the rounds can't be easily parallelized. Newer, more modern hash functions often use 80 or more rounds of mixing. The overall message is that a cryptographic hash function involves a significant computational cost and multiple operations for every block of data processed.

Mathematics and cryptographic hash functions

A prime number is always a prime number, and no advance in mathematical knowledge will ever change that. But the collision-resistance of a hash function changes over time as cryptography improves. Cryptographic hash functions, are, at heart, human technology, invented by humans to confound other humans limited by current technology and mathematical knowledge. (If you doubt this, ask yourself when the worldwide cryptography community stopped considering SHA-0 a one-way function - and when the NSA did.)

While cryptographic hash functions are mathematical functions, their collision-resistance properties are not mathematically proven - and are often disproven by example, when the hash is broken. Here's what Bruce Schneier has to say about one-way functions, which include cryptographic hash functions, on page 29 of Applied Cryptography:

One-way functions are relatively easy to compute, but significantly harder to reverse. [...] In this context, "hard" is defined as something like: It would take millions of years to compute x from f(x), even if all the computers in the world were assigned to the problem. [...] This sounds good, but it's a lot of smoke and mirrors. If we are being strictly mathematical, we have no proof that one-way functions exist, nor any real evidence that they can be constructed.

This doesn't mean that cryptographic hash functions or encryption algorithms aren't useful or don't work, it just means that their "one-wayness" is not a proven, unchanging mathematical property of the function.

Be wary of thinking of a cryptographic hash function as a magic box that you can stuff an input into and always get a totally unique fixed-size output in return, even if you read a mathematical-sounding definition describing it that way. MD5 no longer has this property, and most experts expect that SHA-1 will lose it within a few years.

Hash collisions

At the most basic level, a cryptographic hash function takes an input of almost any size, and collapses that to a N-bit output. There are far more inputs than outputs, and so inputs with the same outputs, hash collisions, are guaranteed. What makes a hash function a cryptographic hash function is that it is hard for humans, with present knowledge and technology, to find collisions.

How hard is "hard?" A good place to start is the brute force "birthday attack" - simply choose inputs randomly, hash them, and see if they collide. This finds a collision more quickly than seems intuitive. If there are 2N possible hashes, the intuitive guess is that hashing half that number of inputs - 2N - 1 - would result in a 50% chance of a collision. In actuality, only approximately the square root of the total number of hashes is needed - 2N/2 - to have a 50% chance of collision. For a concrete example, SHA-1 has 160 bits in its output, for 2160 possible hash values. The birthday attack would take about 280 hash calculations to have a 50% chance of finding a collision. The birthday attack is an upper bound for how "hard" it is to find collisions in a cryptographic hash function. For a good cryptographic hash function, no attacks are known that are significantly faster than the birthday attack.

One practical implication for the programmer is that the number of inputs being compared should be much smaller than the square root of the possible hash values. If the goal is to hash 2N inputs with negligible probability of an accidental collision between any two outputs, then the output of the hash function needs to have at least N*2 bits. This kind of error is not unknown. Several years ago, a programmer at a famous web company proudly explained to me how their web page cache worked. Each page was addressed by the MD5 hash of its contents, truncated to 64 bits (for convenient use of 64-bit data types). Since there were 264 possible hash values and many fewer than 264 web pages, he believed that the hashes would never collide in practice. I had a rough idea how many web pages existed at that time, but to be sure, I asked him how many pages were in the database. The answer: about 4 billion (232) pages, or roughly the square root of the possible hash values. The chance of a collision was actually about 50% when the programmer believed it was negligible. The web has grown since then; if the cache hadn't already suffered a collision, it almost certainly has since then.

Another intuitive error that comes from thinking of the probability of a collision as 1 in (astronomically large number of possible hash values) is the assumption that the chance of collision is completely random and independent. We're trained to think this way in introductory probability classes - just because the last five dice rolls came up sixes doesn't change the probability of the next roll coming up six. But a cryptographic hash function is deterministic - if an input hashes to six the first time, it will hash to six the next time too. The error creeps in when imagining what happens if you do have a collision - okay, there was a collision, but it was incredibly unlikely in the first place, and the chances of it happening twice are even more unlikely. This isn't true; once a collision, always a collision for that data set (unless you can change your hash function).

The life cycle of a cryptographic hash function

Enough cryptographic hash functions have evolved from state of the art to completely broken that some general trends in the life cycle of a cryptographic hash function have become clear, as summarized in the following table. Each stage takes a few months to a few years, with the stages colored green and yellow being those during which the hash function is reliably collision-resistant (as long as the NSA isn't involved); in all other stages the hash function is likely to suffer collisions.

Stages in the life cycle of cryptographic hash functions
Stage Expert reaction Programmer reaction Non-expert ("slashdotter") reaction
Initial proposal Skepticism, don't recommend use in practice Wait to hear from the experts before adding to OpenSSL SHA-what?
Peer reviewal Moderate effort to find holes and garner an easy publication Used by a particularly adventurous developers for specific purposes Name-drop the hash at cocktail parties to impress other geeks
General acceptance Top-level researchers begin serious work on finding a weakness (and international fame) Even Microsoft is using the hash function now Flame anyone who suggests the function may be broken in our lifetime
Minor weakness discovered Massive downloads of turgid pre-prints from arXiv, calls for new hash functions Start reviewing other hash functions for replacement Long semi-mathematical posts comparing the complexity of the attack to the number of protons in the universe
Serious weakness discovered Tension-filled CRYPTO rump sessions! A full break is considered inevitable Migrate to new hash functions immediately, where necessary Point out that no actual collisions have been found
First collision found Uncork the champagne! Interest in the details of the construction, but no surprise Gather around a co-worker's computer, comparing the colliding inputs and running the hash function on them Explain why a simple collision attack is still useless, it's really the second pre-image attack that counts
Meaningful collisions generated on home computer How adorable! I'm busy trying to break this new hash function, though Send each other colliding X.509 certificates as pranks Tell people at parties that you always knew it would be broken
Collisions generated by hand Memorize as fun party trick for next faculty mixer Boggle Try to remember how to do long division by hand

In more detail, a cryptographic hash function begins life as an initial proposal. At this point, few cryptographers have examined it, and it is considered unreliable; many proposed cryptographic hash functions are cut down in their infancy, broken by another cryptographer soon after they are proposed. As a cryptographic hash function is reviewed and tested by more cryptographers without weaknesses being found, early adopters begin using it in practice and more attention is focused on it. If the hash survives the increased scrutiny that comes along with growing popularity, it enjoys a period of maturity and trust for some years. However, the same popularity that creates confidence in a cryptographic hash also motivates researchers to work on breaking the function. After a few years, a weakness is discovered that allows an attack of a few orders of magnitude less complexity than a brute force attack. At this point, no practical attack exists, but experts view it as a warning sign that a stronger attack is imminent, and begin recommending moving away from the previously pristine hash function. One of the more extraordinary instances of this occurred in 1995, when the NSA recommended moving away from SHA-0 (then known as just plain SHA) and to the slightly modified SHA-1, without explaining exactly why. (SHA-0 was fully broken in 2004.) Around this time, non-experts scoff and compare the complexity of the attack to the number of atoms in the universe or some other impressive physical quantity.

Within a few years of the first theoretical weakness, stronger weaknesses are discovered, with the result that a slightly simplified version of the hash function is vulnerable to an attack using the equivalent of several months of time on a large supercomputer. At this point, the experts recommend dropping the use of this hash immediately, although no practical attack yet exists. Non-experts point out that several months of supercomputer time are hard to get and that normal people have nothing to worry about. Within a few months or years of the discovery of a strong weakness, some ambitious researcher discovers a practical way to generate collisions between fairly meaningless random-looking inputs. Experts declare the hash very broken indeed, open bottles of champagne, and clap the researcher on the shoulder. Non-experts point out that collisions can only be generated in random-looking data, not data that looks like contracts or certificates, so there's still nothing to worry about. Not long after that, improvements in the attack allow some bored programmer to write a program to generate collisions between meaningful inputs on a home computer and release it as open source. Experts don't have any official reaction because they are too busy reviewing new cryptographic hash functions.

Life cycles of popular cryptographic hashes

This chart shows the various life stages of the better-known hashes. You might notice a lot of changes happening in 2004; this is the year that Xiaoyun Wang, et al. published a paper named, simply enough, Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD. (I'd heard rumors that MD5 had been broken and watched the webcast of the announcement while eating popcorn.) One of my favorite quotes from this paper: "Our attack can find collision [in MD4] with hand calculation." This introduces another stage in the life cycle of a cryptographic hash function: The stage when finding collisions doesn't even require a pocket calculator.

Life cycles of popular cryptographic hashes (the "Breakout" chart)
RIPEMD-128 [1]                                        
SHA-2 family                                        
[1] Note that 128-bit hashes are at best 2^64 complexity to break; using a 128-bit hash is irresponsible based on sheer digest length.
The Hash Function Lounge has an excellent list of references for the dates.

Unbroken cryptographic hash functions

At present, the only hash widely recommended for secure use is the SHA-2 family, which includes SHA-256 and other variants. Most of the other popular cryptographic hashes have been broken or weakened.

Some of the less popular algorithms have not been broken, but often these algorithms are less popular for good reasons. For example, several methods for securely converting a block cipher into a cryptographic hash function are known, such as the Davies-Meyer technique. Unfortunately, the output is the size of the output of the block cipher, usually 64 bits, although modifications exist for doubling the the length to 128 bits. 160 bits is considered the current rock bottom for safe hash output length, with some advocating for 256 bits. Most of the schemes for both converting block ciphers into cryptographic hash functions and increasing their output length have been broken. The performance of block cipher-based hashes tends to be significantly slower than purpose-built cryptographic hashes; Table 18.2 on page 456 of Applied Cryptography compares the performance of Davies-Meyer (using DES), a 128-bit variant of Davies-Meyer (using IDEA), another block cipher-based hash named GOST, and 12 other hash functions. The block cipher-based hashes are the three lowest performing hashes in this evaluation.

Back in the realm of purpose-built cryptographic hash functions, RIPEMD-128 is considered too short by many experts. RIPEMD-160 is a worthy competitor to the SHA-2 family, but has received comparatively little scrutiny and is less trusted. The RIPEMD functions have the simultaneous advantage and disadvantage of being designed entirely outside the U.S.-controlled NSA.

Amateur cryptography

Be wary of inventing a new hash function - that is, adding a salt, double hashing, concatenating hashes, XORing the results, steganographically embedding them into LOLcats pictures, etc. Designing cryptographic hashes is extremely hard, as evidenced by the names MD5 (following MDs 1 through 4), RIPEMD-160 (RIPEMD and RIPEMD-128), and SHA-2 family (SHA and SHA-1). Adding a few operations to a hash function or combining existing functions is a tempting do-it-yourself fix, but usually adds only an insignificant amount of complexity to the attack, and can easily reduce the complexity of attack. From the abstract of the CRYPTO 2004 paper Multicollisions in iterated hash functions by Joux (one of the major contributors to breaking SHA-0):

[...] We solve a long standing open problem and prove that concatenating the results of several iterated hash functions in order to build a larger one does not yield a secure construction.
The basic idea is that, for iterated hash functions, the ability to find hash collisions for n specific hash values allows you to generate 2n inputs that all hash to the same value. That's 2n candidates for collisions in the second hash function, for the work of finding only n hash collisions in the first hash function - in other words, a whole lot of candidates for very little work. Concatenating the hash value to the input, rehashing, and concatenating both outputs is also examined in this paper and found to be subject to the same attacks. It's much smarter to use a hash designed to have 256 bits than to construct one out of 128 or 160 bit hashes.

Most programmers would not try to invent or modify cryptographic encryption algorithms, but don't yet have the same sense of caution with regard to cryptographic hash functions. If you're still determined to do this, read the entire multicollisions paper carefully, check to see if your scheme is described in Applied Cryptography, and try to get a cryptographer to review your design.

In summary, the characteristics of cryptographic hashes most important to programmers are that they are much more expensive to compute than normal hashes, and that the collision-resistance property is not unchanging, but degrades over time.

When is content-based addressing the right choice?

Now that we have the necessary background, we can explore the solution space and figure out when content-based addressing is an advantage. In short, CBA makes sense when it gives some advantage (simpler code, better performance, improved security), and is either safe from a deliberate attack or can be safely upgraded to a new cryptographic hash function if it comes under attack.

Does using CBA confer some advantage?

The first question to ask is if CBA confers some advantage for a particular application - does it reduce code complexity or improve performance, or does it make things worse? I have read or heard several stories about some use of CBA that actually made a system measurably worse - slowed down a disk cache, tried to write out only modified memory to disk when the cost of hashing was greater than the I/O, or calculated cryptographic hashes of data that was almost always different and had to be transferred anyway. It's worth the time to do some high-level review and back-of-the-envelope calculations to find out whether a CBA-based implementation is likely to pay off.

Reduced code complexity

Reduction in code complexity is frequently a major benefit of CBA. It is often far simpler to use a library-provided cryptographic hash function and compare the results than to write a full hash table implementation optimized for a particular use case or keep track of changes to the data.

Improved performance

The basic performance costs of CBA are computing the cryptographic hashes of the data, comparing the hashes (which includes getting the two sets of hashes in the same place if necessary), and transferring or storing any changed data. The benefits gained by using CBA must outweigh these costs, but that tradeoff depends on what fraction of the data is changed, the bandwidth of the IO channel, the size of the hashes, and whether the hashes can be calculated once and reused multiple times. Note that some applications require a cryptographic hash of the data anyway for security reasons, such as a secure distributed file store that doesn't fully trust the storage nodes. In this case, the computational cost of the hash is irrelevant.

Fraction of data changed: The performance of CBA will depend heavily on what percentage of data is different. Let's look at some easy cases first. The simplest case is when the data is always different. In that case, computing the hash is an unnecessary overhead and should only be included if there is some compelling security threat that it would solve. The next simplest case is when the data is always the same. In this case, the comparison is purely between the time it takes to compute, transfer, and compare the hashes, versus the time it takes to transfer the entirety of the data. If the data being compared is in local memory, it is definitely faster to compare it directly than to run 80 rounds of computation for every 512 bits of it. If the data being compared is on a local disk, it's likely to be a win to compare it directly or use a traditional hash table, depending on the ratio of your disk bandwidth and the speed of your CPU. If the data being compared is on opposite sides of a shared low-bandwidth satellite link, CBA is the hands-down winner. Conceptually, the comparison is between the "bandwidth" of the hashing function - how many bytes per second it can process - and the bandwidth of the I/O link (memory bus, I/O bus, network link) between the two pieces of data being compared. The method with more bandwidth is likely to be better.

For the complicated (and common) case in which the data is partly changed and the bandwidth of the hashing function is higher than the bandwidth of the I/O link, the tradeoff point depends on the percentage of data altered and the relative bandwidths of the hash and the I/O (and to a much lesser extent, the size of the hash output). It most cases, there will be a clear advantage in one direction or another after a few minutes of hand calculations - just figure out the hash bandwidth, the I/O bandwidth, and pick a few reasonable values for percentage of data changed. If you want to get fancy, factor in the cost of transferring the hashes over the I/O link.

Measuring bandwidth: Finding the bandwidths of your hash functions and I/O links is easier than ever before. The OpenSSL library has a built-in benchmark command:

$ openssl speed
Selected output from my laptop:

Bandwidth of selected hash functions (KB/s)
Hash 16 byte inputs 64 byte inputs 256 byte inputs 1024 byte inputs 8192 byte inputs
MD2 858.46 1789.04 2515.67 2797.48 2927.39
MD4 7991.10 28268.68 76263.37 141054.08 185523.89
MD5 7002.36 23954.07 68119.32 124396.91 168473.60
RIPEMD-160 5758.02 16670.98 37134.21 52536.75 59553.02
SHA-1 6860.21 21475.32 52002.86 81293.41 97357.06
SHA-256 4256.00 9803.35 18622.66 23416.18 25162.76
SHA-512 2499.55 10173.91 18949.99 28710.90 33324.71

bonnie++ or a simple cat file > /dev/null will find the bandwidth to your local hard drive, and netperf will report the bandwidth of your network. Be sure to choose the correct block, file, or packet sizes to evaluate as bandwidth varies wildly according to the size of the data element being processed or transferred.

Long-term storage of hashes: CBA is often used for addressing and deduplicating data for storage systems. In this case, the hash values are computed once when the data is first stored and reused many times, so the cost of computing the hashes is amortized over many operations. At the same time, storing the hashes will increase code complexity, reducing one of the benefits of CBA.

Alternatives to evaluate: Sometimes the more mundane methods, such as traditional hash tables, traditional differencing, source control systems, and compression, are overlooked when they would actually perform better than CBA. Computing a very cheap hash and then comparing the full data for anything with a hash collision can be faster than computing an expensive cryptographic hash. In the same vein, another strategy is to use a series of progressively stronger hashes, such as the rolling hash used in the first stages of rsync. Keeping track of the differences between two versions (rather than a fully stateless approach) and sending only those differences is another traditional option. Besides the well known but limited diff tool, xdelta computes efficient differences between files in any format. Compression can change both sides of the equation, depending on the ratio between the "bandwidth" of compression and the I/O bandwidth. Again, a little hand computation will often determine the fastest algorithm quickly.

An illustrative example is rsync, the remote file synchronization program. When synchronizing files between local and remote hosts, rsync uses CBA by default. But when the files to be synchronized are in what looks like a local file system to rsync (such as /home/val/src/), it sends the whole file instead of finding and sending only the changed parts of the file. This heuristic is usually right, but can occasionally be fooled by really high bandwidth networks or really low bandwidth disks. From the man page for rsync:

       -W, --whole-file
              With this option the incremental rsync algorithm is not used and
              the  whole  file  is  sent  as-is  instead.  The transfer may be
              faster if this option is used when  the  bandwidth  between  the
              source  and destination machines is higher than the bandwidth to
              disk  (especially   when  the  disk  is   actually  a  networked
              filesystem).  This is the default when both the source and des -
              tination are specified as local paths.

What happens when the chosen cryptographic hash function is broken?

Given that, historically, popular cryptographic hash functions have a useful lifetime of around 10 years, it is only prudent to plan for the eventuality of a successful attack on the hash function. For some applications, the ability to intentionally generate hash collisions has no significant effect. For others, it will result in data corruption or replacement with malicious data.

Trivial or no effect

In some systems, the consequences of a collision are trivial or unnoticeable. If the color value of a single pixel is wrong, or the wrong font is displayed, many people will simply not notice. Search engines make absolutely no guarantees about accuracy and users expect none; returning the wrong page for a search may grate on the programmer's conscience but have no practical impact.

Only trusted users

Many systems today use CBA with thoroughly broken cryptographic hashes; rsync uses MD4 and many archival document storage systems use MD5. Yet these systems continue to be practical and useful, because only trusted users who have no incentive to break the system are allowed to introduce data into the system. True, there will be certain useful data which your users will be unable to store; a real-world example is when cryptographers publish colliding inputs to prove a cryptographic hash function has been broken. (My laptop file system has contained files with colliding MD5 checksums since about the day after the first MD5 collision was made public.) Other systems only allow trusted users to add data to the system; the various CBA-based version control systems like git and Monotone fall into this category. If a user can create hash collisions in the system, they can also directly check in code to accomplish the same effects, so why worry about a fancy hash collision attack? Educate your users not to store colliding data intentionally and the problem is solved.

One way to judge if your system truly falls into this category is to imagine using it with a broken cryptographic hash function (with a sufficient number of bits) and see if you find any problems. In fact, in this case it makes sense to use the fastest appropriate cryptographic hash function regardless of whether it has been broken.

Note that UNIX-style operating systems do not assume users are trusted. File systems and other operating systems services must assume that users are untrusted and may deliberately introduce hash collisions (sometimes not even maliciously, as in the case of verifying published collisions). In general, systems software will have many untrusted or downright malicious users, and will be used in unpredictable ways. Applications can specify valid users and inputs much more narrowly than systems software.

Corruption or security problems

In other systems, the ability to deliberately generate hash collisions results in data corruption or security holes. If untrusted users can add data to the system, then they can, in effect, replace one piece of data with another piece of data, with varying consequences. The degree to which the hash is broken affects the kind of attacks that are feasible (read more about collision resistance, preimage resistance, and second preimage resistance on Wikipedia), but simple thought experiments demonstrate the possibility of significant security problems from even the easiest attack, finding two random inputs that collide. The simplest version is a binary which checks a single bit value and does either the right thing or the wrong thing based on that bit; distribute it with one input that causes it to do the right thing and later replace that input with a colliding input that differs in that bit value (the rest of it can be junk). (Getting people to download your binary, run it, etc. are left as an exercise for the reader.)

The above contrived example is amenable to the easiest hash collision attack, one that can find only arbitrary collisions, but real-world examples of meaningful colliding inputs for MD5 are now legion. A few:

Serious but fixable

Systems with serious effects from hash collisions can still safely use CBA if they have a mechanism for upgrading the hash function when a successful attack appears immanent. Peer-to-peer file-sharing systems like BitTorrent are vulnerable to attack, since any untrusted user can serve pieces of files. But if the protocol allows for a change in cryptographic hash function, users can protect themselves by switching to the new cryptographic hash function and refusing to accept data from peers using the old cryptographic hash function. Other systems are harder to retrofit. File systems or archival services using CBA that store data from untrusted users would need to re-index their storage to change cryptographic hash functions, which requires reading every block in use, computing a new hash, and writing the new value out - a time-consuming and potentially risky operation. Just ask your sysadmin what she thinks about re-indexing in place your company's long-term archival storage.


So when does content-based addressing make sense for your application? Use CBA if:

Conversely, don't use CBA if:


This article could not have been written without the advice and criticism of many programmers and cryptographers over the years. In particular, I would like to thank Fred Douglis, Armando Fox, Yongdae Kim, and Aaram Yun for corrections and many improvements in format and clarity. All errors and inopportune turns of phrase remain, of course, my own.

About the author

Valerie Henson is the founder of VAH Consulting, a company specializing in Linux file systems consulting. She first became interested in cryptographic hash functions at OSDI '02 and would start a Fantasy Hash Function League if anyone else would play it with her. Her first published paper, An analysis of compare-by-hash, sparked debate about content-based addressing but little useful advice for programmers, for which she is trying to make amends.

Further reading

Copyright 2007 - 2009 Valerie Aurora