Guidelines for Using Compare-by-hash

Abstract

Recently, a new technique called compare-by-hash has become popular. Compare-by-hash is a method of content-based addressing in which data is identified only by the cryptographic hash of its contents. Hash collisions are ignored, with the justification that they occur less often than many kinds of hardware errors. Compare-by-hash is a powerful, versatile tool in the software architect's bag of tricks, but it is also poorly understood and frequently misused. The consequences of misuse range from significant performance degradation to permanent, unrecoverable data corruption or loss. The proper use of compare-by-hash is a subject of debate, but recent results in the field of cryptographic hash function analysis, including the breaking of MD5 and SHA-0 and the weakening of SHA-1, have clarified when compare-by-hash is appropriate. In short, compare-by-hash is appropriate when it provides some benefit (performance, code simplicity, etc.), when the system can survive intentionally generated hash collisions, and when hashes can be thrown away and regenerated at any time. In this paper, we propose and explain some simple guidelines to help software architects decide when to use compare-by-hash.
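As a rough illustration of the technique the abstract describes, the following minimal sketch stores blocks under the cryptographic hash of their contents and treats two blocks with the same digest as identical without comparing their bytes. It is not taken from the paper: the class name HashStore, its put, get, and rehash methods, and the choice of SHA-256 are all assumptions made for the example. The rehash method is meant to show the "hashes can be thrown away and regenerated at any time" condition, since the stored data, not the digest, remains the source of truth.

```python
import hashlib


class HashStore:
    """Toy content-addressed block store (hypothetical, for illustration only)."""

    def __init__(self, algorithm="sha256"):
        self.algorithm = algorithm  # assumed default; any hashlib algorithm name works
        self.blocks = {}            # hex digest -> block contents

    def _digest(self, data):
        return hashlib.new(self.algorithm, data).hexdigest()

    def put(self, data):
        key = self._digest(data)
        # Compare-by-hash: if the digest is already present, assume the
        # contents are identical and skip the byte-for-byte comparison.
        # A hash collision here would go unnoticed.
        if key not in self.blocks:
            self.blocks[key] = data
        return key

    def get(self, key):
        return self.blocks[key]

    def rehash(self, algorithm):
        """Discard every digest and regenerate it with another hash function."""
        self.algorithm = algorithm
        self.blocks = {self._digest(d): d for d in self.blocks.values()}


store = HashStore()
k1 = store.put(b"hello world")
k2 = store.put(b"hello world")  # duplicate: detected by digest alone
```

Because the store keeps the full contents of every block, moving to a different hash function only requires recomputing the keys. A design that kept digests but discarded the data they were computed from could not do this.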
Full paper: PS | PS.gz | PDF

Last updated: Thu Oct 21 01:25:42 MDT 2004

val@nmt.edu