Theory of Computing
-------------------
Title : Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
Authors : Kai-Min Chung, Michael Mitzenmacher, and Salil Vadhan
Volume : 9
Number : 30
Pages : 897-945
URL : http://www.theoryofcomputing.org/articles/v009a030
Abstract
--------
Hashing is fundamental to many algorithms and data structures
widely used in practice. For the theoretical analysis of hashing,
there have been two main approaches. First, one can assume that the
hash function is truly random, mapping each data item independently
and uniformly to the range. This idealized model is unrealistic
because a truly random hash function requires an exponential number
of bits (in the length of a data item) to describe. Alternatively,
one can provide rigorous bounds on performance when explicit
families of hash functions are used, such as 2-universal or
$O(1)$-wise independent families. For such families, performance
guarantees are often noticeably weaker than for ideal hashing.

In practice, however, it is commonly observed that simple hash
functions, including 2-universal hash functions, perform as predicted
by the idealized analysis for truly random hash functions. In this
paper, we try to explain this phenomenon. We demonstrate that the
strong performance of universal hash functions in practice can arise
naturally from a combination of the randomness of the hash function
and the data. Specifically, following the large body of literature on
random sources and randomness extraction, we model the data as coming
from a "block source," whereby each new data item has some "entropy"
given the previous ones. As long as the R\'enyi entropy per data item
is sufficiently large, it turns out that the performance when choosing
a hash function from a 2-universal family is essentially the same as
for a truly random hash function. We describe results for several
sample applications, including linear probing, chained hashing,
balanced allocations, and Bloom filters.
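As a concrete illustration (not taken from the paper), a standard 2-universal family is the Carter-Wegman construction $h_{a,b}(x) = ((ax+b) \bmod p) \bmod m$ for a prime $p$ larger than every key. The sketch below draws one such function at random and uses it for chained hashing; the prime, table size, and keys are arbitrary choices for the example.

```python
import random

def make_hash(m, p=2_147_483_647, rng=random):
    """Draw h(x) = ((a*x + b) mod p) mod m from the Carter-Wegman
    2-universal family; p must be a prime exceeding every key.
    (p = 2^31 - 1 is a Mersenne prime, convenient for 32-bit keys.)"""
    a = rng.randrange(1, p)   # a is nonzero
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

# Chained hashing with one randomly drawn 2-universal function.
m = 16
h = make_hash(m)
table = [[] for _ in range(m)]
for key in [3, 1_000_003, 42, 7, 99]:
    table[h(key)].append(key)
```

For any two distinct keys, the probability (over the random choice of $a, b$) of a collision is at most about $1/m$, which is the 2-universality property the paper's analysis starts from.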

Towards developing our results, we prove tight bounds for hashing
block sources, determining the entropy required per block for the
distribution of hashed values to be close to uniformly distributed.
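For context, the classical benchmark here is the Leftover Hash Lemma (a standard fact, stated below in one common form; the paper's contribution is tightening and extending such bounds to block sources): if $\mathcal{H}$ is a 2-universal family of functions from $\{0,1\}^n$ to $\{0,1\}^m$ and $X$ has R\'enyi entropy
\[
  H_2(X) \;\geq\; m + 2\log(1/\varepsilon),
\]
then for a uniformly random $H \in \mathcal{H}$ the pair $(H, H(X))$ is $\varepsilon$-close in statistical distance to $(H, U_m)$, where $U_m$ is uniform on $\{0,1\}^m$. Thus roughly $2\log(1/\varepsilon)$ bits of entropy beyond the output length already suffice for near-uniform hashed values.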

The paper arose by merging the following two conference papers:
"Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream"
by M.M. and S.V. [Proc. 19th Ann. ACM-SIAM Symposium on Discrete Algorithms
(SODA), pages 746-755, 2008] (http://dl.acm.org/citation.cfm?id=1347164),
and "Tight Bounds for Hashing Block Sources" by K-M.C. [Proc. 11th
International Workshop, APPROX 2008, and 12th International Workshop,
RANDOM 2008 on Approximation, Randomization and Combinatorial
Optimization: Algorithms and Techniques, pages 357-370, 2008]
(http://dl2.acm.org/citation.cfm?id=1429822).