Summary by Shagun Sodhani
The structure of the dictionary is quite simple. Say we have to maintain a set $S$ of size $n$, with keys drawn from a universe $U$. The dictionary uses two hash tables, $T$ and $T'$, each consisting of $r$ slots (where $r$ is slightly larger than $n$), and two hash functions $h, h' : U \to [r]$. Every key $x \in S$ is stored either in cell $h(x)$ of $T$ or in cell $h'(x)$ of $T'$. Lookup therefore takes $O(1)$ time, as only two slots need to be checked. Deletion and updates also take $O(1)$ time: they first look up the key and then delete or update the corresponding value.
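Here is a minimal sketch of the two-table layout with $O(1)$ lookup and deletion. The names (`CuckooSet`, `T`, `T2`, `h`, `h2`) are my own, not from the paper, and the structure stores bare keys, matching the set $S$ above.

```python
class CuckooSet:
    def __init__(self, r, h, h2):
        self.r = r
        self.h, self.h2 = h, h2    # two hash functions U -> [r]
        self.T = [None] * r        # table 1: key x may sit in T[h(x)]
        self.T2 = [None] * r       # table 2: key x may sit in T2[h2(x)]

    def lookup(self, x):
        # A key can only live in T[h(x)] or T2[h2(x)], so two probes suffice.
        return self.T[self.h(x)] == x or self.T2[self.h2(x)] == x

    def delete(self, x):
        # Find the slot the key occupies (at most two probes) and clear it.
        if self.T[self.h(x)] == x:
            self.T[self.h(x)] = None
        elif self.T2[self.h2(x)] == x:
            self.T2[self.h2(x)] = None
```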
Insertion is more interesting and involved than the other operations because it uses what the paper calls "the cuckoo approach": a key kicks out other keys until it finds a nest for itself. To insert $x$, we check cell $h(x)$ of $T$. If it is free, we are done; otherwise we place $x$ in this cell, which makes the previous occupant "nestless". We then insert this displaced key into $T'$ in the same manner, and repeat until every key has found a nest. The number of iterations is bounded by a threshold, MaxLoop, to avoid closed loops; after MaxLoop iterations, a rehash is performed. A rehash is also performed once $r^2$ insertions have occurred since the last rehash. This is all one needs to know to implement one's own cuckoo hash, as in the sketch below. Simple and straightforward.
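Continuing the `CuckooSet` sketch above, the insertion loop could look like the following. `max_loop=32` is only a placeholder, not the paper's value, and the rehash itself (choosing fresh hash functions and re-inserting every key) is not shown.

```python
def insert(s, x, max_loop=32):
    """Insert key x into a CuckooSet s (sketch; max_loop is a placeholder)."""
    if s.lookup(x):
        return True
    for _ in range(max_loop):
        i = s.h(x)
        x, s.T[i] = s.T[i], x      # x nests in T[i]; the old occupant becomes nestless
        if x is None:
            return True            # the slot was free, so every key has a nest
        j = s.h2(x)
        x, s.T2[j] = s.T2[j], x    # the nestless key tries its cell in T2
        if x is None:
            return True
    return False                   # MaxLoop reached: rehash with new h, h2 needed
```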
### Digging Deeper
The approach may at first sound like it merely shifts the overhead from lookup to insertion, but there is more to it. We first need to understand what a universal family is.
A family of functions from $U$ to $R$ is $(c, k)$-universal if, for any $k$ distinct keys $x_1, \ldots, x_k \in U$ and any values $y_1, \ldots, y_k \in R$, the probability that a function chosen uniformly at random from the family maps each $x_i$ to $y_i$ is at most $c/|R|^k$. In effect, this limits the probability of collisions when hash functions are picked at random from the family. The paper cites work by Alan Siegel to explain how to construct a hash family such that the probability of collisions on a set of $r^2$ keys is $O(1/n^2)$.
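To make the definition concrete, here is the classic Carter-Wegman construction $h(x) = ((ax + b) \bmod p) \bmod r$, which is $(c, 2)$-universal for a small constant $c$. This is only an illustration: the paper relies on the much stronger Siegel-style families with higher independence, which are not shown here.

```python
import random

def random_hash(r, p=(1 << 61) - 1):
    # p is a prime larger than the universe of integer keys; the random
    # choice of (a, b) selects one function from the family.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % r
```

With this, `CuckooSet(r, random_hash(r), random_hash(r))` would supply the two independently chosen functions assumed in the sketches above.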
The paper also uses a clever trick to determine, before MaxLoop iterations are reached, that a rehash will be needed. If the insertion procedure returns to a previously visited cell, it checks whether a closed loop is about to form. Suppose we are inserting key $x_1$ and let $x_1, x_2, \ldots, x_p$ be the sequence of nestless keys. If one of these keys, say $x_i$, becomes nestless for the second time, it is put back in its original position. This moves all the previous keys in the sequence back to their original positions and makes $x_1$ nestless again, so this time $x_1$ is put into the other table, which in turn makes some keys in the second table nestless. If one of those keys, say $x_k$, moves to a previously visited position, we are guaranteed to have a closed loop. In that case, we do not wait for MaxLoop iterations to be reached and perform the rehash immediately.
Now we can work out the probability of the insertion loop running for at least $t$ iterations. We already know that $t$ cannot exceed MaxLoop. Adding up these probabilities over all possible values of $t$, accounting for the cost of rehashing, and taking $n$ to be very large, we get an amortized insertion time of $O(1)$; a rough version of the calculation is sketched below. For a detailed treatment of this part, refer to the original paper.
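As a back-of-the-envelope version of that argument: the paper shows (ignoring the closed-loop case, which is charged to the rehash) that the probability of running for at least $t$ iterations decays geometrically, say at most $c\beta^t$ for constants $c > 0$ and $0 < \beta < 1$ that depend on how much larger $r$ is than $n$. Then

$$\mathbb{E}[\text{iterations}] \;=\; \sum_{t \ge 1} \Pr[\text{iterations} \ge t] \;\le\; \sum_{t \ge 1} c\,\beta^{t} \;=\; \frac{c\,\beta}{1-\beta} \;=\; O(1).$$

The occasional rehash is expensive but rare enough that, amortized over insertions, it does not change the $O(1)$ bound.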
### My thoughts
The paper has very strong mathematical and experimental backing. It is full of references to both theoretical work and experimental evaluations. The analysis of insertion is clean and meticulous, and all the maths works out beautifully. I particularly liked the elaborate and well-documented experiments. The authors experimented with a variety of hashing algorithms along with variants of cuckoo hashing. They highlighted the instances where their implementation differs from the reference and provided justifications for the differences. They also studied the role of the cache in all the hashing techniques in their experiments, and they ran the dictionary tests of the DIMACS implementation challenge. In all experiments, linear probing performed better than all the other approaches, with cuckoo hashing lagging just 20–30% behind. Even then, the paper presents a strong case for cuckoo hashing.
A few things could still be explored further, such as how to choose a family of good hash functions that satisfies the constraints. Secondly, the cache miss rate tends to increase fastest for cuckoo hashing as the load factor grows. Also, cuckoo hashing cannot sustain a load factor above 50%, so in some cases the dictionary will have to be resized, and the experiments focus on situations where the size does not vary greatly. Finally, insertion is expected to be $O(1)$ only for very large values of $n$, and there is no estimate of how big $n$ needs to be. This could be pursued further.