Welcome to ShortScience.org!
[link]
The paper captures the story of how Facebook created a high-performing, distributed key-value store on top of Memcache (which was, back then, a simple in-memory cache) and scaled it to support the world’s largest social network. In Facebook’s context, users consume much more content than they create, so the workload is read-intensive and caching helps reduce it. Facebook uses Memcache clusters where each Memcache instance is a demand-filled, look-aside cache: if a client requests data from Memcache and the data is not available, the client fetches the data from the database and populates the cache for further requests. Memcache, not being a loading cache, does not have to worry about the logic of retrieving data from the database and can easily be plugged into multiple databases. For write requests, the client issues an update command to the database and a delete command to Memcache. Deletion, being idempotent, is preferred over updating. While the overview seems pretty simple, there are more details in the actual implementation. Facebook considers and optimizes for three deployment scales — a single cluster of servers, data replication between different clusters, and spreading clusters across the world.

#### A single cluster of servers

These are characterized by a highly read-intensive workload with requests having a wide fan-out. Around 44% of requests contact more than 20 Memcache servers. For popular pages, this number spans well over 100 distinct servers. One reason for this is that each request returns a very small amount of data. For get requests, UDP performs better than TCP, and get errors are treated as cache misses, though no insertion is performed. This design choice seems practical as only 0.25% of requests fail due to late, dropped, or out-of-order packets. Though the response size is very small, the variation is quite large, with mean = 954 bytes and median = 135 bytes.
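The look-aside, demand-fill pattern described above can be sketched in a few lines. This is a toy model: plain dicts stand in for Memcache and the database, and all names are illustrative.

```python
# Hypothetical stand-ins for Memcache and the backing database.
cache = {}
db = {"user:1": "alice"}

def get(key):
    # Look aside: try the cache first.
    if key in cache:
        return cache[key]
    # On a miss, the *client* fetches from the database and fills the cache;
    # Memcache itself never talks to the database.
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value

def update(key, value):
    # Writes go to the database; the cached entry is deleted, not updated.
    # Deletes are idempotent, so retries and races are harmless.
    db[key] = value
    cache.pop(key, None)
```

Note that the write path deletes the cached entry rather than overwriting it, which is exactly the delete-over-update choice the summary describes.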
Set and delete operations are still performed over TCP (for reliability), though the connections are coalesced to improve efficiency. Within a cluster, data is distributed across hundreds of servers through consistent hashing. A very high request rate combined with large fan-out leads to all-to-all communication between Memcache servers and clients, and even a single server can become a bottleneck for many requests. Clients construct a DAG representing the dependencies between data so that more independent requests can be fired concurrently. Facebook also provides a standalone proxy called mcrouter that presents a Memcache server interface and routes requests/replies to/from other servers. Along with these, a sliding-window flow-control mechanism is provided to limit incast congestion.

#### Leases

Leases are used to address stale sets (when a web server writes a stale value into the cache) and thundering herds (when a key undergoes heavy read and write activity). When a client experiences a cache miss, Memcache gives it a lease (a 64-bit token bound to the requested key). This lease is verified by Memcache when the client tries to set the value. If Memcache receives a delete request for the key, the lease is invalidated and the value cannot be set. To mitigate thundering herds, Memcache returns a token only once every 10 seconds per key. If a read request comes within 10 seconds of a token being issued, the client is notified to retry after a short time, by which point the updated value is expected to be set in the cache. In situations where returning stale data is acceptable, the client does not have to wait, and stale data (at most 10 seconds old) is returned.

#### Memcache Pools

Since different workloads can have different access patterns and requirements, Memcache clusters are partitioned into pools. The default pool is called wildcard, and there are separate pools for keys that cannot reside in the wildcard pool.
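The lease mechanism described above can be sketched roughly as follows. This is a simplified model: the 10-second interval comes from the paper, but the names, data structures, and token handling are illustrative assumptions.

```python
import secrets
import time

LEASE_INTERVAL = 10  # seconds between token grants per key (per the paper)

leases = {}  # key -> (token, issue_time)
cache = {}

def get_with_lease(key, now=None):
    """On a miss, hand out at most one lease token per key per interval."""
    now = time.time() if now is None else now
    if key in cache:
        return cache[key], None
    token, issued = leases.get(key, (None, -LEASE_INTERVAL))
    if now - issued >= LEASE_INTERVAL:
        token = secrets.token_hex(8)  # 64-bit token bound to this key
        leases[key] = (token, now)
        return None, token            # caller should fetch from DB and set
    return None, None                 # caller should retry shortly (or accept stale)

def set_with_lease(key, value, token):
    """A set succeeds only if the lease is still valid (not invalidated)."""
    current = leases.get(key)
    if current and current[0] == token:
        cache[key] = value
        del leases[key]
        return True
    return False

def delete(key):
    # A delete invalidates any outstanding lease, preventing stale sets.
    cache.pop(key, None)
    leases.pop(key, None)
```

A second miss within the interval gets neither a value nor a token, which is what tells the client to back off instead of stampeding the database.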
A small pool can be provisioned for keys for which a cache miss is not expensive. Data is replicated within a pool when the request rate is quite high and the data can easily fit into a few machines. In Facebook’s context, replication seems to work better than sharding, though it has the additional overhead of maintaining consistency. If a few Memcache servers fail, a small set of machines, called gutters, take over. In case of more widespread failures, requests are diverted to alternate clusters.

#### Regions

Multiple frontend clusters (web and Memcache clusters) along with a storage cluster (database) define a region. The storage cluster has the authoritative copy of the data, which is replicated across the frontend clusters. To handle data modifications, invalidation daemons called mcsqueal are deployed on each database; they parse the queries, extract and group the delete statements, and broadcast them to all the frontend clusters in the region. The batched delete operations are sent to mcrouter instances in each frontend cluster, which then unpack the individual deletes and route them to the concerned Memcache server. As an optimisation, a web server which modifies its data also sends invalidations to its own cluster to ensure read-after-write semantics for a single user request.

#### Cold Cluster Warmup

When a new cluster is brought online, it takes time to get populated, and initially the cache hit rate is very low. So a technique called Cold Cluster Warmup is employed, where a new cold cluster is populated with data from a warm cluster instead of the database cluster. This way the cold cluster comes to full capacity within a few hours. But additional care is needed to account for race conditions. One example: a client in the cold cluster makes an update, and before this update reaches the warm cluster, another request for the same key is made from the cold cluster; the item in the cold cache would then be indefinitely inconsistent.
To avoid this, Memcache rejects add operations for 2 seconds (the holdoff time) once a delete operation has taken place. So if a value is updated in a cold cluster and a subsequent request is made within 2 seconds, the add operation is rejected, indicating that the data has been updated. 2 seconds is chosen as most updates seem to propagate within this time.

#### Across-Region Consistency

Clusters comprising a storage cluster and several frontend clusters are deployed in different regions. Only one of the regions contains the master database, and the rest act as replicas. This adds the challenge of synchronisation. Writes from a master region send invalidations only within the master region; sending invalidations outside may lead to a race where deletes arrive before the data is replicated. The mcsqueal daemon helps avoid this, at the cost of serving stale data for some time. Writes from a non-master region are handled differently. Suppose a user updates his data from a non-master region with a large replication lag. A cache refill from a replica database is allowed only after the replication stream has caught up. A remote marker mechanism is used to minimise the probability of reading stale data: the presence of a marker indicates that the data in the replica is stale, and the query is redirected to the master region. When a web server updates a key k, it sets a remote marker rk in the region, performs the write to the master database for key k, and deletes k in the local cluster. The next time it tries to read k, it will experience a cache miss, will check whether rk exists, and will redirect its query to the master cluster. Had rk not been set, the query would have gone to the local cluster itself. Here latency is introduced to make sure the most updated data is read.

#### Single Server Improvements

Facebook introduced many improvements for Memcache servers running as single instances as well.
This includes extending Memcache to support automatic expansion of the hash table without the look-up time drifting to $O(n)$, making the server multi-threaded using a global lock, and giving each thread its own UDP port to reduce contention. Memcache uses an Adaptive Slab Allocator which organizes memory into slab classes — pre-allocated, uniformly sized chunks of memory. Items are stored in the smallest slab class that can fit the data. Slab sizes start at 64 bytes and reach up to 1 MB. Each slab class maintains a free-list of available chunks and requests more memory in 1 MB slabs when the free-list is empty. If no more free memory can be allocated, eviction takes place in Least Recently Used (LRU) fashion. The allocator periodically rebalances slab assignments to keep up with the workload: memory is freed in less-used slab classes and given to more heavily used ones. Most items are evicted lazily from memory, though some are evicted as soon as they expire. Short-lived items are placed into a circular buffer of linked lists (indexed by seconds until expiration) — called the Transient Item Cache — based on the expiration time of the item. Every second, all of the items in the bucket at the head of the buffer are evicted and the head advances by one. By adding a short expiration time to a heavily used set of keys whose items have short useful lifespans, the proportion of the Memcache pool used by this key family was reduced from 6% to 0.3% without affecting the hit rate.

#### Lessons Learnt

Memcache’s design decisions are driven by data and not just intuition. Facebook seems to have experimented with a lot of configurations before arriving at decisions like using UDP for read operations, and before choosing values for parameters like the holdoff time. This is how it should be — data-driven decisions.
In the entire process of developing Memcache, Facebook focused on optimizations which affect a good number of users and use cases instead of optimizing for each possible use case. Facebook separated the caching layer from the persistence layer, which makes it easier to evolve the two layers individually. By handling the cache insertion logic in the application itself, Facebook made it easier to plug Memcache into different data stores. By modeling the probability of reading stale data as a parameter, they allowed the latency and the hit rate on the persistence store to be configurable to some extent, thus broadening the scope of Memcache. Facebook breaks down a single page-load request into multiple requests, thereby allowing a different degree of stale-data tolerance for each component. For example, the servers serving the profile picture can cache content longer than the servers serving messages. This helps to lower the response latency and reduce the load on the data layer. Facebook’s Memcache is one of many solutions aimed at a rather common problem — a scalable, distributed key-value store. Amazon’s Dynamo solves it well for a write-intensive workload with small data sizes, and Facebook solves it for a read-intensive workload. Other contenders in the list are Cassandra, Voldemort, Redis, Twitter’s Memcache, and more. The sheer number of possible, well-known solutions suggests that there are many dimensions to the problem, some of which are yet unknown, and that we still do not have a single solution that can be used for all use cases.
[link]
Amazon’s platform is built upon different techniques working together to provide a single powerful, highly available system. One of the core components powering this system is Dynamo. Many services in the Amazon ecosystem store and retrieve data by primary key. Take the example of the shopping cart service: customers should be able to view and update their cart at any time. For these services, sophisticated solutions like an RDBMS, GFS, etc. are overkill, as these services do not need a complex query and data management system. Instead, they need a service which only supports read and write operations on a (key, value) store where the value is a small object (less than 1 MB in size) uniquely identified by the key. The service should be scalable and highly available with a well-defined consistency window. This is what Dynamo is: a scalable, distributed, highly available key-value store that provides an “always-on” experience.

#### Design Considerations

Dynamo achieves high availability at the cost of weaker consistency. Changes propagate to the replicas in the background, and conflict resolution is done at read time to make sure none of the write operations can fail. Dynamo uses simple policies like “last write wins” for conflict resolution, though applications using Dynamo may override these with their own methods. For example, the cart application may choose to add items across all versions to make sure none of the items is lost. A service could depend on multiple services to get its results. To guarantee that a service returns its results in a bounded time, each dependency in the service has to return its results with even tighter bounds. As a result, clients enter into a contract with servers regarding service-related characteristics like expected request distribution rate, expected latency, and so on. Such an agreement is called a Service Level Agreement (SLA) and must be met to ensure efficiency. SLAs apply in the context of Dynamo as well.
Dynamo supports incremental scaling, where the system is able to scale out one node at a time. Moreover, all the nodes are symmetric in the sense that they have the same set of responsibilities. Since Dynamo is used only by Amazon’s internal applications, there are no security-related requirements like authentication and authorization.

#### Architecture

Dynamo exposes two operations: get() and put(). get(key) returns the value or list of values, along with context objects, corresponding to the key. put(key, context, value) stores the value and the context corresponding to the key. Context objects are used for conflict resolution. To support incremental scaling, Dynamo uses consistent hashing as its partitioning scheme. In consistent hashing, the output range of a hash function is treated as a fixed circular space. Each node and data object is assigned a random value or position within this space. A data object is mapped to the first node placed clockwise of the position of the data object. Every data item is replicated at N hosts, so every time a data item is assigned to a node, it is replicated to that node’s N-1 clockwise successors as well. The list of nodes storing a data item is called its preference list. Generally the preference list contains more than N nodes to account for system and network failures. An example case is shown with N = 3: any key between A and B would be mapped to B (by the consistent hashing logic) and to C and D (by the replication logic). https://cdn-images-1.medium.com/max/800/1*66VMYcQfvG3Z2acQD7aeYQ.png

Each time data is created or updated, a new version of the data is created, so for a given key, several versions of the data (or value) can exist. For versioning, Dynamo uses vector clocks. A vector clock is a list of (node, counter) pairs. When a put operation reaches node X, the node uses the context from the put request to know which version it is updating.
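The consistent-hashing-plus-preference-list scheme just described can be sketched as follows. This is a toy version: real Dynamo assigns multiple virtual nodes ("tokens") per host, which this sketch omits, and MD5 is just one choice of ring hash.

```python
import bisect
import hashlib

N = 3  # replication factor

def _position(name):
    # Map a node name or key onto the fixed circular hash space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Nodes sorted by their position on the ring.
        self.ring = sorted((_position(n), n) for n in nodes)

    def preference_list(self, key):
        """The coordinator (first node clockwise of the key) plus its
        N - 1 distinct clockwise successors."""
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect_right(positions, _position(key)) % len(self.ring)
        return [self.ring[(start + j) % len(self.ring)][1] for j in range(N)]
```

Adding or removing one node only reassigns the keys between it and its neighbors, which is why incremental scaling is cheap.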
If the vector clock already has an entry for node X, its counter is incremented; otherwise a new entry is created for X with counter = 1. When retrieving the value corresponding to a key, the node resolves conflicts among all versions using Dynamo’s logic or the client’s logic. A likely issue with this approach is that the vector clock list may grow very large. To mitigate this, Amazon evicts pairs from the list, in ascending order of the time the entries were created, until the size drops below a threshold. Amazon has not faced any issues related to loss of accuracy with this approach. They also observed that the percentage of data with at least two versions is about 0.06%.

Dynamo uses a quorum system to maintain consistency. For a read (or write) operation to be successful, R (or W) replicas out of N must participate in the operation successfully, with the condition that R + W > N. If some of the first N replicas are not available, say due to a network failure, the read and write operations are performed on the first N healthy nodes. For example, if node A is down, node B can be included in its place in the quorum. In this case, B keeps track of the data it received on behalf of A, and when A comes back online, B hands this data over to A. This way a sloppy quorum is achieved. It is possible that B itself becomes unavailable before it can return the data to A. In this case, anti-entropy protocols are used to keep replicas synchronized. In Dynamo, each node maintains a Merkle tree for each key range it hosts. A Merkle tree is a hash tree whose leaves are hash values of individual keys and whose internal nodes are hash values of their children. Nodes A and B exchange the roots of the Merkle trees corresponding to the sets of keys they both host. This allows branches to be checked independently without traversing the entire tree: a branch is traversed only when the hash values at the top of the branch differ.
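The vector-clock versioning described above can be sketched as a minimal model. Here a clock is a dict mapping node id to counter; the context objects Dynamo would carry the clocks in are elided.

```python
def increment(clock, node):
    """Return a new clock with `node`'s counter bumped (created if absent)."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if version a is causally equal to or newer than version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def in_conflict(a, b):
    # Two versions conflict when neither descends from the other; these are
    # the cases Dynamo hands to its own or the client's reconciliation logic.
    return not descends(a, b) and not descends(b, a)
```

A clock that descends from another can silently replace it; only true forks need reconciliation.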
Comparing hashes top-down this way minimizes the amount of data that must be transferred for synchronization. The nodes in a cluster communicate via a gossip-based protocol in which each node contacts a random peer and the two nodes reconcile their persisted membership history. This ensures an eventually consistent membership view. Apart from this, some nodes are marked as seed nodes, which are known to all nodes including ones that join later. Seed nodes ensure that logical partitions are not created within the network even when new nodes are added. Since consistent hashing is used, the overhead of key reallocation when adding a new node is quite low.

#### Routing

There are two modes of routing requests in Dynamo. In the first mode, servers route the request. The node fulfilling the request is called the coordinator. For a read request, any node can act as the coordinator. For a write request, the coordinator is one of the nodes in the key’s current preference list; if the write request reaches a node which is not in the preference list, it routes the request to one that is. In the alternate approach, the client downloads the current membership state from any Dynamo node and itself determines which node to send the write request to. This approach saves an extra hop within the server cluster, but it assumes the membership state is fresh.

#### Optimizations

Apart from the architecture described above, Dynamo uses optimizations like read-repair, where, during a quorum read, any node that returns a stale response is updated with the latest version of the data. Similarly, since writes usually follow reads, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation. This increases the chances of read-your-writes consistency. To further reduce latency, each node maintains an object buffer in its main memory where write operations are stored before being written to disk by a separate thread.
Read operations also consult this in-memory buffer before checking the disk. There is an added risk of a node crashing before writing the objects from the buffer to the disk. To mitigate this, one of the N replicas performs a durable write — that is, the data is written to the disk. Since the quorum requires only W responses, the latency of this one node does not affect performance. Amazon also experimented with different partitioning schemes to ensure uniform load distribution and adopted the scheme where the hash space is divided into Q equally sized partitions and the placement of partitions is decoupled from the partitioning scheme.

#### Lessons Learnt

Although Dynamo is primarily designed as a write-intensive data store, N, R, and W provide ample control to modify its behavior for other scenarios as well. For example, setting R = 1 and W = N makes it a high-performance read engine; services maintaining product catalogs and promotional items can use this mode. Similarly, setting W = 1 means a write request is never rejected as long as at least one server is up, though this increases the risk of inconsistency. Given that Dynamo allows clients to override the conflict resolution methods, it becomes a general solution for many more scenarios than it was originally intended for. One limitation is the small size of data for which it is designed. The choice makes sense in the context of Amazon, but it would be interesting to see how storing larger values affects its performance. The response time would obviously increase, as more data needs to be transferred and the in-memory buffers would hold fewer objects. But with caching and larger in-memory buffers, the response time might be brought down to the point where Dynamo could be used with somewhat larger data objects as well.
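The N/R/W tunings above all revolve around one invariant: read and write quorums must overlap. A tiny sketch, with illustrative configuration values:

```python
def quorums_overlap(n, r, w):
    """R + W > N guarantees every read quorum intersects every write quorum,
    so a successful read always contacts at least one up-to-date replica."""
    return r + w > n

balanced   = quorums_overlap(3, 2, 2)  # a typical balanced setting
read_heavy = quorums_overlap(3, 1, 3)  # R = 1, W = N: high-performance read engine
lossy      = quorums_overlap(3, 1, 1)  # W = 1: writes never rejected, reads may be stale
```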
Dynamo scales well to a few hundred nodes, but it will not scale equally well to tens of thousands of nodes because of the large overhead of maintaining and distributing the routing table, whose size increases with the number of nodes. Another problem that Amazon did not have to face is a high conflict-resolution rate: they observed that around 99.94% of requests saw exactly one version. Had this number been higher, the latency would have been higher too. All in all, Dynamo is not a universal solution for a distributed key-value store. But it solves one problem, and it solves it very well.
[link]
#### What are BLOBs

Binary Large OBjects (BLOBs) refer to immutable binary data. A BLOB is created once, read many times, never modified, and sometimes deleted. In Facebook’s context, this includes photos, videos, documents, traces, heap dumps, and source code. Facebook was originally using Haystack as its BLOB storage system, but Haystack is designed for IO-bound workloads, not for storage efficiency. To take a more specific example: when we post a new picture, a BLOB is created. Its data is written to the BLOB storage system, and a handle corresponding to this data is stored in a Graph Store for quick retrieval later on. When someone clicks on that picture, a read request is fired and the handle is fetched from the Graph Store. Using this handle, the data (in this case the picture) can be retrieved. A CDN is in place to cache frequently accessed BLOBs.

#### BLOB Storage Design

BLOBs are aggregated into logical volumes. These volumes are initially unlocked and support reads, writes, and deletes. Once a volume is full, it gets locked and no more creates are allowed. Each volume consists of three files: a data file to hold the BLOBs and their metadata, an index file which is a snapshot of the in-memory index of the storage machine, and a journal file to track which BLOBs have been deleted. For fault tolerance and performance, each logical volume maps to multiple physical volumes, with each physical volume living entirely on one Haystack host. Haystack tolerates disk, host, rack, and data center failure at an effective replication factor of 3.6. All seems good so far, but the problem is the large and ever-increasing storage footprint of the BLOBs. Facebook made an interesting observation in this regard — there is a large and increasing number of BLOBs with low request rates, and for these, triple replication is overkill. If only there were a separate BLOB storage system for these BLOBs. This is where f4 comes into the picture.
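The create/read path described above can be sketched end to end. All the stores here are hypothetical dict stand-ins for the Graph Store, the CDN cache, and the Haystack BLOB store.

```python
cdn_cache = {}                                # CDN: caches hot BLOBs by handle
graph_store = {"photo:42": "handle-abc"}      # object id -> BLOB handle
blob_store = {"handle-abc": b"<jpeg bytes>"}  # handle -> immutable BLOB data

def read_blob(object_id):
    # 1. Fetch the handle for the object from the Graph Store.
    handle = graph_store[object_id]
    # 2. The CDN serves frequently accessed BLOBs directly.
    if handle in cdn_cache:
        return cdn_cache[handle]
    # 3. Otherwise go to the BLOB storage system and let the CDN cache it.
    data = blob_store[handle]
    cdn_cache[handle] = data
    return data
```

Because BLOBs are immutable, the CDN never has to worry about invalidation, only eviction.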
#### Warm vs Hot

Facebook benchmarked their existing implementation with a 2-week trace and observed that temperature zones exist: some BLOBs have a very high request rate (the hot BLOBs) and some have a low request rate (the warm BLOBs). f4 is proposed as an implementation for warm BLOB storage, while Haystack continues to serve hot BLOBs. Another observation was that the age and temperature of a BLOB are correlated — new BLOBs are queried and deleted at a much higher rate. Lastly, they observed that warm content is large and growing, which furthers the need for f4.

#### Design goals

f4 was designed to reduce the effective replication factor without compromising reliability and performance.

#### Overview

f4 comprises a number of cells, each residing completely in one data center. A cell is responsible for reliably storing a set of locked volumes and uses Reed-Solomon coding to store these volumes with a lower storage overhead. The downside is increased rebuild and recovery time under failure and a lower maximum read throughput. Since f4 cells support only read and delete operations, only data and index files are needed, and both are read-only. For tracking deletes, all BLOBs are encrypted with keys stored in an external database; deleting the key for a BLOB in f4 logically deletes it. The index files use triple replication within a cell, and the data file is encoded and stored via a Reed-Solomon (n, k) code. The file is logically partitioned into contiguous sequences of n blocks, each of size b. For each such sequence of n blocks, k parity blocks are generated, making the size of the stripe n + k blocks. For a given block in a stripe, the other blocks in the stripe are called its companion blocks. BLOBs can be read directly from their data block. If a block is unavailable, it can be recovered by decoding any n of its companion and parity blocks.
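As a quick sanity check on the arithmetic: with the paper's RS(10, 4) code (n = 10 data blocks, k = 4 parity blocks per stripe), the per-cell overhead works out as follows. The block size is set to 1 arbitrary unit; the 2.8 and 2.1 figures correspond to plain two-region replication of a coded cell and to the geo-replicated XOR scheme discussed next.

```python
n, k = 10, 4  # data and parity blocks per stripe (the paper's RS(10, 4) code)
b = 1         # block size, arbitrary units

stripe_size = (n + k) * b                    # physical blocks per stripe
single_cell = (n + k) / n                    # 1.4x overhead within one coded cell
two_regions = 2 * single_cell                # full second copy in another region -> 2.8
xor_scheme = 1.5 * single_cell               # two buddy cells + one XOR cell
                                             # spread over three regions -> 2.1
```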
The block size for encoding is kept quite large (around 1 GB) because it decreases the number of BLOBs that span multiple blocks, thereby reducing the number of I/O operations needed for a read, and it reduces the amount of per-block metadata that f4 needs to maintain. But a very large size would mean a larger overhead for rebuilding blocks.

#### Components

1. Name Node: maintains the mapping between data blocks and parity blocks and the storage nodes that hold the actual blocks.
2. Storage Nodes: expose two APIs, an Index API that provides existence and location information for volumes, and a File API that provides access to data.
3. Backoff Nodes: storage-less, CPU-heavy nodes that handle the online reconstruction of requested BLOBs (not entire blocks).
4. Rebuilder Nodes: similar to Backoff Nodes, but they handle the offline reconstruction of entire blocks.
5. Coordinator Nodes: storage-less, CPU-heavy nodes that handle maintenance tasks, such as scheduling block rebuilding and ensuring an optimal data layout.

A separate geo-replicated XOR coding scheme is also used to tolerate data center or geographic region failure. In this scheme, each volume/stripe/block is paired with a buddy volume/stripe/block in a different geographic region, and the XOR of the buddy blocks (called the XOR block) is stored in a third region. The effective replication factor then turns out to be 2.1.

#### Are the results good?

Results are amazing! With a corpus of 65 PB, f4 will save Facebook over 39 PB and 87 PB of storage at effective replication factors of 2.8 and 2.1 respectively. All this comes with low latency and fault tolerance.

#### Lessons learnt

The existence of temperature zones is not unique to Facebook. Such zones would be present in all kinds of data at a large enough scale, with the line separating the zones depending on the request and delete rates.
Since older data is likely to have a different query rate than newer data, efficient migration from hot to warm storage (and eventually to cold storage) needs to be explored more. This also suggests that a single data management system cannot handle data of all ages, as the constraints on the data change over time. In Facebook’s context, any data written to Haystack was constrained by an at-most-one-writer requirement, while data on f4 was free of this constraint. So two different data models, each optimized for its own use case, could be used. Until now we have seen data management systems based on the nature of the data (relational or NoSQL) or the nature of the queries (read- vs. write-oriented). But this case study opens the door for a new kind of system which migrates data from one data model to another based on temperature zones. This is what Facebook ended up creating for this particular scenario. The case study also reinforces the value of modular architecture. Facebook had a clear need for a separate storage mechanism, but what made it possible was a modular architecture which allowed for easy migration of data from Haystack to f4. Facebook’s overall architecture also appears designed to enable warm storage: for example, the caching stack caches the results related to the most popular content, which is expected to be newer content, and Haystack handles most of the reads and deletes, thereby reducing the load on f4, and so on. I will be looking out for similar case studies from other data giants as well. They are probably tackling this problem from entirely different perspectives; whatever their approaches may be, it would be interesting to compare them with f4.
[link]
It presents an informal discussion of the dos and don’ts of research — more specifically, the experimental study of algorithms. The author presents ten important principles backed by examples from his long experience. I am listing the points from the reading.

#### 1. Solve a problem worth solving.

Take some time to find a good problem. Decide what questions you want to answer before you actually start finding an answer. Finding a new solution will take time and resources, so think before you compute. Do exploratory research and experimentation, but do not get into an endless loop of exploration.

#### 2. Tie your paper to the literature.

Know the prior art for your problem. Find the existing solutions. Think about whether another solution is actually needed and, if yes, what needs to be improved. Tell how your work relates to the existing literature. This will help you and your readers, and it will make your experiments newsworthy.

#### 3. Use instance testbeds that can support general conclusions.

When doing research, especially experimental research, you are highly likely to use data — for making hypotheses, for validating results, and so on. The more general your data set is, the more generally applicable your results will be. Applying conclusions drawn from a specific data set to a general problem would be disastrous.

#### 4. Document everything.

Document your code. Document your data sets. Document your experimental setup, your results, your mistakes, your conclusions. Document everything that another person would need to replicate your experiments. This will help your future self save some time, and they will be grateful to you. Read #6. When using randomly generated data sets, use the same data set for benchmarking all approaches; it reduces time and variance. Use sampling and bootstrapping.

#### 5. Use reasonably efficient implementations.

Seems obvious, but efficiency requires effort, and we may be tempted to skip it, especially when execution time is not a metric of interest.
More efficient code means you can perform more experiments and on larger data sets. But do not over-optimize. Remember — premature optimization is the root of all evil.

#### 6. Ensure Reproducibility.

Provide information on how to reproduce your experiments. Read #4. See gitxiv. Share your code and data set with others. If the data set is generated using some algorithm, share its details. Use reproducible standards of comparison. Do not say “results after running the algorithm for 1 hour”: machines evolve every day, doing more operations than before in the same amount of time.

#### 7. Ensure Comparability.

Comparability goes beyond reproducibility and documentation. Benchmark the machine speed and the rest of the experimental setup that you use so that other researchers can accurately compare their results with yours. Take a full backup of your code and data.

#### 8. Report the full story.

Report the complete story, including the parts you do not like. Do not hide anomalous results; instead, try to explain them. It could lead to more insights into the problem and the solution. Report the full running time, even if time is not a metric of interest from the problem’s point of view. Many times readers want to see how your solution performs with respect to time.

#### 9. Draw well-justified conclusions and look for explanations.

Summarize trends, infer implications, and provide conclusions. If there are no conclusions and inferences, then you fail the newsworthiness test — probably you picked the wrong question or your solution is incomplete. Justify your conclusions with data. If needed, use techniques like profiling.

#### 10. Present Your Data in Informative Ways.

Your data representation should highlight what you want the reader to infer from the data. This can include things like an appropriate ordering of columns in a table or the choice of variables shown along the axes. If you want the reader to compare the percentage rise in time, provide a column for that.
Do not make the reader do the arithmetic. Take care not to overwhelm your readers with data alone. Put additional data in an appendix.
[link]
The structure of the dictionary is quite simple. Say we have to maintain a set $S$ of size $n$, with keys drawn from a universe $U$. The dictionary uses two hash tables, $T$ and $T'$, each consisting of $r$ slots (where $r$ is slightly larger than $n$), and two hash functions $h, h': U \to [r]$. Every key $x \in S$ is stored either in cell $h(x)$ of $T$ or in cell $h'(x)$ of $T'$. This means lookup takes $O(1)$ time, as only 2 slots need to be checked. Deletion and updation also take $O(1)$ time, as they first look up the key and then delete or update the corresponding value. Insertion is more interesting and involved than the other operations because it uses what the paper calls "the cuckoo approach", where a key kicks out other keys until it finds a nest for itself. To insert $x$, we check cell $h(x)$ of $T$. If it is free, we are done; otherwise, we still place $x$ in this cell, which makes the previous key "nestless". We then insert this newly displaced key into $T'$ in the same manner, iterating until all keys find a nest. The number of iterations is bounded by a constant, MaxLoop, to avoid closed loops. After MaxLoop iterations, a rehash is performed. A rehash is also performed once $r^2$ insertions have been performed since the last rehash. This is all one needs to know to implement their own cuckoo hash. Simple and straightforward. ### Digging Deeper The approach may at first sound like shifting the overhead of lookup onto insertion, but there is more to it. We first need to understand what a universal family is. A family of functions, each defined from $U \to R$, is $(c, k)$-universal if, for any $k$ distinct $x_1, \dots, x_k \in U$ and any $y_1, \dots, y_k \in R$, the probability that a randomly chosen function from the family maps each $x_i$ to $y_i$ is at most $c/|R|^k$. In effect, this limits the probability of collisions when hash functions are picked from the family at random. 
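The insertion procedure described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the multiplicative hashes stand in for a proper universal family, the rehash keeps the table size fixed, and the $r^2$-insertions rehash rule is omitted for brevity.

```python
import random

class CuckooHash:
    """Minimal cuckoo-hash sketch: two tables (T and T2, standing in for
    the paper's T and T'), two hash functions, O(1) lookup."""

    MAX_LOOP = 100  # bound on displacement iterations before rehashing

    def __init__(self, r=11):
        self.r = r
        self.T = [None] * r
        self.T2 = [None] * r
        self._pick_hashes()

    def _pick_hashes(self):
        # Toy stand-in for a universal family: random multiplicative hashes.
        a1 = random.randrange(1, 2**31)
        a2 = random.randrange(1, 2**31)
        self.h = lambda x: (a1 * x) % self.r
        self.h2 = lambda x: (a2 * x) % self.r

    def lookup(self, x):
        # Only two cells can possibly hold x, hence O(1) lookup.
        return self.T[self.h(x)] == x or self.T2[self.h2(x)] == x

    def insert(self, x):
        if self.lookup(x):
            return
        for _ in range(self.MAX_LOOP):
            slot = self.h(x)
            x, self.T[slot] = self.T[slot], x    # place x, evict occupant
            if x is None:
                return
            slot = self.h2(x)
            x, self.T2[slot] = self.T2[slot], x  # try the other table
            if x is None:
                return
        self._rehash(x)  # too many displacements: pick new hash functions

    def _rehash(self, pending):
        old = [y for y in self.T + self.T2 if y is not None] + [pending]
        self.T = [None] * self.r
        self.T2 = [None] * self.r
        self._pick_hashes()
        for y in old:
            self.insert(y)
```

Note how the swap in `insert` makes the displaced occupant the new key to place, which is exactly the "nestless key" chain the paper analyses.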
The paper cites a paper by Alan Siegel to explain how to construct a hash family such that the probability of collisions within a set of $r^2$ keys is $O(1/n^2)$. The paper also uses a clever trick to detect that a rehash will be needed before MaxLoop iterations are reached. If the insertion operation returns to a previously visited cell, it checks whether a closed loop is going to be formed. Suppose we are inserting key $x_1$ and let $x_1, x_2, \dots, x_p$ be the sequence of nestless keys. Suppose one of these keys, say $x_i$, becomes nestless for the second time. It is then put back into its original position, which causes all the previous keys in the sequence to move back to their original positions, and $x_1$ becomes nestless again. So this time it is put into the other table. This causes some keys in the second table to become nestless. If one of those keys, say $x_k$, moves to a previously visited position, we are guaranteed to have a closed loop. In that case, we do not wait for MaxLoop iterations and perform a rehash right away. Now we can work out the probability of the insertion loop running for at least $t$ iterations. We already know that $t$ cannot exceed MaxLoop. Adding up the probabilities for all possible values of $t$, accounting for the cost of rehashing, and taking $n$ to be very large, we get the amortized insertion time to be $O(1)$. For a detailed understanding of this part, refer to the original paper. ### My thoughts The paper has very strong mathematical and experimental backing. It is full of references to both theoretical work and experimental evaluation. The analysis of insertion is clean and meticulous, and all the maths works out beautifully. I particularly liked the elaborate and well-documented experiments. The authors have experimented with a variety of hashing algorithms along with variants of cuckoo hashing. They have highlighted the instances where their implementation differs from the reference and have provided justifications for the same. 
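To see why summing over $t$ gives a constant, here is a hedged back-of-the-envelope sketch. The constants $c$ and $\beta$ are placeholders, not the paper's exact values; the only assumption is that the probability of the insertion loop running for at least $t$ iterations decays geometrically.

```latex
% Assumption: \Pr[T \ge t] \le c\,\beta^{-t} for some constants c > 0, \beta > 1.
E[T] \;=\; \sum_{t \ge 1} \Pr[T \ge t]
      \;\le\; c \sum_{t \ge 1} \beta^{-t}
      \;=\; \frac{c}{\beta - 1} \;=\; O(1).
```

A rehash costs $O(n)$ but, by the $O(1/n^2)$ collision bound above, is triggered only with small probability per insertion, so for large $n$ it contributes only a vanishing amount to the amortized cost.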
They have also studied the role of the cache in all the hashing techniques in their experiments, and they tested with the dictionary tests of the DIMACS implementation challenge. In all experiments, Linear Probing performed better than all other approaches, with Cuckoo Hashing lagging just 20–30% behind. Even then, the paper presents a strong case for Cuckoo Hashing. A few things could still be explored further, like how to go about choosing a family of good hash functions that meets the constraints. Second, the cache miss rate tends to increase the fastest for cuckoo hashing as the load factor increases. Also, it is not possible to have a load factor of more than 50% for cuckoo hashing, so in some cases the dictionary will have to be resized; the experiments have focused on situations where the size does not vary greatly. Finally, insertion is expected to be $O(1)$ only for very large values of $n$, though there is no estimate of how big $n$ needs to be. This could be pursued further.