#### What are BLOBs
Binary Large OBjects (BLOBs) refer to immutable binary data. A BLOB is created once, read many times, never modified, and sometimes deleted. In Facebook’s context, this includes photos, videos, documents, traces, heap dumps, and source code. Facebook originally used Haystack as its BLOB storage system, but Haystack is designed for IO-bound workloads, not for storage efficiency.
To take a more specific example, when we post a new picture, a BLOB is created. Its data is written to the BLOB storage system and a handle corresponding to this data is stored in a Graph Store for quick retrieval later on. Now when someone clicks on that picture, a read request is fired and the handle is fetched from the graph store. Using this handle, the data (in this case the picture) can be retrieved. A CDN is in place to cache frequently accessed BLOBs.
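To make the flow concrete, here is a minimal sketch of that create/read path, using in-memory dicts as stand-ins for the BLOB store, graph store, and CDN (all of the names and the handle format are made up for illustration):

```python
# Minimal sketch of the create/read path described above; dicts stand in for
# the BLOB store, graph store, and CDN, and the handle format is invented.

blob_store = {}   # handle -> bytes   (Haystack for hot BLOBs, f4 for warm)
graph_store = {}  # object id -> handle
cdn_cache = {}    # handle -> bytes   (caches frequently accessed BLOBs)

def create_blob(object_id: str, data: bytes) -> str:
    """Write-once create: store the bytes, record the handle in the graph store."""
    handle = f"vol42:{len(blob_store)}"       # made-up handle format
    blob_store[handle] = data
    graph_store[object_id] = handle
    return handle

def read_blob(object_id: str) -> bytes:
    """Read path: fetch the handle from the graph store, then try the CDN first."""
    handle = graph_store[object_id]
    if handle in cdn_cache:
        return cdn_cache[handle]
    data = blob_store[handle]
    cdn_cache[handle] = data                  # populate the cache on a miss
    return data

# Posting a picture creates a BLOB; clicking on it triggers a read.
create_blob("photo:123", b"\x89PNG...")
assert read_blob("photo:123") == b"\x89PNG..."
```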
#### BLOB Storage Design
BLOBs are aggregated into logical volumes. These volumes are initially unlocked and support reads, writes, and deletes. Once a volume is full, it gets locked and no more creates are allowed. Each volume consists of three files: a data file that holds the BLOBs and their metadata, an index file that is a snapshot of the storage machine's in-memory index, and a journal file that tracks which BLOBs have been deleted. For fault tolerance and performance, each logical volume maps to multiple physical volumes, with each physical volume living entirely on one Haystack host. Haystack tolerates disk, host, rack, and data center failures, at an effective-replication-factor of 3.6.
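As a rough illustration of this layout (the field names and in-memory structures here are mine, not Haystack's actual on-disk format), a volume can be modeled like this:

```python
# Illustrative model of a logical volume's three files; the real data and
# index files are binary on-disk structures, not Python dicts.
from dataclasses import dataclass, field

@dataclass
class Volume:
    volume_id: int
    locked: bool = False                          # set once the volume is full
    data: dict = field(default_factory=dict)      # data file: blob id -> (metadata, bytes)
    index: dict = field(default_factory=dict)     # index file: blob id -> location in data file
    journal: list = field(default_factory=list)   # journal file: ids of deleted BLOBs

    def create(self, blob_id, metadata, payload):
        if self.locked:
            raise RuntimeError("volume is locked; creates are no longer allowed")
        self.data[blob_id] = (metadata, payload)
        self.index[blob_id] = len(self.data) - 1  # stand-in for a byte offset

    def delete(self, blob_id):
        self.journal.append(blob_id)              # the delete is journaled; data stays in place
```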
All seems good so far, but the problem is the large and ever-increasing storage footprint of the BLOBs. Facebook made an interesting observation in this regard: there is a large and increasing number of BLOBs with low request rates. For these BLOBs, triple replication is overkill. If only there were a separate BLOB storage system for these BLOBs. This is where f4 comes into the picture.
#### Warm vs Hot
Facebook benchmarked their existing implementation with a two-week trace and observed that temperature zones exist: some BLOBs have a very high request rate (these are called hot BLOBs) and some have a low request rate (these are called warm BLOBs). f4 is proposed as an implementation for warm BLOB storage, while Haystack continues to be used for hot BLOBs. Another observation was that the age and temperature of a BLOB are correlated: new BLOBs are queried and deleted at a much higher rate. Lastly, they observed that warm content is large and growing, which furthers the need for f4.
#### Design goals
f4 was designed to reduce the effective-replication-factor without compromising on reliability and performance.
#### Overview
f4 is composed of a number of cells, each cell residing completely in one data center. A cell is responsible for reliably storing a set of locked volumes and uses Reed-Solomon coding to store these volumes with a lower storage overhead. The downside is an increased rebuild and recovery time under failure and a lower maximum read throughput.
Since f4 cells support only read and delete operations, only data and index files are needed and both are read-only. For tracking deletes, all BLOBs are encrypted with keys stored in an external database. Deleting the key for a BLOB in f4 logically deletes it. The index files use triple replication within a cell and the data file is encoded and stored via a Reed-Solomon (n,k) code. The file is logically partitioned into contiguous sequences of n blocks, each of size b. For each such sequence of n blocks, k parity blocks are generated, making the size of the stripe n + k blocks. For a given block in a stripe, the other blocks in the stripe are called its companion blocks. BLOBs can be read directly from their data block. If a block is unavailable it can be recovered by decoding any n of its companion and parity blocks.
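To make the stripe terminology concrete, here is a small sketch of how data blocks group into RS(n, k) stripes. The parity math (Galois-field encoding) is omitted and the parity names are placeholders; the paper's cells use n = 10 and k = 4, i.e. 14 physical blocks per 10 data blocks, or a 1.4x overhead within a cell.

```python
# Sketch of RS(n, k) stripe layout: every n consecutive data blocks get k
# parity blocks, and any n of the n + k blocks in a stripe can rebuild a
# missing block. Parity values are placeholders here; real encoding uses
# Galois-field arithmetic.

def stripe_layout(num_data_blocks: int, n: int = 10, k: int = 4):
    stripes = []
    for start in range(0, num_data_blocks, n):
        data = list(range(start, min(start + n, num_data_blocks)))
        parity = [f"p{len(stripes)}-{i}" for i in range(k)]
        stripes.append({"data": data, "parity": parity})
    return stripes

# Companion blocks of data block 12 in a 40-block data file:
layout = stripe_layout(40)
stripe = layout[12 // 10]
companions = [b for b in stripe["data"] if b != 12]   # companion data blocks
parity = stripe["parity"]
# Any n = 10 of the surviving companion and parity blocks can rebuild block 12.
print(companions, parity)
```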
The block size for encoding is kept quite large (around 1 GB) because it decreases the number of BLOBs that span multiple blocks (reducing the I/O operations needed to read them) and reduces the amount of per-block metadata that f4 needs to maintain. But a very large block size would mean a larger overhead for rebuilding blocks.
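As a back-of-the-envelope illustration of that tradeoff (the 100 GB volume size is an arbitrary example, not a figure from the paper):

```python
# Rough arithmetic for the block-size tradeoff.
volume_size_gb = 100                            # arbitrary example volume

blocks_at_1gb = volume_size_gb // 1             # 100 entries of per-block metadata
blocks_at_10mb = volume_size_gb * 1024 // 10    # 10,240 entries of per-block metadata

# The flip side: with RS(10, 4), rebuilding one failed 1 GB block requires
# reading ~10 GB of companion/parity blocks, versus only ~100 MB at 10 MB blocks.
print(blocks_at_1gb, blocks_at_10mb)
```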
#### Components
1. Name Node: This node maintains the mapping between data blocks and parity blocks and the storage nodes that hold the actual blocks.
2. Storage Nodes: These nodes expose two APIs - an Index API that provides existence and location information for volumes, and a File API that provides access to data.
3. Backoff Nodes: These nodes are storage-less, CPU-heavy nodes that handle the online reconstruction of requested BLOBs (not the entire block); a sketch of this fallback read path follows the list.
4. Rebuilder Nodes: They are similar to Backoff nodes, but they handle the offline reconstruction of entire blocks.
5. Coordinator Nodes: These nodes are storage-less, CPU-heavy nodes that handle maintenance tasks, such as scheduling block rebuilding and ensuring an optimal data layout.
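Putting the storage and backoff nodes together, here is a hedged sketch of the fallback read path mentioned above; the class and method names are hypothetical, not f4's actual interfaces.

```python
# Sketch of a read inside a cell: serve from the storage node holding the data
# block; if the block is unavailable, a backoff node reconstructs only the
# requested BLOB's byte range (whole-block rebuilds happen offline on rebuilder
# nodes). All names here are illustrative.

class BlockUnavailable(Exception):
    """The storage node cannot serve this block (disk/host/rack failure)."""

class StorageNode:
    def __init__(self, blocks, failed=()):
        self.blocks, self.failed = blocks, set(failed)

    def read(self, block_id, offset, length):
        if block_id in self.failed:
            raise BlockUnavailable(block_id)
        return self.blocks[block_id][offset:offset + length]

class BackoffNode:
    def reconstruct(self, block_id, offset, length):
        # Stand-in for decoding the byte range from n surviving companion and
        # parity blocks of the stripe.
        return b"<reconstructed bytes>"

def read_blob_from_cell(handle, index, storage_node, backoff_node):
    block_id, offset, length = index[handle]    # location comes from the Index API
    try:
        return storage_node.read(block_id, offset, length)
    except BlockUnavailable:
        return backoff_node.reconstruct(block_id, offset, length)

# Example: block 3 has failed, so the read falls back to online reconstruction.
node = StorageNode(blocks={3: b"...."}, failed={3})
index = {"photo:123": (3, 0, 4)}
print(read_blob_from_cell("photo:123", index, node, BackoffNode()))
```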
A separate geo-replicated XOR coding scheme is also used to tolerate data center or geographic region failure. In this scheme, each volume/stripe/block is paired with a buddy volume/stripe/block in a different geographic region. An XOR of the buddy blocks (called the XOR block) is then stored in a third region. The effective-replication-factor comes out to 2.1.
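The XOR scheme is simple enough to sketch directly; the block contents and sizes below are toy values (real blocks are around 1 GB).

```python
# Geo-replicated XOR sketch: a block in region A and its buddy in region B are
# XORed, and the result is stored in region C. Losing any one region leaves
# two of the three blocks, which is enough to recover the third.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized blocks byte by byte."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

size = 32                                          # toy size; real blocks are ~1 GB
block_a = b"primary block, region A".ljust(size, b"\0")
block_b = b"buddy block, region B".ljust(size, b"\0")

xor_block = xor_blocks(block_a, block_b)           # stored in a third region

# If region A is lost, its block is recovered from the other two regions.
assert xor_blocks(block_b, xor_block) == block_a
```

The 2.1 figure follows from combining the two layers: the RS(10, 4) coding inside a cell costs 1.4x, and storing two buddy volumes plus their XOR (three coded volumes for two volumes of logical data) costs another 1.5x, giving 1.4 × 1.5 = 2.1.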
#### Are the results good?
Results are amazing! With a corpus of 65 PB, f4 will save Facebook over 39 PB and 87 PB of storage at effective-replication-factors of 2.8 and 2.1, respectively. All this comes with low latency and fault tolerance.
#### Lessons learnt
The existence of temperature zones is not unique to Facebook. Such zones would be present in all kinds of data at a large enough scale, with the line separating the zones depending on the request and delete rates. Since older data is likely to have a different query rate than newer data, efficient migration from hot to warm storage (before eventually moving data to cold storage) needs to be explored more. This also suggests that a single data management system cannot handle data of all ages, since the constraints on the data change over time. In Facebook’s context, any data written to Haystack was constrained by an at-most-one-writer requirement, while data on f4 was free of this constraint. So two different data models, each optimized for its own use case, could be used. So far we have seen data management systems based on the nature of the data (relational or NoSQL) or the nature of the queries (read- vs write-oriented). But this case study opens the door for a new kind of system that migrates data from one data model to another based on temperature zones. This is what Facebook ended up creating for this particular scenario.
This case study also reinforces the value of modular architecture. Facebook had a clear need for a separate data storage mechanism, but what made it possible was a modular architecture that allowed for easy migration of data from Haystack to f4. Facebook’s overall architecture seems designed to enable warm storage. For example, the caching stack caches the results related to the most popular content, which is expected to be newer content; Haystack handles most of the reads and deletes, thereby reducing the load on f4; and so on.
I will be looking out for similar case studies from other data giants as well. They are probably tackling this problem from an entirely different perspective. Whatever their approaches may be, it would be interesting to compare them with f4.