A parameter server is more or less a distributed key-value store optimized for training machine learning models. For example, imagine we're learning a weight vector $w = (w_1, w_2, w_3)$ using logistic regression. We can distribute $w$ across two shards of the parameter server, where one shard stores $(1, w_1)$ and the other stores $(2, w_2)$ and $(3, w_3)$. Worker nodes can then read parts of the weight vector, perform some computation, and write back parts of the weight vector. This paper presents an optimized parameter server with the following features:

1. Efficient communication.
2. Flexible consistency models.
3. Elastic scalability.
4. Fault tolerance and durability.
5. Ease of use.

#### Machine Learning
Many machine learning algorithms try to find a weight vector $w \in \mathbb{R}^d$ that minimizes a regularized cost function of the following form:

$$ F(w) = \Omega(w) + \sum_{i=1}^n l(x_i, y_i, w) $$

When $n$ and $d$ get really big, it becomes intractable to run these algorithms on a single machine. Instead, we have to parallelize the algorithm across multiple machines. As an example, consider how to perform distributed batch gradient descent across $m$ workers. We initialize $w$ and store it in a parameter server. Each worker begins by reading $\frac{1}{m}$ of the training data. Then, every worker reads $w$ from the parameter server (actually, it only needs to read the relevant parts of $w$) and computes a gradient with respect to its local training data. Each worker then pushes its gradient to the parameter server. Once the server receives gradients from every worker, it aggregates them together and updates $w$. Finally, workers pull the updated $w$ and loop.

#### Architecture
A parameter server consists of a number of servers that store weights and a number of workers that perform computations with the weights (e.g. computing gradients). Servers are organized into a server group managed by a server manager.
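The distributed batch gradient descent loop above can be sketched with a toy in-memory parameter server. This is a simplified, single-process simulation (the `ParameterServer` class, the least-squares loss, and all parameter names are illustrative, not from the paper); in the real system, workers run concurrently on separate machines.

```python
import numpy as np

class ParameterServer:
    """Toy in-memory parameter server holding the weight vector w."""
    def __init__(self, d):
        self.w = np.zeros(d)
        self.pending = []          # gradients pushed during this round

    def pull(self):
        return self.w.copy()       # workers read w

    def push(self, grad):
        self.pending.append(grad)  # workers write gradients

    def update(self, lr):
        # Aggregate the gradients from all workers, then take one step.
        total = np.sum(self.pending, axis=0)
        self.w -= lr * total
        self.pending = []

def worker_gradient(w, X, y):
    # Gradient of a least-squares loss l(x_i, y_i, w) = (x_i . w - y_i)^2 / 2.
    return X.T @ (X @ w - y)

rng = np.random.default_rng(0)
d, n, m = 3, 300, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true

ps = ParameterServer(d)
shards = np.array_split(np.arange(n), m)   # each worker owns 1/m of the data
for step in range(200):
    for s in shards:                       # in reality these run concurrently
        w = ps.pull()
        ps.push(worker_gradient(w, X[s], y[s]))
    ps.update(lr=0.001)

print(np.allclose(ps.w, w_true, atol=1e-3))
```

The key structural point is that workers only ever `pull` and `push`; the server owns `w` and decides when to apply the aggregated update.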
Workers are organized into multiple worker groups, and each worker group is managed by a task scheduler. The server manager decides which data is assigned to which server. The task scheduler assigns tasks to workers and monitors their progress.

Parameters are stored as key-value pairs. For example, a weight vector $w \in \mathbb{R}^d$ can be stored as a set of pairs $(i, w_i)$ for $1 \leq i \leq d$. To store sparse vectors more efficiently, only the non-zero entries of $w$ need to be explicitly stored. If a pair $(i, w_i)$ is missing, the parameter server assumes $w_i = 0$.

Most machine learning algorithms do not update individual entries of a weight vector one at a time. Instead, they typically update part of a vector or part of a matrix. To take advantage of this workload pattern, the parameter server allows workers to read and write ranges of values instead of single values at a time. This reduces network overhead. In addition to pushing and pulling entries of $w$, workers can also register user-defined functions to run at a server. For example, a server can compute the gradient of a regularization term.

By default, the parameter server runs tasks asynchronously. That is, if a worker issues a pull or push request, it does not block. However, the parameter server also allows workers to explicitly mark dependencies between different requests, which forces them to serialize. Some algorithms are robust to weak consistency, and these algorithms can often run faster under weaker consistency. The parameter server provides three levels of consistency:

1. Sequential consistency, in which every request is totally serialized.
2. Eventual consistency, in which requests are run whenever they please.
3. Bounded delay, in which a request is delayed until all tasks that began $\tau$ time ago have completed.

Users can also specify that a certain consistency model only apply to a certain subset of key-value pairs, as specified by a filter.
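The sparse key-value representation and the range reads/writes can be sketched in a few lines. This is a toy sketch (the class and method names are mine, not the paper's API): missing keys default to zero, and a whole contiguous range moves in one request.

```python
# Toy sketch of the key-value layer: only non-zero entries are stored,
# and reads/writes operate on contiguous key ranges to cut network overhead.
class SparseParams:
    def __init__(self):
        self.kv = {}                      # index -> non-zero weight

    def push_range(self, lo, hi, values):
        # Write the contiguous range [lo, hi) in one request.
        for i, v in zip(range(lo, hi), values):
            if v == 0:
                self.kv.pop(i, None)      # a missing key means w_i = 0
            else:
                self.kv[i] = v

    def pull_range(self, lo, hi):
        # Read a contiguous range; absent keys default to 0.
        return [self.kv.get(i, 0.0) for i in range(lo, hi)]

p = SparseParams()
p.push_range(0, 4, [0.0, 2.5, 0.0, -1.0])
print(p.pull_range(0, 6))   # → [0.0, 2.5, 0.0, -1.0, 0.0, 0.0]
print(len(p.kv))            # → 2 (only the non-zero entries are stored)
```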
#### Implementation
Data is partitioned across servers using consistent hashing, and the server manager records the assignment of key ranges to machines. When a new server joins:

1. The server manager assigns a new key range to the server.
2. The server fetches its data from other servers.
3. The server manager broadcasts the update to the other servers, which relinquish data they no longer own.

The parameter server uses chain replication to replicate data. Each node forms a chain with the $k$ previous nodes in the hashing ring. Workers send updates to the master, and each update is chain replicated to the next $k$ servers. Logically, the parameter server tags each key-value pair with a vector clock (though honestly, I'm not exactly sure I understand why). Physically, each range of key-value pairs is associated with a vector clock. This range-based vector clock avoids storing redundant vector clock information. Messages sent from workers to servers are compressed with Snappy. Moreover, servers cache parts of messages, and workers can send a hash instead of a whole message if they think a server has it cached.
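A minimal consistent-hashing sketch shows why adding a server is cheap: only the keys that hash into the new server's arc of the ring move. All details below (one point per server, MD5 as the hash) are simplifying assumptions, not the paper's implementation, which also tracks key-range metadata and replication.

```python
import bisect
import hashlib

def h(s):
    # Hash a string to a point on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        self.ring = sorted((h(s), s) for s in servers)

    def owner(self, key):
        # The owner is the first server clockwise from the key's hash.
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, h(key)) % len(self.ring)
        return self.ring[i][1]

    def add(self, server):
        # A new server takes over a slice of one existing server's key
        # range; keys hashing elsewhere keep their owner.
        bisect.insort(self.ring, (h(server), server))

ring = Ring(["s1", "s2", "s3"])
before = {k: ring.owner(k) for k in map(str, range(1000))}
ring.add("s4")
after = {k: ring.owner(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
print(f"{moved} of 1000 keys moved")   # only a fraction of keys move
```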
#### What are BLOBs
Binary Large OBjects (BLOBs) are immutable binary data: a BLOB is created once, read many times, never modified, and sometimes deleted. In Facebook's context, this includes photos, videos, documents, traces, heap dumps, and source code. Facebook originally used Haystack as its BLOB storage system, but Haystack is designed for IO-bound workloads, not for storage efficiency.

To take a more specific example, when we post a new picture, a BLOB is created. Its data is written to the BLOB storage system, and a handle corresponding to this data is stored in a Graph Store for quick retrieval later on. When someone clicks on that picture, a read request is fired and the handle is fetched from the Graph Store. Using this handle, the data (in this case the picture) can be retrieved. A CDN is in place to cache frequently accessed BLOBs.

#### BLOB Storage Design
BLOBs are aggregated into logical volumes. Volumes are initially unlocked and support reads, writes, and deletes. Once a volume is full, it gets locked and no more creates are allowed. Each volume consists of three files: a data file that holds BLOBs and their metadata, an index file that is a snapshot of the in-memory index of the storage machine, and a journal file that tracks which BLOBs have been deleted. For fault tolerance and performance, each logical volume maps to multiple physical volumes, with each physical volume living entirely on one Haystack host. Haystack tolerates disk, host, rack, and data center failure at an effective replication factor of 3.6.

All seems good so far, but the problem is the large and ever-increasing storage footprint of the BLOBs. Facebook made an interesting observation in this regard: there is a large and increasing number of BLOBs with low request rates. For these BLOBs, triple replication is overkill. If only there were a separate BLOB storage system for these BLOBs. This is where f4 comes into the picture.
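The data-file/index/journal split can be sketched as follows. This is a toy model (the class and its in-memory representation are illustrative, not Facebook's on-disk format): BLOBs are appended to a data file, an index maps ids to offsets, and deletes only add an entry to the journal rather than touching the data.

```python
# Toy sketch of a Haystack-style volume: append-only data, an index of
# (offset, length) pairs, and a journal marking deleted BLOBs.
class Volume:
    def __init__(self):
        self.data = bytearray()   # stand-in for the data file
        self.index = {}           # blob_id -> (offset, length)
        self.journal = set()      # ids of deleted BLOBs
        self.locked = False       # locked volumes reject creates

    def create(self, blob_id, payload: bytes):
        if self.locked:
            raise RuntimeError("volume is locked: no more creates")
        self.index[blob_id] = (len(self.data), len(payload))
        self.data += payload      # append-only write

    def read(self, blob_id):
        if blob_id in self.journal:
            return None           # logically deleted
        off, length = self.index[blob_id]
        return bytes(self.data[off:off + length])

    def delete(self, blob_id):
        self.journal.add(blob_id) # data stays put; the journal marks it dead

v = Volume()
v.create("a", b"cat photo")
v.create("b", b"heap dump")
v.delete("a")
v.locked = True               # a full volume is locked; only reads/deletes remain
print(v.read("a"))            # → None (journaled as deleted)
print(v.read("b"))            # → b'heap dump'
```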
#### Warm vs Hot
Facebook benchmarked its existing implementation with a two-week trace and observed that temperature zones exist: some BLOBs have a very high request rate (hot BLOBs) and some have a low request rate (warm BLOBs). f4 is proposed as a system for warm BLOB storage, while Haystack continues to be used for hot BLOBs. Another observation was that the age and temperature of a BLOB are correlated: new BLOBs are queried and deleted at a much higher rate. Lastly, they observed that warm content is large and growing, which furthers the need for f4.

#### Design goals
f4 was designed to reduce the effective replication factor without compromising reliability or performance.

#### Overview
f4 is composed of a number of cells, each cell residing completely in one data center. A cell is responsible for reliably storing a set of locked volumes and uses Reed-Solomon coding to store these volumes with lower storage overhead. The downside is increased rebuild and recovery time under failure and lower maximum read throughput. Since f4 cells support only read and delete operations, only data and index files are needed, and both are read-only. To track deletes, all BLOBs are encrypted with keys stored in an external database; deleting the key for a BLOB in f4 logically deletes it.

The index files use triple replication within a cell, and the data file is encoded and stored via a Reed-Solomon $(n, k)$ code. The file is logically partitioned into contiguous sequences of $n$ blocks, each of size $b$. For each such sequence of $n$ blocks, $k$ parity blocks are generated, making the size of the stripe $n + k$ blocks. For a given block in a stripe, the other blocks in the stripe are called its companion blocks. BLOBs can be read directly from their data block. If a block is unavailable, it can be recovered by decoding any $n$ of its companion and parity blocks.
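f4 uses a Reed-Solomon $(n, k)$ code; as a simplified stand-in, the sketch below uses a single XOR parity block (effectively the $k = 1$ case), which already shows the core idea: a missing block in a stripe can be rebuilt from its companion blocks. Real Reed-Solomon tolerates any $k$ simultaneous losses, which plain XOR does not.

```python
# Simplified erasure-coding sketch: one XOR parity block per stripe.
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equal-length blocks.
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

n, b = 4, 8                                # 4 data blocks of 8 bytes each
data = [bytes([i] * b) for i in range(n)]  # toy block contents
parity = xor_blocks(data)                  # the stripe is data + [parity]

# Simulate losing block 2 and rebuilding it from its companion blocks.
companions = data[:2] + data[3:] + [parity]
rebuilt = xor_blocks(companions)
print(rebuilt == data[2])   # → True
```

The recovery works because XOR-ing all blocks of a stripe (data plus parity) yields zero, so XOR-ing everything except the missing block yields exactly the missing block.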
The block size for encoding is kept quite large (around 1 GB) because it decreases the number of BLOBs that span multiple blocks, thereby reducing the I/O operations needed for reads, and it reduces the amount of per-block metadata that f4 must maintain. A much larger size, however, would mean greater overhead for rebuilding blocks.

#### Components
1. Name Node: maintains the mapping between data blocks and parity blocks and the storage nodes that hold the actual blocks.
2. Storage Nodes: expose two APIs, an Index API that provides existence and location information for volumes, and a File API that provides access to data.
3. Backoff Nodes: storage-less, CPU-heavy nodes that handle the online reconstruction of requested BLOBs (not entire blocks).
4. Rebuilder Nodes: similar to Backoff Nodes, but they handle the offline reconstruction of entire blocks.
5. Coordinator Nodes: storage-less, CPU-heavy nodes that handle maintenance tasks, such as scheduling block rebuilding and ensuring an optimal data layout.

A separate geo-replicated XOR coding scheme is used to tolerate data center or geographic region failure. Each volume/stripe/block is paired with a buddy volume/stripe/block in a different geographic region, and an XOR of the buddy blocks (called the XOR block) is stored in a third region. The effective replication factor then turns out to be 2.1.

#### Are the results good?
The results are impressive. With a corpus of 65 PB, f4 will save Facebook over 39 PB and 87 PB of storage at effective replication factors of 2.8 and 2.1, respectively. All this comes with low latency and fault tolerance.

#### Lessons learnt
The existence of temperature zones is not unique to Facebook. Such zones should appear in any kind of data at a large enough scale, with the line separating the zones depending on the request and delete rates.
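The 2.8 and 2.1 figures fall out of a little arithmetic, assuming the paper's Reed-Solomon(10, 4) code (10 data blocks carry 4 parity blocks):

```python
# Effective replication factors under RS(10, 4), the code f4 reportedly uses.
n, k = 10, 4
rs_overhead = (n + k) / n       # 1.4x storage overhead within a single cell

# Two complete RS-coded copies in two data centers:
two_cell = 2 * rs_overhead
print(round(two_cell, 2))       # → 2.8

# Geo XOR scheme: two buddy volumes plus one XOR volume in a third region
# together cover two volumes' worth of data, so divide 3 * 1.4 by 2.
geo_xor = 3 * rs_overhead / 2
print(round(geo_xor, 2))        # → 2.1
```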
Since older data is likely to have a different query rate than newer data, efficient migration from hot to warm storage (before eventually moving data to cold storage) needs to be explored more. This also suggests that a single data management system cannot handle data of all ages, because the constraints on the data change over time. In Facebook's context, any data written to Haystack was constrained by an at-most-one-writer requirement, while data on f4 was free of this constraint. So two different data models, each optimized for its own use case, could be used.

Until now we have seen data management systems categorized by the nature of the data (relational or NoSQL) or the nature of the queries (read- vs write-oriented). This case study opens the door to a new kind of system, one that migrates data from one data model to another based on temperature zones. This is what Facebook ended up creating for this particular scenario.

This case study also reinforces the value of modular architecture. Facebook had a clear need for a separate data storage mechanism, but what made it possible was a modular architecture that allowed for easy migration of data from Haystack to f4. Facebook's overall architecture is also designed to enable warm storage. For example, the caching stack caches the results related to the most popular content, which is expected to be newer content, and Haystack handles most of the reads and deletes, thereby reducing the load on f4.

I would look out for similar case studies from other data giants as well. They may be tackling this problem from an entirely different perspective; whatever their approaches, it would be interesting to compare them with f4.