[link]
Summary by Shagun Sodhani 8 years ago
# TAO
* Geographically distributed, read-optimized, graph data store.
* Favors availability and efficiency over consistency.
* Developed by and used within Facebook (social graph).
* Link to [paper](https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf).
## Before TAO
* Facebook's servers directly accessed MySQL to read/write the social graph.
* Memcache used as a look-aside cache.
* Had several issue:
* **Inefficient edge list** - A key-value store is not a good design for storing a list of edges.
* **Distributed Control Logic** - In look-aside cache architecture, the control logic runs on the clients which increase the number of failure modes.
* **Expensive Read-After-Write Consistency** - Facebook used asynchronous master-slave replication for MySQL which introduced a time lag before latest data would reflect in the local replicas.
## TAO Data Model
* **Objects**
* Typed nodes (type is denoted by `otype`)
* Identified by 64-bit integers.
* Contains data in the form of key-value pairs.
* Models users and repeatable actions (eg `comments`).
* **Associations**
* Typed directed edges between objects (type is denoted by `atype`)
* Identified by source object `id1`, `atype` and destination object `id2`.
* Contains data in the form of key-value pairs.
* Contains a 32-bit time field.
* Models actions that happen at most once or records state transition (eg `like`)
* Often `inverse association` is also meaningful (eg `like` and `liked by`).
## Query API
* Support to create, retrieve, update and delete objects and associations.
* Support to get all associations (`assoc_get`) or their count(`assoc_count`)
based on starting node, time, index and limit parameters.
## TAO Architecture
### Storage Layer
* Objects and associations stored in MySQL.
* TAO API mapped to SQL queries.
* Data divided into logical shards.
* Objects bound to the shard for their lifetime(`shard_id` is embedded in `id`).
* Associations stored on the shard of its `id` (for faster association query).
### Caching Layer
* Consists of multiple cache servers (together form a `tier`).
* In memory, LRU cache stores objects, association lists, and association counts.
* Write operation on association list with inverse involves writing 2 shards (for `id1` and `id2`).
* The client sends the query to cache layer which issues inverse write query to shard2 and once that is completed, a write query is made to shard1.
* Write failure leads to hanging associations which are repaired by an asynchronous job.
### Leaders and Followers
* A single, large tier is prone to hot spots and square growth in terms of all-to-all connections.
* Cache split into 2 levels - one **leader tier** and multiple **follower tiers**.
* Clients communicate only with the followers.
* In the case of read miss/write, followers forward the request to the leader which connects to the storage layer.
* Eventual consistency maintained by serving cache maintenance messages from leaders to followers.
* Object update in leaders leads results in `invalidation message` to followers.
* Leader sends `refill message` to notify about association write.
* Leaders also serialize concurrent writes and mediates thundering herds.
## Scaling Geographically
* Since workload is read intensive, read misses are serviced locally at the expense of data freshness.
* In the multi-region configuration, there are master-slave regions for each shard and each region has its own followers, leader, and database.
* Database in the local region is a replica of the database in the master region.
* In the case of read miss, the leader always queries the region database (irrespective of it being the master database or slave database).
* In the case of write, the leader in the local region would forward the request to database in the master region.
## Optimisations
### Caching Server
* RAM is partitioned into `arena` to extend the lifetime of important data types.
* For small, fixed-size items (eg association count), a direct 8-way associative cache is maintained to avoid the use of pointers.
* Each `atype` is mapped to 16-bit value to reduce memory footprint.
### Cache Sharding and Hot Spots
* Load is balanced among followers through `shard cloning`(reads to a shard are served by multiple followers in a tier).
* Response to query include the object's access rate and version number. If the access rate is too high, the object is cached by the client itself. Next time when the query comes, the data is omitted in the reply if it has not changed since the previous version.
### High Degree Objects
* In the case of `assoc_count`, the edge direction is chosen on the basis of which node (source or destination) has a lower degree (to optimize reading the association list).
* For `assoc_get` query, only those associations are searched where time > object's creation time.
## Failure Detection and Handling
* Aggressive network timeouts to detect (potential) failed nodes.
### Database Failure
* In the case of master failure, one of the slaves take over automatically.
* In case of slave failure, cache miss are redirected to TAO leader in the region hosting the database master.
### Leader Failure
* When a leader cache server fails, followers route read miss directly to the database and write to a replacement leader (chosen randomly from the leader tier).
### Refill and Invalidation Failures
* Refill and invalidation are sent asynchronously.
* If the follower is not available, it is stored in leader's disk.
* These messages will be lost in case of leader failure.
* To maintain consistency, all the shards mapping to a failed leader are invalidated.
### Follower Failure
* Each TAO client is configured with a primary and backup follower tier.
* In normal mode, the request is made to primary tier and in the case of its failure, requests go to backup tier.
* Read after write consistency may be violated if failing over between different tiers (read reaches the failover target before writer's `refill` or `invalidate`).
more
less