Distributed GraphLab: A Framework for Machine Learning in the Cloud
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein
arXiv e-Print archive, 2012
Keywords:
cs.DB, cs.LG
First published: 2012/04/26
Abstract: While high-level data parallel frameworks, like MapReduce, simplify the
design and implementation of large-scale data processing systems, they do not
naturally or efficiently support many important data mining and machine
learning algorithms and can lead to inefficient learning systems. To help fill
this critical void, we introduced the GraphLab abstraction which naturally
expresses asynchronous, dynamic, graph-parallel computation while ensuring data
consistency and achieving a high degree of parallel performance in the
shared-memory setting. In this paper, we extend the GraphLab framework to the
substantially more challenging distributed setting while preserving strong data
consistency guarantees. We develop graph-based extensions to pipelined locking
and data versioning to reduce network congestion and mitigate the effect of
network latency. We also introduce fault tolerance to the GraphLab abstraction
using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can
be easily implemented by exploiting the GraphLab abstraction itself. Finally,
we evaluate our distributed implementation of the GraphLab abstraction on a
large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains
over Hadoop-based implementations.
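To make the abstraction concrete, below is a minimal, single-machine sketch of the graph-parallel update-function pattern the abstract refers to, using PageRank as the running example. All names here (Vertex, pagerank_update, the std::queue scheduler) are illustrative stand-ins, not GraphLab's actual C++ API; the real engine executes such updates asynchronously across machines while enforcing the consistency guarantees described above.

```cpp
// Sketch of a graph-parallel update function with dynamic scheduling:
// each vertex recomputes its value from its local neighborhood and
// reschedules only the neighbors its change affects. The sequential
// queue stands in for GraphLab's asynchronous distributed engine.
#include <cmath>
#include <cstdio>
#include <queue>
#include <vector>

struct Vertex {
    double rank = 1.0;
    std::vector<int> in_nbrs;   // vertices linking to this one
    std::vector<int> out_nbrs;  // vertices this one links to
};

using Graph = std::vector<Vertex>;

// PageRank-style update: recompute this vertex's value from its
// in-neighbors; if the change is significant, reschedule out-neighbors.
void pagerank_update(Graph& g, int v, std::queue<int>& scheduler) {
    const double damping = 0.85, tolerance = 1e-5;
    double sum = 0.0;
    for (int u : g[v].in_nbrs)
        sum += g[u].rank / g[u].out_nbrs.size();
    double new_rank = (1.0 - damping) + damping * sum;
    if (std::fabs(new_rank - g[v].rank) > tolerance)
        for (int w : g[v].out_nbrs)
            scheduler.push(w);  // dynamic scheduling of affected work
    g[v].rank = new_rank;
}

int main() {
    // Tiny example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
    Graph g(3);
    g[0].out_nbrs = {1, 2};  g[0].in_nbrs = {2};
    g[1].out_nbrs = {2};     g[1].in_nbrs = {0};
    g[2].out_nbrs = {0};     g[2].in_nbrs = {0, 1};

    // Drain the scheduler until no vertex changes significantly.
    std::queue<int> scheduler;
    for (int v = 0; v < 3; ++v) scheduler.push(v);
    while (!scheduler.empty()) {
        int v = scheduler.front();
        scheduler.pop();
        pagerank_update(g, v, scheduler);
    }
    for (int v = 0; v < 3; ++v)
        printf("vertex %d: rank %.4f\n", v, g[v].rank);
}
```

The dynamic-scheduling line is the key point: a vertex whose value changed meaningfully re-enqueues only its affected neighbors, which is what lets a GraphLab-style engine skip the full synchronous passes over the data that a MapReduce formulation of the same algorithm would require.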