# Conditional Generative Adversarial Nets ## Introduction * Conditional version of [Generative Adversarial Nets (GAN)](https://gist.github.com/shagunsodhani/1f9dc0444142be8bd8a7404a226880eb) where both generator and discriminator are conditioned on some data **y** (class label or data from some other modality). * [Link to the paper](https://arxiv.org/abs/1411.1784) ## Architecture * Feed **y** into both the generator and discriminator as additional input layers such that **y** and input are combined in a joint hidden representation. ## Experiment ### Unimodal Setting * Conditioning MNIST images on class labels. * *z* (random noise) and **y** mapped to hidden ReLU layers of size 200 and 1000 respectively, then combined into a joint ReLU layer of dimensionality 1200. * Discriminator maps *x* (input) and **y** to maxout layers and the joint maxout layer is fed to a sigmoid layer. * Results do not outperform the state-of-the-art but do provide a proof of concept. ### Multimodal Setting * Map images (from Flickr) to labels (or user tags) to obtain the one-to-many mapping. * Extract image and text features using a convolutional model and a language model respectively. * Generative Model * Map noise and convolutional features to a single 200 dimensional representation. * Discriminator Model * Combine the representation of word vectors (corresponding to tags) and images. ## Future Work * While the results are not so good, they do show the potential of Conditional GANs, especially in the multimodal setting.
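A minimal PyTorch sketch of the unimodal MNIST setup described above, assuming one-hot class labels of size 10; the 200/1000/1200 layer sizes follow the summary, while the discriminator layer sizes and the use of plain ReLU in place of maxout units are illustrative stand-ins rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=100, y_dim=10, out_dim=784):
        super().__init__()
        self.z_fc = nn.Linear(z_dim, 200)     # noise branch
        self.y_fc = nn.Linear(y_dim, 1000)    # label branch
        self.joint = nn.Sequential(nn.Linear(1200, 1200), nn.ReLU(),
                                   nn.Linear(1200, out_dim), nn.Sigmoid())

    def forward(self, z, y):
        # combine noise and condition in a joint hidden representation
        h = torch.cat([torch.relu(self.z_fc(z)), torch.relu(self.y_fc(y))], dim=1)
        return self.joint(h)

class ConditionalDiscriminator(nn.Module):
    def __init__(self, x_dim=784, y_dim=10):
        super().__init__()
        self.x_fc = nn.Linear(x_dim, 240)     # input branch
        self.y_fc = nn.Linear(y_dim, 50)      # label branch
        self.joint = nn.Sequential(nn.Linear(290, 240), nn.ReLU(),
                                   nn.Linear(240, 1), nn.Sigmoid())

    def forward(self, x, y):
        h = torch.cat([torch.relu(self.x_fc(x)), torch.relu(self.y_fc(y))], dim=1)
        return self.joint(h)  # probability that (x, y) is a real labelled sample
```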
# Addressing the Rare Word Problem in Neural Machine Translation ## Introduction * NMT (Neural Machine Translation) systems perform poorly with respect to OOV (out-of-vocabulary) or rare words. * The paper presents a word-alignment based technique for translating such rare words. * [Link to the paper](https://arxiv.org/abs/1410.8206) ## Technique * Annotate the training corpus with information about what different OOV words (in the target sentence) correspond to in the source sentence. * NMT learns to track the alignment of rare words across source and target sentences and emits such alignments for the test sentences. * As a post-processing step, use a dictionary to map rare words from the source language to the target language. ## Annotating the Corpus ### Copy Model * Annotate the OOV words in the source sentence with tokens *unk1*, *unk2*, ..., etc. such that repeated words get the same token. * In the target sentence, each OOV word that is aligned to some OOV word in the source sentence is assigned the same token as that source word. * An OOV word in the target sentence that has no alignment, or is aligned with a known word in the source sentence, is assigned the null token. * Pros * Very straightforward. * Cons * Misses out on words which are not labelled as OOV in the source language. ### PosAll - Positional All Model * All OOV words in the source language are assigned a single *unk* token. * All words in the target sentences are assigned positional tokens which denote that the *jth* word in the target sentence is aligned to the *ith* word in the source sentence. * Aligned words that are too far apart, or are unaligned, are assigned a null token. * Pros * Captures complete alignment between source and target sentences. * Cons * It doubles the length of target sentences. ### PosUnk - Positional Unknown Model * All OOV words in the source language are assigned a single *unk* token. * All OOV words in the target sentences are assigned a *unk* token carrying the relative position of the target word with respect to its aligned source word (see the sketch below). * Pros * Faster than the PosAll model. * Cons * Does not capture alignment for all words. ## Experiments * Dataset * Subset of WMT'14 dataset * Alignment computed using the [Berkeley Aligner](https://code.google.com/archive/p/berkeleyaligner/) * Used architecture from [Sequence to Sequence Learning with Neural Networks paper](https://gist.github.com/shagunsodhani/a2915921d7d0ac5cfd0e379025acfb9f). ## Results * All 3 approaches (most notably the PosUnk approach) improve the performance of existing NMTs, in the order PosUnk > PosAll > Copy. * Ensemble models benefit more than individual models as the ensemble of NMT models works better at aligning the OOV words. * Performance gains are larger when using a smaller vocabulary. * Rare word analysis shows that performance gains are larger when the proportion of OOV words is higher.
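An illustrative sketch of the PosUnk-style annotation, assuming alignments are given as a dict mapping target positions to source positions and a relative-position window of 7; the token strings (`<unk>`, `<unkpos_d>`, `<unkpos_null>`) are placeholders rather than the paper's exact vocabulary.

```python
def annotate_posunk(src_tokens, tgt_tokens, vocab_src, vocab_tgt, align, window=7):
    """Replace OOV words with positional unk tokens (PosUnk-style annotation)."""
    # every OOV source word gets the same generic <unk> token
    src_out = [w if w in vocab_src else "<unk>" for w in src_tokens]
    tgt_out = []
    for j, w in enumerate(tgt_tokens):
        if w in vocab_tgt:
            tgt_out.append(w)
            continue
        i = align.get(j)                              # aligned source position, if any
        if i is not None and abs(j - i) <= window:
            tgt_out.append(f"<unkpos_{j - i}>")       # relative position to the source word
        else:
            tgt_out.append("<unkpos_null>")           # unaligned or too far apart
    return src_out, tgt_out

# toy usage with a hypothetical vocabulary and alignment
src = ["the", "zyx", "is", "here"]
tgt = ["le", "zyx", "est", "ici"]
print(annotate_posunk(src, tgt, {"the", "is", "here"}, {"le", "est", "ici"}, {1: 1}))
```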
# Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models ## Introduction * The paper presents a novel open vocabulary NMT (Neural Machine Translation) system that translates mostly at the word level and falls back to character-level models for rare words. * Advantages: * Faster and easier to train as compared to character models. * Does not produce unknown words in the translations, which would otherwise need to be removed using *unk replacement* techniques. * [Link to the paper](https://arxiv.org/abs/1604.00788) ## Unk Replacement Technique * Most NMT systems operate on a constrained vocabulary and represent unknown words with the *unk* token. * A post-processing step replaces *unk* tokens with actual words using alignment information. * Disadvantages: * These systems treat words as independent entities while they are morphologically related. * Difficult to capture things like name translation. ## Proposed Architecture ### Word-level NMT * Deep LSTM encoder-decoder. * Global attention mechanism and bilinear attention scoring function. * Similar to a regular NMT system except in the way unknown words are handled. ### Character-level NMT * Deep LSTM model used to generate an on-the-fly representation of rare words (using the final hidden state from the top layer). * Advantages: * Simplified architecture. * Efficiency through precomputation - representations for rare source words can be computed at once before each mini-batch. * The model can be trained easily in an end-to-end fashion. #### Hidden-state Initialization * For the source representation, layers of the LSTM are initialized with zero hidden states and cell values. * For the target representation, the same strategy is followed except for the hidden state of the first layer, where one of the following approaches is used: * **same-path** target generation approach * Use the context vector just before softmax (of the word-level NMT). * **separate-path** target generation approach * Learn a new weight matrix **W** that will be used to generate the context vector. ### Training Objective * *J = J<sub>w</sub> + αJ<sub>c</sub>* * *J* - total loss * *J<sub>w</sub>* - loss in a regular word-level NMT * *αJ<sub>c</sub>* - weighted loss in the character-level NMT ### Word Character Generation Strategy * The final hidden state from the character-level decoder could be interpreted as the representation of the *unk* token, but this approach would not be efficient. * Instead, *unk* is fed to the word-level decoder as it is, so that the execution of the character-level model is decoupled and can run once the word-level model finishes. * During testing, a beam search decoder is run at the word level to find the best translation using the word NMT alone. * Next, the character-level model is used to generate words in place of *unk* to minimise the combined loss. ## Experiments ### Data * WMT’15 translation task from English into Czech with newstest2013 (3000 sentences) as the dev set and newstest2015 (2656 sentences) as the test set. ### Metrics * Case-sensitive NIST BLEU. * chrF3 ### Models * Purely word based * Purely character based * Hybrid (proposed model) ### Observations * The hybrid model surpasses all the other systems (neural/non-neural) and establishes a new state-of-the-art result for English-Czech translation in WMT’15 with 19.9 BLEU. * Character-level models, when used as a replacement for the standard unk replacement technique in NMT, yield an improvement of up to +7.9 BLEU points. * Attention is very important for character-based models as the non-attentional character models perform poorly. 
* Character models with shorter time-step backpropagation perform worse than ones with longer backpropagation. * The separate-path strategy outperforms the same-path strategy. ### Rare word embeddings * Obtain representations for rare words. * Compare the Spearman correlation between similarity scores assigned by humans and by the model. * Outperforms the recursive neural network model (which also uses a morphological analyser) on this task.
# Improving Word Representations via Global Context and Multiple Word Prototypes ## Introduction * This paper predates papers like GloVe and Word2Vec and proposed an architecture that * combines local and global context while learning word embeddings to capture the word semantics. * learns multiple embeddings per word to account for homonymy and polysemy. * [Link to the paper](http://www.aclweb.org/anthology/P12-1092) ## Global Context-Aware Neural Language Model ### Training Objective * Given a word sequence *s* (local context) and a document *d* in which the sequence occurs (global context), learn word representations while learning to discriminate the last correct word in *s* from other words. * *g(s, d)* - scoring function giving the likelihood of the correct sequence. * *g(s<sup>w</sup>, d)* - scoring function giving the likelihood of the sequence with the last word in *s* replaced by a word *w*. * Objective - *g(s, d)* > *g(s<sup>w</sup>, d)* + 1 for any other word *w*. ### Architecture * Two scoring components (neural networks) to capture: * Local Context * Map word sequence *s* into an ordered list of vectors *x = [x<sub>1</sub>, ..., x<sub>m</sub>]*. * *x<sub>i</sub>* - embedding corresponding to the *i<sup>th</sup>* word in the sequence. * Compute the local score *score<sub>l</sub>* by using a neural network (with one hidden layer) over *x*. * Preserves word order and syntactic information. * Global Context * Map document *d* to an ordered list of word embeddings, *d = (d<sub>1</sub>, ..., d<sub>k</sub>)*. * Compute *c*, the weighted average of all word vectors in the document. * The paper uses *idf* scores for the weighting. * *x* = concatenation of *c* and the vector of the last word in *s*. * Compute the global score *score<sub>g</sub>* by using a neural network (with two hidden layers) over *x*. * Similar to bag-of-words features. *score = score<sub>l</sub> + score<sub>g</sub>* * Train the weights of the hidden layers and the word embeddings. ### Multi-Prototype Neural Language Model * Words can have different meanings in different contexts, which is difficult to capture when we train only one vector per word. * Solution - train multiple vectors per word to capture the different meanings. * Approach * Gather all the fixed-sized context windows for all occurrences of a given word. * Find the context vector by performing weighted averaging of all the words in the context window. * Cluster the context vectors using spherical k-means. * Each word occurrence in the corpus is re-labeled to its associated cluster. * To find the similarity between a pair of words *(w, w')*: * For each pair of clusters *i* and *j* corresponding to the words *w* and *w'*, find the distance between the cluster centres of *i* and *j* and weight it by the product of the probabilities of *w* belonging to *i* and *w'* belonging to *j*, given their respective contexts. * Average the value over the *k<sup>2</sup>* pairs. ## Training * Dataset * Wikipedia corpus * Parameters * 10-word windows * 100 hidden units * No weight regularization * 10 different word embeddings learnt for words having multiple meanings. 
## Evaluation * Dataset * WordSim-353 * 353 pairs of nouns * words represented without context * contains human similarity judgements on pairs of words * The paper contributed a new dataset * captures human similarity judgements on pairs of words in the context of a sentence * consists of verbs and adjectives along with nouns * for details on how the dataset is constructed, refer to the paper * Performance * The proposed model achieves higher correlation with human scores than models using only the local or global context. * Performance can be improved by removing the stop words. * Using the multi-prototype approach (multiple vectors for the same word) benefits the model on the tasks where the context is also given. ## Comments * This work predated the more general word embedding models like [Word2Vec](https://gist.github.com/shagunsodhani/176a283e2c158a75a0a6) and [GloVe](https://gist.github.com/shagunsodhani/efea5a42d17e0fcf18374df8e3e4b3e8). While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER.
# Skip-Thought Vectors ## Introduction * The paper describes an unsupervised approach to train a generic, distributed sentence encoder. * It also describes a vocabulary expansion method to encode words not seen at training time. * [Link to the paper](https://arxiv.org/abs/1506.06726) ## Skip-Thoughts * Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence. * The model is called **skip-thoughts** and the encoded vectors are called **skip-thought vectors.** * Similar to the [skip-gram](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model in the sense that surrounding sentences are used to learn sentence vectors. ### Architecture * Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence). * **Encoder** * RNN Encoder with GRU. * **Decoder** * RNN Decoder with conditional GRU. * Conditioned on the encoder output. * Extra matrices introduced to bias the update gate, reset gate and hidden state, given the encoder output. * **Vocabulary matrix (V)** - Weight matrix having one row (vector) for each word in the vocabulary. * Separate decoders for the previous and next sentence which share only **V**. * Given the decoder context **h** (at any time), the encoder output, and the list of words already generated for the output sentence, the probability of choosing *w* as the next word is proportional to *exp(**v**<sub>w</sub> · **h**)*, where **v**<sub>w</sub> is the row of **V** corresponding to *w*. * **Objective** * Sum of the log-probabilities for the forward and backward sentences conditioned on the encoder output. ## Vocabulary Expansion * Use a model like Word2Vec which can be trained to induce word representations and train it to obtain embeddings for all the words that are likely to be seen by the encoder. * Learn a matrix **W** such that *encoder(word) ≈ W · Word2Vec(word)* (a linear mapping) for all words that are common to both the Word2Vec model and the encoder model (see the sketch after this summary). * Use **W** to generate embeddings for words that are not seen during encoder training. ## Dataset * [BookCorpus dataset](https://arxiv.org/abs/1506.06724) having books across 16 genres. ## Training * **uni-skip** * Unidirectional encoder with 2400 dimensions. * **bi-skip** * Bidirectional model with forward (sentence given in correct order) and backward (sentence given in reverse order) encoders of 1200 dimensions each. * **combine-skip** * Concatenation of uni-skip and bi-skip vectors. * Initialization * Recurrent matrices - orthogonal initialization. * Non-recurrent matrices - uniform distribution in [-0.1, 0.1]. * Mini-batches of size 128. * Gradient clipping at norm = 10. * Adam optimizer. ## Experiments * After learning skip-thoughts, freeze the model and use the encoder as a feature extractor only. * Evaluated the vectors with linear models on the following tasks: ### Semantic Relatedness * Given a sentence pair, predict how closely related the two sentences are. * The **skip-thoughts** method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs. * Using features learned from an image-sentence embedding model on COCO boosts performance and brings it on par with dependency tree-LSTMs. ### Paraphrase detection * **skip-thoughts** outperforms recursive nets with dynamic pooling if no hand-crafted features are used. 
* **skip-thoughts** with basic pairwise statistics produces results comparable with state-of-the-art systems that use complicated features and hand engineering. ### Image-sentence Ranking * MS COCO dataset * Task * Image annotation * Given an image, rank the sentences on the basis of how well they describe the image. * Image search - Given a caption, find the image that is being described. * Though the system does not outperform the baseline system in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch. ### Classification * **skip-thoughts** perform about as well as bag-of-words baselines but are outperformed by methods where the sentence representation has been learnt for the task at hand. * Combining **skip-thoughts** with bi-gram Naive Bayes (NB) features improves the performance. ## Future Work * Variants to be explored include: * Fine-tuning the encoder-decoder model during the downstream task instead of freezing the weights. * Deep encoders and decoders. * Larger context windows. * Encoding and decoding paragraphs. * Encoders, such as convnets.
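A small NumPy sketch of the vocabulary-expansion step described above: fit a linear map **W** from the word2vec space to the encoder's embedding space on the shared vocabulary, then apply it to words the encoder never saw. Both inputs are assumed to be `{word: vector}` dicts; names and the plain least-squares fit are illustrative choices.

```python
import numpy as np

def expand_vocabulary(w2v_vecs, encoder_vecs, new_words):
    # words known to both models form the training set for the mapping
    shared = [w for w in encoder_vecs if w in w2v_vecs]
    X = np.stack([w2v_vecs[w] for w in shared])        # (n, d_word2vec)
    Y = np.stack([encoder_vecs[w] for w in shared])    # (n, d_encoder)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)          # un-regularised least squares
    # project unseen words into the encoder's embedding space
    return {w: w2v_vecs[w] @ W for w in new_words if w in w2v_vecs}
```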
# Deep Convolutional Generative Adversarial Nets ## Introduction * The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN) - a topologically constrained variant of conditional GAN. * [Link to the paper](https://arxiv.org/abs/1511.06434) ## Benefits * Stable to train. * Very useful for learning unsupervised image representations. ## Model * GANs are difficult to scale using CNNs. * The paper proposes the following changes to GANs: * Replace any pooling layers with strided convolutions (for the discriminator) and fractionally-strided convolutions (for the generator). * Remove fully connected hidden layers. * Use batch normalisation in both the generator (all layers except the output layer) and the discriminator (all layers except the input layer). * Use LeakyReLU in all layers of the discriminator. * Use ReLU activation in all layers of the generator (except the output layer, which uses Tanh). ## Datasets * Large-Scale Scene Understanding. * Imagenet-1K. * Faces dataset. ## Hyperparameters * Minibatch SGD with minibatch size of 128. * Weights initialized with a zero-centered Normal distribution with standard deviation = 0.02. * Adam Optimizer. * Slope of leak = 0.2 for LeakyReLU. * Learning rate = 0.0002, β1 = 0.5. ## Observations * Large-Scale Scene Understanding data * Demonstrates that the model scales with more data and to higher-resolution generation. * It is unlikely that the model memorized images (due to the low learning rate of minibatch SGD). * Classifying CIFAR-10 dataset * Features * Train on Imagenet-1K and test on CIFAR-10. * Max-pool the discriminator's convolutional features (from all layers) to get 4x4 spatial grids. * Flatten and concatenate to get a 28672-dimensional vector. * Linear L2-SVM classifier trained over the feature vector. * 82.8% accuracy, outperforms K-means (80.6%). * Street View House Number Classifier * Similar pipeline as CIFAR-10. * 22.48% test error. * The paper contains many examples of images generated by final and intermediate layers of the network. * Interpolations in the latent space do not show sharp transitions, indicating that the network did not memorize images. * DCGAN can learn an interesting hierarchy of features. * The network seems to have some success in disentangling image representation from object representation. * Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually.
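A compact PyTorch sketch of a DCGAN-style generator following the guidelines above: fractionally-strided (transposed) convolutions instead of pooling, no fully connected hidden layers, batch normalisation everywhere except the output layer, ReLU inside and Tanh at the output. The 64x64 three-channel output and filter counts are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn

def dcgan_generator(z_dim=100, ngf=64, channels=3):
    """Maps a (B, z_dim, 1, 1) noise tensor to a (B, channels, 64, 64) image in [-1, 1]."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),    # 1x1   -> 4x4
        nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 4x4   -> 8x8
        nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 8x8   -> 16x16
        nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 16x16 -> 32x32
        nn.BatchNorm2d(ngf), nn.ReLU(True),
        nn.ConvTranspose2d(ngf, channels, 4, 2, 1, bias=False),     # 32x32 -> 64x64
        nn.Tanh(),                                                  # no batch norm at the output
    )
```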
# Generative Adversarial Nets ## Introduction * The paper proposes an adversarial approach for estimating generative models where one model (the generative model) tries to learn a data distribution and another model (the discriminative model) tries to distinguish between samples from the generative model and the original data distribution. * [Link to the paper](https://arxiv.org/abs/1406.2661) ## Adversarial Net * Two models - a Generative Model (*G*) and a Discriminative Model (*D*). * Both are multi-layer perceptrons. * *G* takes as input a noise variable *z* and outputs a data sample *x (= G(z))*. * *D* takes as input a data sample *x* and predicts whether it came from the true data or from *G*. * *G* tries to minimise *log(1-D(G(z)))* while *D* tries to maximise the probability of correct classification. * Think of it as a minimax game between 2 players; the global optimum is reached when *G* generates perfect samples and *D* cannot distinguish between the samples (thereby always returning 0.5 as the probability of a sample coming from the true data). * Alternate between *k* steps of training *D* and 1 step of training *G* so that *D* is maintained near its optimal solution (see the sketch below). * When starting training, the loss *log(1-D(G(z)))* would saturate as *G* would be weak. Instead, maximise *log(D(G(z)))*. * The paper contains the theoretical proof for the global optimum of the minimax game. ## Experiments * Datasets * MNIST, Toronto Face Database, CIFAR-10 * Generator model uses ReLU and sigmoid activations. * Discriminator model uses maxout and dropout. * Evaluation Metric * Fit a Gaussian Parzen window to samples obtained from *G* and compare log-likelihoods. ## Strengths * Computational advantages * Backprop is sufficient for training with no need for Markov chains or performing inference. * A variety of functions can be used in the model. * Since *G* is trained only using the gradients from *D*, there is less chance of directly copying features from the true data. * Can represent sharp (even degenerate) distributions. ## Weaknesses * *D* must be well synchronised with *G*. * While *G* may learn to sample data points that are indistinguishable from true data, no explicit representation can be obtained. ## Possible Extensions * Conditional generative models. * Inference network to predict *z* given *x*. * Implement a stochastic extension of the deterministic [Multi-Prediction Deep Boltzmann Machines](https://papers.nips.cc/paper/5024-multi-prediction-deep-boltzmann-machines.pdf). * Using the discriminator net or inference net for feature selection. * Accelerating training by ensuring better coordination between *G* and *D* or by determining better distributions to sample *z* from during training.
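A minimal PyTorch sketch of the alternating training procedure described above, using the non-saturating generator objective (maximise *log(D(G(z)))* rather than minimising *log(1-D(G(z)))*). `G`, `D`, their optimizers and the real-data batches are assumed to be defined elsewhere; this is a sketch, not the paper's exact training code.

```python
import torch

def gan_step(G, D, opt_g, opt_d, real_x, z_dim, k=1):
    # k discriminator updates: maximise log D(x) + log(1 - D(G(z)))
    for _ in range(k):
        z = torch.randn(real_x.size(0), z_dim)
        fake_x = G(z).detach()                       # no gradient flows into G here
        d_loss = -(torch.log(D(real_x)).mean() + torch.log(1 - D(fake_x)).mean())
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # one generator update with the non-saturating loss: maximise log D(G(z))
    z = torch.randn(real_x.size(0), z_dim)
    g_loss = -torch.log(D(G(z))).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```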
# A Roadmap towards Machine Intelligence ## Introduction * The paper presents some general characteristics that intelligent machines should possess and a roadmap to develop such intelligent machines in small, realistic steps. * [Link to the paper](https://arxiv.org/abs/1511.08130) ## Ability to Communicate * The intelligent agents should be able to communicate with humans, preferably using language as the medium. * Such systems can be programmed through natural language and can access much of the human knowledge which is encoded using natural language. * The learning environment should facilitate interactive communication and the machine should have a minimalistic bit interface for IO to keep the interface simple. * Further, the machine should be free to use any internal representation for learning tasks. ## Ability to Learn * Learning allows the machine to adapt to the external environment and correct its mistakes. * Users should be able to control the motivation of the machine via a communication channel. This is similar to the notion of rewards in reinforcement learning. ## A simulated ecosystem to educate communication-based intelligent machines * Simulated environment to teach basic linguistic interactions and know-how to operate in the world. * Though the environment should be challenging enough to force the machine to "learn how to learn", its complexity should be manageable. * Unlike classic AI block worlds, the simulated environment is not intended to teach an exhaustive set of functionality to the agent. The aim is to teach the machine how to learn efficiently by combining already acquired skills. ### Description #### Agent * Learner or actor * Teacher * Assigns tasks and rewards to the learner and provides helpful information. * The aim is to kick-start the learner's efficient learning capabilities without providing too much direct information. * Environment * Learner explores the environment by giving orders, asking questions and receiving feedback. * Environment uses a controlled language which is more explicit and restricted. Think of the learner as a high-level programming language, the teacher as the programmer and the environment as the compiler. #### Interface Channels * Generic input and output channels. * Teacher and environment write to the input channel. * Reward is written to the input channel. * Learner writes to the output channel and learns to use ambiguous prefixes to address the agents and services it needs to interact with. #### Reward * Way to provide feedback to the learner. * Rewards should become sparse as the learner's intelligence grows and "curiosity" should be a learnt strategy. * Learner should maximise average reward over time so that faster strategies are preferred in case of equal rewards. #### Incremental Structure * Think of the learner progressing through different levels where skills from earlier levels can be used in later levels. * Tasks need not be ordered within a level. * Learner starts by performing basic tasks like repeating characters and then learns to associate linguistic strings with action sequences. Further, the learner learns to ask questions and "read" natural text. #### Time Off * Learner is given time to either explore the environment, interact with the Teacher, or update its internal structure by replaying the previous experience. #### Evaluation * Evaluating the learning agent on the final behaviour only is not sufficient as it overlooks the number of attempts needed to reach the optimal behaviour. 
* A better approach would be to conduct a public competition where developers have access to the preprogrammed environment for a fixed amount of time and learners are evaluated on tasks that are considerably different from the tasks encountered during training. #### Tasks A brief overview of the types of tasks is provided [here](https://github.com/facebookresearch/CommAI-env/blob/master/TASKS.md) ## Types of Learning * Concept of positive and negative rewards. * Discovery of algorithms. * Remembering facts, skills, and learning strategies. ## Long term memory * To store facts, algorithms and even the ability to learn. ## Compositional Learning Skills * Producing new structures by combining together known facts and skills. * Understanding new concepts should not always require training examples. ## Computational properties of intelligent machines * The computational model should be able to represent any pattern in data (alternatively, represent any algorithm in fixed length). * Among the various Turing-complete computational systems available, the most natural choice would be a compositional system that can perform computations in parallel. * Alternatively, a non-growing model with immensely large capacity could be used. * In a growing model, new cells are connected to the ones that spawned them, leading to topological structures that can contribute to learning. * But it is not clear if such topological structures can arise in a large-capacity unstructured model.
# Smart Reply: Automated Response Suggestion for Email ## Introduction * Proposes a novel, end-to-end architecture for generating short email responses. * Single most important benchmark of its success is that it is deployed in Inbox by Gmail and assists with around 10% of all mobile responses. * [Link to the paper.](https://arxiv.org/abs/1606.04870) ## Challenges in deploying Smart Reply in a user-facing product * Responses must always be of high quality. Ensured by constructing a target response set to select responses from. * The likelihood of choosing the responses must be maximised. Ensured by normalising the responses and enforcing diversity. * The system should not add latency to emails. Ensured by using a triggering model to decide if the email is suitable to undergo the response generation pipeline. Computation time is further reduced by finding approximate best result instead of the best result. * Ensure privacy by encrypting all the data which adds challenge in verifying the model's quality and debugging the system. ## Architecture ## Preprocess Email * Perform actions like language detection, tokenization, sentence segmentation etc on the input email. ## Triggering Model * A feed-forward neural network (with embedding layer and 3 fully connected hidden layers) to decide if the input email is suitable for suggesting responses. #### Data * Training set of pairs *(o, y)* where *o* is the incoming message and *y* is a boolean variable to indicate if the message had a response. #### Features * Unigrams, bigrams from the messages. * Signals like - is the recipient in the contact list of the sender. ## Response Selection * LSTM network to predict the approximate best response for an incoming message *o* #### Network * Sequence to Sequence Learning. * Reads the input message (token by token) and encode a vector representation. * Compute softmax to get the probability of first output token given the input token sequence. * Keep feeding in the previous response tokens and the input token sequence to compute the probability of next output token. * During inference, approximate the most likely response greedily by taking the most likely response at each timestamp and feeding it back or by using the beam search approach. ## Response Set Generation * Generate a set of high-quality responses that also capture the variability in the intent of the response. * Canonicalize the email response by extracting the semantic structure using a dependency parser. * Partition all response messages into "semantic" clusters. * These semantic clusters define the response space for scoring and selecting possible responses and for promoting diversity among the responses. ## Semantic Intent Clustering * Since a large, labelled dataset is not available, a graph based, semi-supervised approach is used. #### Graph Construction * Manually define a few clusters with a small number of example responses for each cluster. * Construct a graph with frequent response messages (including the labelled nodes) as response nodes (V<sub>R</sub>). * For each response node, extract a set of feature nodes (V<sub>F</sub>) corresponding to features like skip-gram and n-grams and add an edge between the response node and the feature node. * Learn a semantic labelling for all response nodes by propagating semantic intent information (available because of labelled nodes) throughout the graph. 
* After some iterations, sample some of the unlabeled nodes from the graph, manually label these sampled nodes and repeat this algorithm until convergence. * For validation, extract the top k members of each cluster and validate the quality with the help of human evaluators. ## Suggestion Diversity * Provide users with a varied set of responses by omitting redundant responses (by not selecting more than one response from any semantic cluster) and by enforcing negative (or positive) responses. * If the top two responses contain at least one positive (negative) response and none of the top three responses is negative (positive), the third response is replaced with a negative (positive) one. * This is done by performing a second LSTM pass where the search is restricted to only positive (or negative) responses in the target set. ## Strengths * The system is already in production and assists with around 10% of all mobile responses.
#### Introduction * The paper explores the domain of conditional image generation by adopting and improving the PixelCNN architecture. * [Link to the paper](https://arxiv.org/abs/1606.05328) #### Based on PixelRNN and PixelCNN * Models the image pixel by pixel by decomposing the joint image distribution as a product of conditionals. * PixelRNN uses two-dimensional LSTM while PixelCNN uses convolutional networks. * PixelRNN gives better results but PixelCNN is faster to train. #### Gated PixelCNN * PixelRNN outperforms PixelCNN due to its larger receptive field and because it contains multiplicative units (LSTM gates), which allow modelling more complex interactions. * To account for these, deeper models and gated activation units (equation 2 in the [paper](https://arxiv.org/abs/1606.05328); see the sketch below) can be used respectively. * Masked convolutions can lead to blind spots in the receptive fields. * These can be removed by combining 2 convolutional network stacks: * Horizontal stack - conditions on the current row. * Vertical stack - conditions on all rows above the current row. * Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. * Residual connections are used in the horizontal stack and not in the vertical stack (as they did not seem to improve results in the initial settings). #### Conditional PixelCNN * Model the conditional distribution of the image given a high-level description, represented using the latent vector h (equation 4 in the [paper](https://arxiv.org/abs/1606.05328)). * This conditioning does not depend on the location of the pixel in the image. * To consider the location as well, map h to a spatial representation $s = m(h)$ (equation 5 in the [paper](https://arxiv.org/abs/1606.05328)). #### PixelCNN Auto-Encoders * Start with a traditional auto-encoder architecture, replace the deconvolutional decoder with PixelCNN and train the network end-to-end. #### Experiments * For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as well, and takes much less time to train. * In the case of conditioning on ImageNet classes, the log-likelihood measure did not improve a lot but the visual quality of the generated samples was significantly improved. * The paper also includes sample images generated by conditioning on human portraits and by training a PixelCNN auto-encoder on ImageNet patches.
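A PyTorch sketch of the gated activation unit referenced above (equation 2 in the paper): the output of a (masked) convolution is split into two halves, one passed through tanh and one through a sigmoid gate, with an optional conditioning vector *h* projected and added to both. Shapes and the conditioning interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    def __init__(self, channels, h_dim=None):
        super().__init__()
        # projection of the conditioning vector h, broadcast over spatial positions
        self.cond = nn.Linear(h_dim, 2 * channels) if h_dim else None

    def forward(self, conv_out, h=None):
        # conv_out: (B, 2*channels, H, W) output of a masked convolution
        if self.cond is not None and h is not None:
            conv_out = conv_out + self.cond(h)[:, :, None, None]
        a, b = conv_out.chunk(2, dim=1)
        return torch.tanh(a) * torch.sigmoid(b)   # tanh branch modulated by a sigmoid gate
```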
#### Introduction * Problem: Building an expressive, tractable and scalable image model which can be used in downstream tasks like image generation, reconstruction, compression etc. * [Link to the paper](https://arxiv.org/abs/1601.06759) #### Model * Scan the image, one row at a time and one pixel at a time (within each row). * Given the scanned content, predict the distribution over the possible values for the next pixel. * The joint distribution over the pixel values is factorised into a product of conditional distributions, thus casting the problem as a sequence problem. * Parameters used in prediction are shared across all the pixel positions. * Since each pixel is jointly determined by 3 values (3 colour channels), each channel may be conditioned on the other channels as well. ##### Pixel as discrete value * The conditional distributions are multinomial (with the channel variable taking 1 of 256 discrete values). * This discrete representation is simpler and easier to learn. #### Pixel RNN ##### Row LSTM * Unidirectional layer that processes the image row by row. * Uses one-dimensional convolution (kernel of size kx1, k>=3). * Refer to image 2 in the [paper](https://arxiv.org/abs/1601.06759). * Weight sharing in the convolution ensures translation invariance of the computed features along each row. * For the LSTM, the input-to-state component is computed for the entire 2-d input map and then masked to include only the valid context. * For equations related to the state-to-state component, refer to equation 3 in the [paper](https://arxiv.org/abs/1601.06759). ##### Diagonal BiLSTM * Bidirectional layer that processes the image in a diagonal fashion. * Input map skewed by offsetting each row of the image by one position with respect to the previous row. * Refer to image 3 in the [paper](https://arxiv.org/abs/1601.06759). * For both directions, the input-to-state component is a 1 x 1 convolution while the state-to-state recurrent component is computed with a column-wise convolution using kernel size 2x1. * The kernel size of 2x1 processes minimal information, yielding a highly non-linear computation. * The output map is skewed back by removing the offset positions. * To prevent layers from seeing future pixels, the right output map is shifted down by one row and added to the left output map. ##### Residual Connections * Residual connections (or skip connections) are used to increase convergence speed and to propagate signals more explicitly. * Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759). ##### Masked Convolutions * Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting values for the R channel, values of the B channel cannot be used). * Mask A is applied to the first convolution layer and restricts connections to only those neighbouring pixels and colour channels that have already been seen. * Mask B is applied to all subsequent input-to-state convolution transitions and allows connections from a colour channel to itself. * Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759). ##### PixelCNN * Uses multiple convolution layers that preserve spatial resolution. * Makes the receptive field large but not unbounded. * A mask is used to avoid seeing the future context. * Faster than PixelRNN at training and evaluation time (as convolutions can be parallelized easily). ##### Multi-Scale PixelRNN * Composed of one unconditional PixelRNN and multiple conditional PixelRNNs. * The unconditional network generates a smaller s x s image which is fed as input to the conditional PixelRNN. 
* The conditional PixelRNN generates the final n x n image (n is a multiple of s); it is a standard PixelRNN with layers biased by an upsampled version of the s x s image. * For upsampling, a convolution network with deconvolution layers constructs an enlarged feature map of size c x n x n. * For biasing, the c x n x n map is mapped to a 4h x n x n map (using a 1x1 unmasked convolution) and added to the input-to-state map. #### Training and Evaluation * Pixel values are dequantized using real-valued noise and the log-likelihoods of continuous and discrete models are compared. * Update rule - RMSProp. * Batch size - 16 for MNIST and CIFAR-10 and 32 (or 64) for ImageNet. * Residual connections are as effective as skip connections; in fact, the two can be used together as well. * PixelRNN outperforms other models on Binary MNIST and CIFAR-10. * For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the order of the receptive field sizes for the 3 architectures, and the observation underlines the importance of having a large receptive field. * The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.
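A PyTorch sketch of a masked 2-D convolution as used by PixelCNN-style models: mask type 'A' blocks the centre pixel (first layer), type 'B' allows it (subsequent layers). The per-colour-channel ordering within the centre pixel is omitted for brevity, so this is a simplification of the paper's full mask rather than an exact reproduction.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)
        # zero out the centre pixel (type 'A' only) and everything to its right...
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0
        # ...and all rows below the centre, so only already-seen context is used
        mask[:, :, kH // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask     # re-apply the mask before every convolution
        return super().forward(x)

# example: a 7x7 type-'A' masked convolution for the first layer
first_layer = MaskedConv2d("A", in_channels=3, out_channels=64, kernel_size=7, padding=3)
```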
#### Introduction * The paper presents gradient computation based techniques to visualise image classification models. * [Link to the paper](https://arxiv.org/abs/1312.6034) #### Experimental Setup * Single deep convNet trained on the ILSVRC-2013 dataset (1.2M training images and 1000 classes). * Weight layer configuration is: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000. #### Class Model Visualisation * Given a learnt ConvNet and a class of interest, start with the zero image and perform optimisation by backpropagating with respect to the input image (keeping the ConvNet weights constant). * Add the mean image (of the training set) to the resulting image. * The paper used unnormalised class scores so that the optimisation focuses on increasing the score of the target class and not on decreasing the scores of other classes. #### Image-Specific Class Saliency Visualisation * Given an image, a class of interest, and a trained ConvNet, rank the pixels of the input image based on their influence on the class score. * The derivative of the class score with respect to the image gives an estimate of the importance of different pixels for the class. * The magnitude of the derivative also indicates how much each pixel needs to be changed to improve the class score. ##### Class Saliency Extraction * Find the derivative of the class score with respect to the input image. * This results in one saliency map per colour channel. * To obtain a single saliency map, take the maximum magnitude of the derivative across all colour channels. ##### Weakly Supervised Object Localisation * The saliency map for an image provides a rough encoding of the location of the object of the class of interest. * Given an image and its saliency map, an object segmentation map can be computed using GraphCut colour segmentation. * Colour continuity cues are needed as saliency maps might capture only the most dominant part of the object in the image. * This weakly supervised approach achieves 46.4% top-5 error on the test set of ILSVRC-2013. #### Relation to Deconvolutional Networks * DeconvNet-based reconstruction of the $n^{th}$ layer input is similar to computing the gradient of the visualised neuron activity $f$ with respect to the input layer. * One difference is in the way ReLU neurons are treated: * In DeconvNet, the sign indicator (for the derivative of ReLU) is computed on the output reconstruction, while in this paper the sign indicator is computed on the layer input.
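A short PyTorch sketch of the image-specific class saliency computation described above: backpropagate the (unnormalised) class score to the input image and take the per-pixel maximum absolute gradient over colour channels. `model` is assumed to be any classifier returning pre-softmax class scores.

```python
import torch

def saliency_map(model, image, target_class):
    """image: (C, H, W) tensor; returns an (H, W) saliency map."""
    model.eval()
    x = image.detach().clone().unsqueeze(0).requires_grad_(True)
    score = model(x)[0, target_class]      # unnormalised score of the class of interest
    score.backward()                       # gradient of the class score w.r.t. the input
    return x.grad.abs().squeeze(0).max(dim=0).values   # max magnitude over colour channels
```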
#### Introduction * Introduces fastText, a simple and highly efficient approach for text classification. * On par with deep learning models in terms of accuracy though orders of magnitude faster for training and evaluation. * [Link to the paper](http://arxiv.org/abs/1607.01759v3) * [Link to code](https://github.com/facebookresearch/fastText) #### Architecture * Built on top of linear models with a rank constraint and a fast loss approximation. * Start with word representations that are averaged into a text representation and feed it to a linear classifier (see the sketch below). * Think of the text representation as a hidden state that can be shared among features and classes. * Softmax layer to obtain a probability distribution over pre-defined classes. * High computational complexity $O(kh)$, where $k$ is the number of classes and $h$ is the dimension of the text representation. ##### Hierarchical Softmax * Based on the Huffman Coding Tree. * Used to reduce the complexity to $O(h\log(k))$. * The top T results (from the tree) can be computed efficiently in $O(\log(T))$ using a binary heap. ##### N-gram Features * Instead of explicitly using word order, uses a bag of n-grams to maintain efficiency without losing accuracy. * Uses the [hashing trick](https://arxiv.org/pdf/0902.2206.pdf) to maintain a fast and memory-efficient mapping of the n-grams. #### Experiments ##### Sentiment Analysis * fastText benefits from using bigrams. * Outperforms [char-CNN](http://arxiv.org/abs/1502.01710v5) and [char-CRNN](http://arxiv.org/abs/1602.00367v1) and performs a bit worse than [VDCNN](http://arxiv.org/abs/1606.01781v1). * Orders of magnitude faster in terms of training time. * Note: fastText does not use pre-trained word embeddings. ##### Tag Prediction * fastText with bigrams outperforms [Tagspace](http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf). * fastText performs up to 600 times faster at test time.
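A minimal PyTorch sketch of this averaging-plus-linear-classifier architecture: word (and, in the full model, n-gram) embeddings are mean-pooled into a text representation and passed through a linear layer. The bucketed n-gram hashing and hierarchical softmax of fastText are omitted, so this is an illustrative simplification rather than the library's implementation.

```python
import torch
import torch.nn as nn

class FastTextLikeClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        # EmbeddingBag with mode="mean" averages the embeddings of each text's tokens
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        text_repr = self.embedding(token_ids, offsets)  # shared hidden text representation
        return self.fc(text_repr)                       # class logits; softmax applied in the loss

# usage: two texts packed into one flat index tensor, starting at offsets 0 and 3
model = FastTextLikeClassifier(vocab_size=10000, embed_dim=10, num_classes=4)
logits = model(torch.tensor([1, 2, 3, 4, 5]), torch.tensor([0, 3]))
```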
#### Introduction * Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods. #### Global Matrix Factorization Methods * Decompose large matrices into low-rank approximations. * eg - Latent Semantic Analysis (LSA) ##### Limitations * Poor performance on the word analogy task. * Frequent words contribute disproportionately to the similarity measure. #### Shallow, Local Context-Based Window Methods * Learn word representations using adjacent words. * eg - Continuous bag-of-words (CBOW) model and skip-gram model. ##### Limitations * Since they do not operate directly on the global co-occurrence counts, they cannot utilise the statistics of the corpus effectively. #### GloVe Model * To capture the relationship between words $i$ and $j$, word vector models should use ratios of co-occurrence probabilities (with probe words $k$) instead of the raw probabilities themselves. * In the most general form: * $F(w_{i}, w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$ * We want $F$ to encode information in the vector space (which has a linear structure), so we can restrict it to the difference of $w_{i}$ and $w_{j}$: * $F(w_{i} - w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$ * Since the right-hand side is a scalar and the arguments on the left are vectors, we take the dot product of the arguments: * $F((w_{i} - w_{j})^{T}\tilde{w}_{k}) = P_{ik}/P_{jk}$ * $F$ should be invariant to the order of the word pair $i$ and $j$. * $F(w_{i}^{T}\tilde{w}_{k}) = P_{ik}$ * Doing further simplifications and optimisations (refer to the paper), we get the cost function (see the sketch at the end of this summary): * $J = \sum_{i, j = 1}^{V} f(X_{ij})\left[w_{i}^{T}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log(X_{ij})\right]^{2}$ * $f$ is a weighting function: * $f(x) = \min((x/x_{max})^{\alpha}, 1)$ * Typical values are $x_{max} = 100$ and $\alpha = 3/4$. * $b$ and $\tilde{b}$ are the bias terms. ##### Complexity * Depends on the number of non-zero elements in the input matrix. * Upper-bounded by the square of the vocabulary size. * Since for shallow window-based approaches the complexity depends on $|C|$ (the size of the corpus), tighter bounds are needed. * By modelling the number of co-occurrences of words as a power-law function of frequency rank, the complexity can be shown to be proportional to $|C|^{0.8}$. #### Evaluation ##### Tasks * Word Analogies * a is to b as c is to ___? * Both semantic and syntactic pairs. * Find the closest d to $w_{b} - w_{a} + w_{c}$ (using cosine similarity). * Word Similarity * Named Entity Recognition ##### Datasets * Wikipedia Dumps - 2010 and 2014 * Gigaword5 * Combination of Gigaword5 and Wikipedia2014 * CommonCrawl * 400,000 most frequent words considered from the corpus. ##### Hyperparameters * Size of the context window. * Whether to distinguish left context from right context. * Word pairs that are $d$ words apart contribute $1/d$ to the total co-occurrence count. * $x_{max} = 100$ * $\alpha = 3/4$ * AdaGrad update. ##### Models Compared With * Singular Value Decomposition * Continuous Bag-Of-Words * Skip-Gram ##### Results * GloVe outperforms all other models significantly. * Diminishing returns for vectors larger than 200 dimensions. * Small and asymmetric context windows (context window only to the left) work better for syntactic tasks. * Long and symmetric context windows (context window to both sides) work better for semantic tasks. * The syntactic task benefited from a larger corpus though the semantic task performed better with Wikipedia instead of Gigaword5, probably due to the comprehensiveness of Wikipedia and the slightly outdated nature of Gigaword5. 
* Word2vec’s performance decreases if the number of negative samples increases beyond about 10. * For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, faster.
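A small NumPy sketch of evaluating the GloVe objective above for a batch of non-zero co-occurrence counts, using the weighting $f(x) = \min((x/x_{max})^{\alpha}, 1)$. Array names and shapes are illustrative; `W` and `W_tilde` hold the word and context vectors, `b` and `b_tilde` the biases.

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, i_idx, j_idx, X_ij, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe cost over a batch of (i, j, X_ij) triples."""
    f = np.minimum((X_ij / x_max) ** alpha, 1.0)            # weighting function f(X_ij)
    dots = np.sum(W[i_idx] * W_tilde[j_idx], axis=1)        # w_i . w~_j for each pair
    residual = dots + b[i_idx] + b_tilde[j_idx] - np.log(X_ij)
    return np.sum(f * residual ** 2)
```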
#### Introduction * Algorithm to derive the similarity between 2 nodes of a graph (or a graphical model derived from any other kind of dataset). * [Link to the paper](http://dl.acm.org/citation.cfm?id=775126) #### SimRank * Input: A directed graph $G = (V, E)$ where $V$ represents vertices and $E$ represents edges. * SimRank defines the similarity between 2 vertices (or nodes) $i$ and $j$ as the average of the similarity between their in-neighbours, decayed by a constant factor $C$. * Any node is maximally similar to itself (with similarity = 1). * PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges). * SimRank scores are symmetric and can be defined between all pairs of vertices. * $G^{2}$ is defined as the node-pair graph such that each node in $G^{2}$ corresponds to an ordered pair of nodes of $G$, and there is an edge from node pair (a, b) to (c, d) if $G$ has edges from a to c and from b to d. * In $G^{2}$, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the sources of similarity. #### Variants ##### Minimax Variant * Defines the similarity of nodes $i$ and $j$ as the minimum of the maximum similarity between $i$ and any in-neighbour of $j$ and between $j$ and any in-neighbour of $i$. #### Computing SimRank * A naive solution can be obtained by iteration to a fixed point (see the sketch below). * Space complexity is $O(n^{2})$ and time complexity is $O(kn^{2}d)$ where $k$ is the number of iterations, $n$ is the number of vertices and $d$ is the average of the product of in-degrees over pairs of vertices. * Optimisations can be made by setting the similarity between far-off nodes to 0 and considering only nearby nodes for an update. #### Different Interpretations ##### Co-citation Score * The first iteration of SimRank produces the same results as the co-citation score between a pair of vertices. * Successive iterations improve these initial scores. ##### Random Surfer-Pairs Model * SimRank $s(a, b)$ can be interpreted as a measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards. * The Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in lock step. * Surfers are allowed to teleport with a small probability - to circumvent the infinite-EMD problem. * Expected-f Meeting Distance - Given the length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval. * A common choice for f is $f(z) = C^{z}$ where $C \in (0, 1)$. * The SimRank score for two nodes, with parameter $C$, is the expected-f meeting distance travelling back-edges with $f(z) = C^{z}$. #### Evaluation * Experiments on 2 datasets: * Corpus of scientific research papers from ResearchIndex. * Transcripts of undergrad students at Stanford. * Domain-specific properties are used to measure similarity and compared with SimRank scores. * Results show improvement over co-citation scores.
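A naive fixed-point SimRank iteration in Python, taking the graph as an adjacency dict `{node: set of successor nodes}`; the decay `C` and the number of iterations are parameters, and the direct nested loops reflect the $O(kn^{2}d)$ cost noted above (no optimisations applied).

```python
from itertools import product

def simrank(succ, C=0.8, iters=10):
    nodes = list(succ)
    # in-neighbours of every node, derived from the successor sets
    pred = {v: {u for u in nodes if v in succ[u]} for v in nodes}
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0                  # every node is maximally similar to itself
            elif pred[a] and pred[b]:
                total = sum(sim[(x, y)] for x in pred[a] for y in pred[b])
                new[(a, b)] = C * total / (len(pred[a]) * len(pred[b]))
            else:
                new[(a, b)] = 0.0                  # a node with no in-neighbours gets similarity 0
        sim = new
    return sim

# toy usage on a 4-node citation-style graph
print(simrank({"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()})[("b", "c")])
```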
#### Introduction * The paper explores the strengths and weaknesses of different evaluation metrics for end-to-end dialogue systems (in an unsupervised setting). * [Link to the paper](https://arxiv.org/abs/1603.08023) #### Evaluation Metrics Considered ##### Word Based Similarity Metric ###### BLEU * Analyses the co-occurrences of n-grams in the ground truth and the proposed responses. * BLEU-N: N-gram precision for the entire dataset. * Brevity penalty added to avoid bias towards short sentences. ###### METEOR * Create an explicit alignment between candidate and target response (using WordNet, stemmed tokens, etc.). * Compute the harmonic mean of precision and recall between the proposed and ground-truth responses. ###### ROUGE * F-measure based on the Longest Common Subsequence (LCS) between candidate and target response. ##### Embedding Based Metric ###### Greedy Matching * Each token in the actual response is greedily matched with each token in the predicted response based on the cosine similarity of word embeddings (and vice-versa). * The total score is averaged over all words. ###### Embedding Average * Calculate a sentence-level embedding by averaging word-level embeddings (see the sketch below). * Compare sentence-level embeddings between candidate and target sentences. ###### Vector Extrema * For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding. * The idea is that by taking the maxima along each dimension, we can ignore the common words (which will be pulled towards the origin in the vector space). #### Dialogue Models Considered ##### Retrieval Models ###### TF-IDF * Compute the TF-IDF vectors for each context and response in the corpus. * C-TFIDF computes the cosine similarity between an input context and all other contexts in the corpus and returns the response with the highest score. * R-TFIDF computes the cosine similarity between the input context and each response directly. ###### Dual Encoder * Two RNNs which respectively compute the vector representation of the input context and response. * Then calculate the probability that the given response is the ground-truth response given the context. ##### Generative Models ###### LSTM language model * LSTM model trained to predict the next word in the (context, response) pair. * Given a context, the model encodes it with the LSTM and generates a response using a greedy beam search procedure. ###### Hierarchical Recurrent Encoder-Decoder (HRED) * Uses a hierarchy of encoders. * Each utterance in the context passes through an 'utterance-level' encoder and the output of these encoders is passed through a 'context-level' encoder. * Better handling of long-term dependencies as compared to the conventional Encoder-Decoder. #### Observations * Human survey to determine the correlation between human judgement on the quality of responses and the score assigned by each metric. * Metrics (especially BLEU-4 and BLEU-3) correlate poorly with human evaluation. * Best performing metric: * Using word-overlaps - BLEU-2 score * Using word embeddings - vector average * Embedding-based metrics would benefit from a weighting of word saliency. * BLEU could still be a good evaluation metric in constrained tasks like mapping dialogue acts to natural language sentences.
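A short sketch of the embedding-average metric referenced above: represent each response by the mean of its word vectors and score the pair by cosine similarity. `embeddings` is assumed to be a `{word: numpy vector}` dict (e.g. loaded from pre-trained word2vec vectors).

```python
import numpy as np

def embedding_average(candidate_tokens, reference_tokens, embeddings):
    def sentence_vector(tokens):
        vecs = [embeddings[w] for w in tokens if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else None

    c = sentence_vector(candidate_tokens)
    r = sentence_vector(reference_tokens)
    if c is None or r is None:
        return 0.0                      # no known words: treat the pair as dissimilar
    return float(np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r)))
```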
#### Introduction * Task of translating natural language queries into regular expressions without using domain-specific knowledge. * Proposes a methodology for collecting a large corpus of regular expression / natural language pairs. * Reports a performance gain of 19.6% over state-of-the-art models. * [Link to the paper](http://arxiv.org/abs/1608.03000v1) #### Architecture * LSTM based sequence to sequence neural network (with attention). * Six layers * One word embedding layer * Two encoder layers * Two decoder layers * One dense output layer * Attention over the encoder layer. * Dropout with probability 0.25. * 20 epochs, minibatch size of 32 and learning rate of 1 (with decay rate of 0.5). #### Dataset Generation * Created a public dataset - **NL-RX** - with 10K pairs of (regular expression, natural language). * Two-step generate-and-paraphrase approach * Generate step * Use a handcrafted grammar to translate regular expressions to natural language. * Paraphrase step * Crowdsource the task of translating the rigid descriptions into more natural expressions. #### Results * Evaluation Metric * Functional equality check (called DFA-Equal), as the same regular expression can be written in many ways. * The proposed architecture outperforms both baselines - a Nearest Neighbor classifier using Bag of Words (BoWNN) and Semantic-Unify.
#### Introduction * Large scale natural language understanding task - predict text values given a knowledge base. * Accompanied by a large dataset generated using Wikipedia * [Link to the paper](http://www.aclweb.org/anthology/P/P16/P16-1145.pdf) #### Dataset * WikiReading dataset built using Wikidata and Wikipedia. * Wikidata consists of statements of the form (property, value) about different items * 80M statements, 16M items and 884 properties. * These statements are grouped by items to get (item, property, answer) tuples where the answer is a set of values. * Items are further replaced by their Wikipedia documents to generate 18.58M statements of the form (document, property, answer). * Task is to predict answer given document and property. * Properties are divided into 2 classes: * **Categorical properties** - properties with a small number of possible answers. Eg gender. * **Relational properties** - properties with unique answers. Eg date of birth. * This classification is done on the basis of the entropy of answer distribution. * Properties with entropy less than 0.7 are classified as categorical properties. * Answer distribution has a small number of very high-frequency answers (head) and a large number of answers with very small frequency (tail). * 30% of the answers do not appear in the training set and must be inferred from the document. #### Models ##### Answer Classification * Consider WikiReading as classification task and treat each answer as a class label. ###### Baseline * Linear model over Bag of Words (BoW) features. * Two BoW vectors computed - one for the document and other for the property. These are concatenated into a single feature vector. ###### Neural Networks Method * Encode property and document into a joint representation which is fed into a softmax layer. * **Average Embeddings BoW** * Average the BoW embeddings for documents and property and concatenate to get joint representation. * **Paragraph Vectors** * As a variant of the previous method, encode document as a paragraph vector. * **LSTM Reader** * LSTM reads the property and document sequence, word-by-word, and uses the final state as joint representation. * **Attentive Reader** * Use attention mechanism to focus on relevant parts of the document for a given property. * **Memory Networks** * Maps a property p and list of sentences x<sub>1</sub>, x<sub>2</sub>, ...x<sub>n</sub> in a joint representation by attention over the sentences in the document. ##### Answer Extraction * For relational properties, it makes more sense to model the problem as information extraction than classification. * **RNNLabeler** * Use an RNN to read the sequence of words and estimate if a given word is part of the answer. * **Basic SeqToSeq (Sequence to Sequence)** * Similar to LSTM Reader but augmented with a second RNN to decode answer as a sequence of words. * **Placeholder SeqToSeq** * Extends Basic SeqToSeq to handle OOV (Out of Vocabulary) words by adding placeholders to the vocabulary. * OOV words in the document and answer are replaced by placeholders so that input and output sentences are a mixture of words and placeholders only. * **Basic Character SeqToSeq** * Property encoder RNN reads the property, character-by-character and transforms it into a fixed length vector. * This becomes the initial hidden state for the second layer of a 2-layer document encoder RNN. * Final state of this RNN is used by answer decoder RNN to generate answer as a character sequence. 
* **Character SeqToSeq with pretraining** * Train a character-level language model on input character sequences from the training set and use the weights to initialize the first layer of the encoder and decoder. #### Experiments * Evaluation metric is F1 score (harmonic mean of precision and recall). * All models perform well on categorical properties, with neural models outperforming others. * In the case of relational properties, SeqToSeq models have a clear edge. * SeqToSeq models also show a great deal of balance between relational and categorical properties. * Language model pretraining enhances the performance of the character SeqToSeq approach. * Results demonstrate that end-to-end SeqToSeq models are most promising for WikiReading-like tasks. |
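Below is a minimal, illustrative Python sketch of the idea behind the Placeholder SeqToSeq model: out-of-vocabulary words are mapped to a shared pool of placeholder symbols, consistently across the document and the answer. The function name, placeholder budget, and toy example are my own assumptions, not the paper's implementation.

```python
# Sketch of the placeholder trick: OOV words in the document and the answer are
# replaced by the same placeholder tokens, so the model only sees known symbols.
def add_placeholders(document, answer, vocab, num_placeholders=10):
    mapping = {}

    def convert(tokens):
        out = []
        for tok in tokens:
            if tok in vocab:
                out.append(tok)
            else:
                if tok not in mapping and len(mapping) < num_placeholders:
                    mapping[tok] = f"PLACEHOLDER_{len(mapping)}"
                out.append(mapping.get(tok, "UNK"))
        return out

    return convert(document), convert(answer), mapping

doc = "the film was directed by krzysztof kieslowski".split()
ans = "krzysztof kieslowski".split()
vocab = {"the", "film", "was", "directed", "by"}
print(add_placeholders(doc, ans, vocab))
# the OOV name maps to the same placeholders in both the document and the answer
```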
[link]
#### Introduction * Presents WikiQA - a publicly available set of question and sentence pairs for open-domain question answering. * [Link to the paper](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/) #### Dataset * 3047 questions sampled from Bing query logs. * Each question associated with a Wikipedia page. * All sentences in the summary paragraph of the page become the candidate answers. * Only about one-third of the questions have a correct answer in the candidate answer set. * Answer annotations crowdsourced through an MTurk-like platform. * Answer sentences are associated with *answer phrases* (the shortest substring of a sentence that answers the question), though this annotation is not used in the experiments reported by the paper. #### Other Datasets * [QASent dataset](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) * Uses questions from the TREC-QA dataset (questions from both query logs and human editors) and selects sentences which share at least one non-stopword with the question. * Lexical overlap makes the QA task easier. * Does not support evaluating *answer triggering* (detecting if the correct answer even exists in the candidate sentences). #### Experiments ##### Baseline Systems * **Word Count** - Counts the number of non-stopwords common to question and answer sentences. * **Weighted Word Count** - Re-weights word counts by the IDF values of the question words. * **[LCLR](https://www.microsoft.com/en-us/research/publication/question-answering-using-enhanced-lexical-semantic-models/)** - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models. * **Paragraph Vectors** - Considers cosine similarity between the question vector and the sentence vector. * **Convolutional Neural Network (CNN)** - Bigram CNN model with average pooling. * **PV-Cnt** and **CNN-Cnt** - Logistic regression classifiers combining the PV (and CNN) models with the Word Count models. ##### Metrics * MAP and MRR for the answer selection problem. * Precision, recall and F1 scores for the answer triggering problem. #### Observations * CNN-Cnt outperforms all other models on both tasks. * Three additional features, namely the length of the question (QLen), the length of the sentence (SLen), and the class of the question (QClass), are added to track question hardness and sentence comprehensiveness. * Adding QLen improves performance significantly while adding SLen (QClass) improves (degrades) performance marginally. * For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset. * Note: The dataset is too small to train end-to-end networks. |
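As an illustration of the simplest baselines above, the following hedged Python sketch implements Word Count and Weighted Word Count; the stopword list and IDF table are toy stand-ins for whatever preprocessing the paper actually used.

```python
# Toy sketch of the Word Count and Weighted Word Count baselines.
STOPWORDS = {"the", "a", "an", "of", "is", "was", "in", "to", "and", "by", "who"}

def word_count(question, sentence):
    q = {w.lower() for w in question.split()} - STOPWORDS
    s = {w.lower() for w in sentence.split()} - STOPWORDS
    return len(q & s)                       # shared non-stopwords

def weighted_word_count(question, sentence, idf):
    q = {w.lower() for w in question.split()} - STOPWORDS
    s = {w.lower() for w in sentence.split()} - STOPWORDS
    return sum(idf.get(w, 0.0) for w in q & s)   # IDF-weighted overlap

question = "who wrote the jungle book"
sentence = "The Jungle Book was written by Rudyard Kipling"
print(word_count(question, sentence))       # overlap on {'jungle', 'book'} -> 2
```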
[link]
#### Introduction * Build a supervised reading comprehension data set using news corpus. * Compare the performance of neural models and state-of-the-art natural language processing model on reading comprehension task. * [Link to the paper](http://arxiv.org/abs/1506.03340v3) #### Reading Comprehension * Estimate conditional probability $p(a|c, q)$, where $c$ is a context document, $q$ is a query related to the document, and $a$ is the answer to that query. #### Dataset Generation * Use online newspapers (CNN and DailyMail) and their matching summaries. * Parse summaries and bullet points into Cloze style questions. * Generate corpus of document-query-answer triplets by replacing one entity at a time with a placeholder. * Data anonymized and randomised using coreference systems, abstract entity markers and random permutation of the entity markers. * The processed data set is more focused in terms of evaluating reading comprehension as models can not exploit co-occurrence. #### Models ##### Baseline Models * **Majority Baseline** * Picks the most frequently observed entity in the context document. * **Exclusive Majority** * Picks the most frequently observed entity in the context document which is not observed in the query. ##### Symbolic Matching Models * **Frame-Semantic Parsing** * Parse the sentence to find predicates to answer questions like "who did what to whom". * Extracting entity-predicate triples $(e1,V, e2)$ from query $q$ and context document $d$ * Resolve queries using rules like `exact match`, `matching entity` etc. * **Word Distance Benchmark** * Align placeholder of Cloze form questions with each possible entity in the context document and calculate the distance between the question and the context around the aligned entity. * Sum the distance of every word in $q$ to their nearest aligned word in $d$ ##### Neural Network Models * **Deep LSTM Reader** * Test the ability of Deep LSTM encoders to handle significantly longer sequences. * Feed the document query pair as a single large document, one word at a time. * Use Deep LSTM cell with skip connections from input to hidden layers and hidden layer to output. * **Attentive Reader** * Employ attention model to overcome the bottleneck of fixed width hidden vector. * Encode the document and the query using separate bidirectional single layer LSTM. * Query encoding is obtained by concatenating the final forward and backwards outputs. * Document encoding is obtained by a weighted sum of output vectors (obtained by concatenating the forward and backwards outputs). * The weights can be interpreted as the degree to which the network attends to a particular token in the document. * Model completed by defining a non-linear combination of document and query embedding. * **Impatient Reader** * As an add-on to the attentive reader, the model can re-read the document as each query token is read. * Model accumulates the information from the document as each query token is seen and finally outputs a joint document query representation in the form of a non-linear combination of document embedding and query embedding. #### Result * Attentive and Impatient Readers outperform all other models highlighting the benefits of attention modelling. * Frame-Semantic pipeline does not scale to cases where several methods are needed to answer a query. * Moreover, they provide poor coverage as a lot of relations do not adhere to the default predicate-argument structure. 
* Word Distance approach outperformed the Frame-Semantic approach as there was significant lexical overlap between the query and the document. * The paper also includes heat maps over the context documents to visualise the attention mechanism. |
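To make the dataset-generation step concrete, here is a hedged Python sketch that turns a highlight sentence into a Cloze query by masking one entity and anonymising the rest with abstract markers. Entity spans are hard-coded here, whereas the paper derives them from a coreference/NER pipeline and additionally permutes the markers.

```python
# Illustrative sketch of Cloze-style question generation: replace one entity
# with a placeholder and anonymize all entities with abstract markers.
def make_cloze(highlight, entities, answer_entity):
    markers = {e: f"@entity{i}" for i, e in enumerate(entities)}
    query = highlight.replace(answer_entity, "@placeholder")
    for e, m in markers.items():
        query = query.replace(e, m)
    return query, markers[answer_entity]

highlight = "Alice Smith met Bob Jones in Paris"
entities = ["Alice Smith", "Bob Jones", "Paris"]
print(make_cloze(highlight, entities, "Bob Jones"))
# ('@entity0 met @placeholder in @entity2', '@entity1')
```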
[link]
#### Introduction * The paper presents a suite of benchmark tasks to evaluate end-to-end dialogue systems such that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent. * [Link to the paper](https://research.facebook.com/publications/evaluating-prerequisite-qualities-for-learning-end-to-end-dialog-systems/) #### Dataset * Created using large-scale real-world sources - OMDB (Open Movie Database), MovieLens and Reddit. * Consists of ~75K movie entities and ~3.5M training examples. #### Tasks ##### QA Task * Answering Factoid Questions without relation to the previous dialogue. * KB(Knowledge Base) created using OMDB and stored as triplets of the form (Entity, Relation, Entity). * Question (in Natural Language Form) generated by creating templates using [SimpleQuestions](https://arxiv.org/abs/1506.02075) * Instead of giving out just 1 response, the system ranks all the answers in order of their relevance. ##### Recommendation Task * Providing personalised responses to the user via recommendation instead of providing universal facts as in case 1. * MovieLens dataset with a *user x item* matrix of ratings. * Statements (for any user) are generated by sampling highly ranked movies by the user and forming a statement about these movies using natural language templates. * Like the previous case, a list of ranked responses is generated. ##### QA + Recommendation Task * Maintaining short dialogues involving both factoid and personalised content. * Dataset consists of short conversations of 3 exchanges (3 from each participant). ##### Reddit Discussion Task * Identify most likely response is discussions on Reddit. * Data processed to flatten the potential conversation so that it appears to be a two participant conversation. ##### Joint Task * Combines all the previous tasks into one single task to test all the skills at once. #### Models Tested * **Memory Networks** - Comprises of a memory component that includes both long term memory and short term context. * **Supervised Embedding Models** - Sum the word embeddings of the input and the target independently and compare them with a similarity metric. * **Recurrent Language Models** - RNN, LSTM, SeqToSeq * **Question Answering Systems** - Systems that answer natural language questions by converting them into search queries over a KB. * **SVD(Singular Value Decomposition)** - Standard benchmark for recommendation. * **Information Retrieval Models** - Given a message, find the most similar message in the training dataset and report its output or find a most similar response to input directly. #### Result ##### QA Task * QA System > Memory Networks > Supervised Embeddings > LSTM ##### Recommendation Task * Supervised Embeddings > Memory Networks > LSTM > SVD ##### Task Involving Dialog History * QA + Recommendation Task and Reddit Discussion Task * Memory Networks > Supervised Embeddings > LSTM ##### Joint Task * Supervised word embeddings perform very poorly even when using a large number of dimensions (2000 dimensions). * Memory Networks perform better than embedding models as they can utilise the local context and the long-term memory. But they do not perform as well on standalone QA tasks. |
[link]
#### Introduction * The paper explains how to apply dropout to LSTMs and how it reduces overfitting in tasks like language modelling, speech recognition, image caption generation and machine translation. * [Link to the paper](https://arxiv.org/abs/1409.2329) #### [Dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) * Regularisation method that drops out (temporarily removes) units from the network, along with all their incoming and outgoing connections. * Conventional dropout does not work well with RNNs as the recurrence amplifies the noise and hurts learning. #### Regularization * The paper proposes to apply dropout only to the non-recurrent connections. * The dropout operator corrupts information carried by some units (and not all), forcing them to perform their intermediate computations more robustly. * The information is corrupted $L+1$ times, where $L$ is the number of layers, independently of the number of timesteps traversed by the information. #### Observation * In the context of language modelling, image caption generation, speech recognition and machine translation, dropout enables training larger networks and reduces the test error in terms of perplexity and frame accuracy. |
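A minimal PyTorch-style sketch of the proposed regularisation, assuming a stack of `LSTMCell`s: dropout is applied only to the vertical (layer-to-layer and input/output) connections and never to the hidden state carried across time. This is an illustration of the idea rather than the authors' implementation; the sizes and dropout rate are arbitrary.

```python
# Sketch of dropout applied only to non-recurrent connections (assumes PyTorch).
import torch
import torch.nn as nn

class DropoutLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, p_drop=0.5):
        super().__init__()
        sizes = [input_size] + [hidden_size] * num_layers
        self.cells = nn.ModuleList(
            [nn.LSTMCell(sizes[i], sizes[i + 1]) for i in range(num_layers)])
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                      # x: (seq_len, batch, input_size)
        batch = x.size(1)
        states = [(torch.zeros(batch, c.hidden_size),
                   torch.zeros(batch, c.hidden_size)) for c in self.cells]
        outputs = []
        for x_t in x:                          # recurrence over time
            inp = x_t
            for i, cell in enumerate(self.cells):
                # dropout only on the "vertical" input to each layer ...
                h, c = cell(self.drop(inp), states[i])
                states[i] = (h, c)             # ... never on h_{t-1}, c_{t-1}
                inp = h
            outputs.append(self.drop(inp))     # and before the output layer
        return torch.stack(outputs)

out = DropoutLSTM(10, 20, num_layers=2)(torch.randn(5, 3, 10))
print(out.shape)                               # torch.Size([5, 3, 20])
```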
[link]
#### Introduction * Automated Theorem Proving (ATP) - Attempting to prove mathematical theorems automatically. * Bottlenecks in ATP: * **Autoformalization** - Semantic or formal parsing of informal proofs. * **Automated Reasoning** - Reasoning about already formalised proofs. * The paper evaluates the effectiveness of neural sequence models for premise selection (related to automated reasoning) without using hand-engineered features. * [Link to the paper](https://arxiv.org/abs/1606.04442) #### Premise Selection * Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A. #### Dataset * Mizar Mathematical Library (MML) used as the formal corpus. * The premise length varies from 5 to 84299 characters, and over 60% of the words occur fewer than 10 times (rare word problem). #### Approach * The model predicts the probability that a given axiom is useful for proving a given conjecture. * Conjecture and axiom sequences are separately embedded into fixed-length real vectors, then concatenated and passed to a third network with a few fully connected layers and a logistic loss. * The two embedding networks and the joint predictor path are trained jointly. ##### Stage 1: Character-level Models * Treat premises at the character level using an 80-dimensional one-hot encoding vector. * Architectures for embedding: * Pure recurrent LSTM and GRU networks * CNN (with max pooling) * Recurrent-convolutional network that shortens the sequence using a convolutional layer before feeding it to the LSTM. ##### Stage 2: Word-level Models * The MML dataset contains both implicit and explicit definitions. * To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded and the definition embeddings are used as word-level embeddings. * This is better than recursively expanding and embedding the word definition, as the definition chains can be very deep. * Once word-level embeddings are obtained, the architectures from the character-level models can be reused. #### Experiments ##### Metrics * For each conjecture, the model ranks the possible premises. * The primary metric is the number of conjectures proved from the top-k premises. * Average Max Relative Rank (AMRR) is a more sophisticated measure based on the motivation that conjectures are easier to prove if all their dependencies occur earlier in the ranking. * Since it is very costly to rank all axioms for a conjecture, an approximation is made and a fixed number of random false dependencies are used for evaluating AMRR. ##### Network Training * Asynchronous distributed stochastic gradient descent using the Adam optimizer. * Clipped vs non-clipped gradients. * Maximum sequence length of 2048 for character-level sequences and 500 for word-level sequences. ##### Results * Best selection pipeline - Stage 1 character-level CNN which produces word-level embeddings for the next stage. * Best models use simple CNNs followed by max pooling, and the two-stage definition-based def-CNN outperforms the naive word-CNN (where word embeddings are learnt in a single pass). |
[link]
#### Introduction * The paper presents a domain-agnostic approach for conversational modelling based on the [Sequence to Sequence Learning Framework](https://gist.github.com/shagunsodhani/e3608ccf262d6e5a6b537128c917c92c). * [Link to the paper](http://arxiv.org/abs/1506.05869) #### Model * Neural Conversational Model (NCM) * A Recurrent Neural Network (RNN) reads the input sentence, one token at a time, and predicts the output sequence, one token at a time. * Learns by backpropagation. * The model is trained to maximise the log probability of the correct sequence given its context (i.e., it minimises the cross-entropy loss). * Greedy inference approach where the predicted output token is used as input to predict the next output token. #### Dataset * IT HelpDesk dataset of conversations about computer-related issues. * OpenSubtitles dataset containing movie conversations. #### Results * The paper reports some samples of conversations generated by the interaction between a human actor and the NCM. * NCM reports lower perplexity compared to an n-gram model. * NCM outperforms CleverBot in a subjective test involving human evaluators grading the two systems. #### Strengths * Domain-agnostic. * End-To-End training without handcrafted rules. * Underlying architecture (Sequence To Sequence Framework) can be leveraged for machine translation, question answering, etc. #### Weakness * The responses are simple, short and at times inconsistent. * The objective function of the Sequence To Sequence Framework is not designed to capture the objective of conversational models. |
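The greedy inference step above can be sketched in a few lines of Python; `next_token_logprobs` is an assumed stand-in for the trained RNN, not an API from the paper.

```python
# Minimal sketch of greedy decoding: feed the predicted token back in as input
# until an end-of-sequence token is produced.
def greedy_decode(next_token_logprobs, context, max_len=20, eos="</s>"):
    prefix = []
    for _ in range(max_len):
        logprobs = next_token_logprobs(context, prefix)   # dict: token -> logp
        token = max(logprobs, key=logprobs.get)           # pick the best token
        prefix.append(token)
        if token == eos:
            break
    return prefix

# toy "model" that always produces the same reply, just to show the interface
canned = iter(["i", "am", "fine", "</s>"])
print(greedy_decode(lambda ctx, pre: {next(canned): 0.0}, "how are you ?"))
```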
[link]
#### Introduction * Recurrent Neural Networks (RNNs) are very powerful at modelling sequences but they are not good at learning long-term dependencies. * The paper discusses the reasons behind this difficulty and some suggestions to mitigate it. * [Link to the paper.](https://arxiv.org/abs/1212.0901) #### Optimization Difficulty * RNNs form a deterministic state variable h<sup>t</sup> as a function of the input observation and the previous state. * Learnable parameters decide what will be remembered about the past sequence. * Local optimisation techniques like Stochastic Gradient Descent (SGD) are unlikely to find optimal values of the tunable parameters. * When the computations performed by the RNN are unfolded through time, a deep neural network with shared weights is realised. * The cost function of this deep network depends on the output of the hidden layers. * Gradient descent updates could "explode" (become very large) or "vanish" (become very small). #### Training Recurrent Networks * **Clip Gradient** - when the norm of the gradient vector $g$ is above a threshold, the gradient is rescaled to $\text{threshold} \cdot g/\|g\|$ before the update. This clipping implements a simple form of second-order optimisation in the sense that the second-order derivative is also large in regions of exploding gradients. * Use a **leaky integration** state-to-state map: $h_{t,i} = \alpha_{i}h_{t-1,i} + (1-\alpha_{i})F_{i}(h_{t-1}, x_{t})$. Different values of $\alpha_i$ allow a different amount of the previous state to "leak" through the unfolded layers further in time. This simply expands the time-scale of the vanishing gradients and does not remove them entirely. * Use **output probability models** like the Restricted Boltzmann Machine or NADE to capture higher-order dependencies between variables in the case of multivariate prediction. * By using **rectifier non-linearities**, the gradient on hidden units becomes sparse and these sparse gradients help the hidden units to specialise. The basic idea is that if the gradient is concentrated in fewer paths (in the unfolded computational graph), the vanishing gradient effect is limited. * A **simplified Nesterov Momentum** rule is proposed to allow storing past velocities for a longer time while actually using these velocities more conservatively. The new formulation is also easier to implement. #### Results * SGD with these optimisations outperforms vanilla SGD. |
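The first two tricks are easy to state in code; the numpy sketch below shows norm clipping and the leaky-integration update, with toy shapes and $\alpha$ values chosen purely for illustration.

```python
# Numpy sketch of gradient norm clipping and the leaky-integration update.
import numpy as np

def clip_by_norm(g, threshold):
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

def leaky_update(h_prev, f_out, alpha):
    # h_t = alpha * h_{t-1} + (1 - alpha) * F(h_{t-1}, x_t), elementwise
    return alpha * h_prev + (1.0 - alpha) * f_out

g = np.array([3.0, 4.0])                 # norm 5
print(clip_by_norm(g, 1.0))              # rescaled to unit norm: [0.6, 0.8]
print(leaky_update(np.ones(2), np.zeros(2), alpha=np.array([0.9, 0.5])))
# -> [0.9, 0.5]: per-unit leak rates control how much of the past persists
```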
[link]
#### Introduction * **Machine Comprehension (MC)** - given a natural language sentence, answer a natural language question. * **End-To-End MC** - cannot use language resources like dependency parsers. The only supervision during training is the correct answer. * **Query Regression Network (QRN)** - Variant of Recurrent Neural Network (RNN). * [Link to the paper](http://arxiv.org/abs/1606.04582) #### Related Work * Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices to model sequential data but perform poorly on end-to-end MC due to long-term dependencies. * Attention models with shared external memory focus on single sentences in each layer, but the models tend to be insensitive to the time step of the sentence being accessed. * **Memory Networks (and MemN2N)** * Add a time-dependent variable to the sentence representation. * Summarize the memory in each layer to control attention in the next layer. * **Dynamic Memory Networks (and DMN+)** * Combine RNN and attention mechanism to incorporate time dependency. * Uses 2 GRUs * time-axis GRU - Summarize the memory in each layer. * layer-axis GRU - Control the attention in each layer. * QRN is a much simpler model without any memory-summarization node. #### QRN * Single recurrent unit that updates its internal state through time and layers. * Inputs * $q_{t}$ - local query vector * $x_{t}$ - sentence vector * Outputs * $h_{t}$ - reduced query vector * $x_{t}$ - sentence vector without any modifications * Equations * $z_{t} = \alpha (x_{t}, q_{t})$ * $\alpha$ is the **update gate function** to measure the relevance between the input sentence and the local query. * $h'_{t} = \gamma (x_{t}, q_{t})$ * $\gamma$ is the **regression function** to transform the local query into a regressed query. * $h_{t} = z_{t} h'_{t} + (1 - z_{t}) h_{t-1}$ * To create a multi-layer model, the output of the current layer becomes the input to the next layer. #### Variants * **Reset gate function** ($r_{t}$) to reset or nullify the regressed query $h'_{t}$ (inspired by the GRU). * The new equation becomes $h_{t} = z_{t} r_{t} h'_{t} + (1 - z_{t}) h_{t-1}$ * **Vector gates** - update and reset gate functions can produce vectors instead of scalar values (for finer control). * **Bidirectional** - QRN can look at both past and future sentences while regressing the queries. * $q_{t}^{k+1} = h_{t}^{k, \text{forward}} + h_{t}^{k, \text{backward}}$. * The variables of the update and regression functions are shared between the two directions. #### Parallelization * Unlike most RNN based models, recurrent updates in QRN can be computed in parallel across time. * For details and equations, refer to the [paper](http://arxiv.org/abs/1606.04582). #### Module Details ##### Input Modules * A trainable embedding matrix A is used to encode the one-hot vector of each word in the input sentence into a d-dimensional vector. * A Position Encoder is used to obtain the sentence representation from the d-dimensional vectors. * Question vectors are also obtained in a similar manner. ##### Output Module * A V-way single-layer softmax classifier is used to map the predicted answer vector $y$ to a V-dimensional sparse vector $v$. * The natural language answer $y$ is the argmax word in $v$. #### Results * [bAbI QA](https://gist.github.com/shagunsodhani/12691b76addf149a224c24ab64b5bdcc) dataset used. * QRN on the 1K dataset with the '2rb' (2 layers + reset gate + bidirectional) model and on the 10K dataset with the '2rvb' (2 layers + reset gate + vector gate + bidirectional) model outperforms MemN2N 1K and 10K models respectively. 
* Though DMN+ outperforms QRN by a small margin, QRNs are simpler and faster to train (the paper comments on training speed without reporting the training times of the two models). * With very few layers the model lacks reasoning ability, while with too many layers the model becomes difficult to train. * Using vector gates helps for large datasets but hurts for small ones. * Unidirectional models perform poorly. * The intermediate query updates can be interpreted in natural language to understand the flow of information in the network. |
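For concreteness, here is a hedged numpy sketch of a single QRN layer following the recurrence above; the particular parameterisation of the update gate and regression function (weight matrices over the concatenated sentence and query) is my own simplification, not the paper's exact formulation.

```python
# Illustrative sketch of one QRN layer with scalar gates:
# z_t gates between the regressed query h'_t and the previous state h_{t-1}.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrn_layer(sentences, query, Wz, Wr):
    d = query.shape[0]
    h = np.zeros(d)
    reduced = []
    for x_t in sentences:                                  # one sentence per step
        z_t = sigmoid(Wz @ np.concatenate([x_t, query]))   # scalar update gate
        h_cand = np.tanh(Wr @ np.concatenate([x_t, query]))  # regressed query h'_t
        h = z_t * h_cand + (1.0 - z_t) * h                 # h_t = z h' + (1-z) h_{t-1}
        reduced.append(h)
    return reduced                                         # reduced queries per step

d = 4
sentences = [np.random.randn(d) for _ in range(3)]
query = np.random.randn(d)
Wz, Wr = np.random.randn(1, 2 * d), np.random.randn(d, 2 * d)
print(qrn_layer(sentences, query, Wz, Wr)[-1].shape)       # (4,)
```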
[link]
#### Introduction * The paper proposes a general, end-to-end approach for sequence learning that uses two deep LSTMs, one to map the input sequence to a fixed-dimensional vector and another to map the vector to the output sequence. * For sequence learning, Deep Neural Networks (DNNs) require the dimensionality of input and output sequences to be known and fixed. This limitation is overcome by using the two LSTMs. * [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) #### Model * Recurrent Neural Networks (RNNs) generalize feed-forward neural networks to sequences. * Given a sequence of inputs $(x_1, x_2, \ldots, x_T)$, an RNN computes a sequence of outputs $(y_1, y_2, \ldots, y_T)$ by iterating the following equations: $$h_t = \mathrm{sigm}(W^{hx}x_t + W^{hh}h_{t-1})$$ $$y_t = W^{yh}h_t$$ * To map variable-length sequences, the input is mapped to a fixed-size vector using an RNN and this fixed-size vector is mapped to the output sequence using another RNN. * Given the long-term dependencies between the two sequences, LSTMs are preferred over RNNs. * LSTMs estimate the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed-dimensional representation and then computing the probability of the output with a standard LSTM-LM formulation. ##### Differences between the model and standard LSTMs * The model uses two LSTMs (one for the input sequence and another for the output sequence), thereby increasing the number of model parameters at negligible computing cost. * Model uses deep LSTMs (4 layers). * The words in the input sequences are reversed to introduce short-term dependencies and to reduce the "minimal time lag". By reversing the word order, the first few words in the source sentence (input sentence) are much closer to the first few words in the target sentence (output sentence), thereby making it easier for the LSTM to "establish" communication between input and output sentences. #### Experiments * WMT'14 English to French dataset containing 12 million sentences consisting of 348 million French words and 304 million English words. * Model tested on the translation task and on the task of re-scoring the n-best results of a baseline approach. * Deep LSTMs trained on sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$. * The training objective is to maximize this log probability, averaged over all the pairs in the training set. * The most likely translation is found by performing a simple, left-to-right beam search for translation (a generic sketch of beam search is given below). * A hard constraint is enforced on the norm of the gradient to avoid the exploding gradient problem. * Minibatches are selected to have sentences of similar lengths to reduce training time. * The model performs better when reversed sentences are used for training. * While the model does not beat the state-of-the-art, it is the first pure neural translation system to outperform a phrase-based SMT baseline. * The model performs well on long sentences, with only a minor degradation for the longest sentences. * The paper prepares the ground for the application of sequence-to-sequence learning models in other domains by demonstrating how a simple and relatively unoptimised neural model can outperform a mature SMT system on translation tasks. |
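Since decoding is done with a simple left-to-right beam search, a generic hedged sketch of that procedure is given below; `step_fn` is an assumed callback standing in for the trained decoder, not an interface from the paper.

```python
# Generic left-to-right beam search. `step_fn(prefix)` returns a dict
# {token: log_prob} for the next token given the decoded prefix; EOS ends a beam.
def beam_search(step_fn, beam_size=2, max_len=20, eos="</s>"):
    beams = [([], 0.0)]                       # (prefix tokens, cumulative log prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:   # keep the best partial outputs
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])

# toy step function, just to show the calling convention
print(beam_search(lambda p: {"</s>": -0.1, "hello": -0.5} if p
                  else {"hello": -0.2, "hi": -0.3}, beam_size=2))
```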
[link]
#### Introduction * The paper explores the challenges involved in training deep networks, the effect of unsupervised pre-training on the training process, and visualizes the error function landscape for deep architectures. * [Link to the paper](http://research.google.com/pubs/pub34923.html) #### Experiments * Datasets used - Shapeset and MNIST. * Train deep architectures with a variable number of layers, with and without pre-training. * Weights initialized by sampling uniformly from $[\frac{-1}{\sqrt{k}}, \frac{1}{\sqrt{k}}]$, where $k$ is the fan-in. #### Observations * Increasing depth (without pre-training) causes the error rate to go up faster than in the case of pre-training. * Pre-training also makes the network more robust to random initializations. * At the same training cost level, the pre-trained models systematically yield a lower cost than the randomly initialized ones. * Pre-training seems to be most advantageous for smaller training sets. * Pre-training appears to have a regularizing effect - it decreases the variance (over parameter configurations) by restricting the set of possible final configurations for parameter values, and introduces a bias. * Pre-training helps for larger layers (with a larger number of units per layer) and for deeper networks. But in the case of small networks, it can lower the performance. * As small networks tend to have a small capacity, this supports the hypothesis that pre-training exhibits a kind of regularizing effect. * Pre-training seems to provide a better marginal conditioning of the weights, though this is not its only benefit, as it also captures more intricate dependencies. * Pre-training the lower layers is more important (and impactful) than pre-training the layers closer to the output. * The error landscape seems to be flatter for deep architectures and for the case of pre-training. * Learning trajectories for pre-trained and not pre-trained models start and stay in different regions of function space. Moreover, trajectories of either type initially move together, but at some point they diverge. |
[link]
#### Introduction * Open-domain Question Answering (Open QA) - efficiently querying a large-scale knowledge base (KB) using natural language. * Two main approaches: * Information Retrieval * Transform the question (in natural language) into a valid query (in terms of the KB) to get a broad set of candidate answers. * Perform fine-grained detection on the candidate answers. * Semantic Parsing * Interpret the correct meaning of the question and convert it into an exact query. * Limitations: * Require human intervention to create the lexicon, grammar, and schema. * This work builds upon previous work where an embedding model learns low-dimensional vector representations of words and symbols. * [Link](https://arxiv.org/abs/1406.3676) to the paper. #### Task Definition * Input - Training set of questions (paired with answers). * KB providing a structure among the answers. * Answers are entities in the KB and questions are strings with one identified KB entity. * The paper uses FREEBASE as the KB. * Datasets * WebQuestions - Built using FREEBASE, the Google Suggest API, and Mechanical Turk. * FREEBASE triplets transformed into questions. * Clue Web Extractions dataset with entities linked with FREEBASE triplets. * Dataset of paraphrased questions using WIKIANSWERS. #### Embedding Questions and Answers * The model learns low-dimensional vector embeddings of words appearing in questions and of entities and relation types of FREEBASE, such that questions and their answers are represented close to each other in the joint embedding space. * Scoring function $S(q, a)$, where $q$ is a question and $a$ is an answer, generates a high score if $a$ answers $q$. * $S(q, a) = f(q)^{T} g(a)$ * $f(q)$ maps the question to the embedding space. * $f(q) = W \phi (q)$ * $W$ is a matrix of dimension $K \times N$ * $K$ - dimension of the embedding space (hyperparameter). * $N$ - total number of words/entities/relation types. * $\phi(q)$ - sparse vector encoding the number of times each word appears in $q$. * Similarly, $g(a) = W \psi (a)$ maps the answer to the embedding space. * $\psi(a)$ gives the answer representation, as discussed below. #### Possible Representations of Candidate Answers * Answer represented as a **single entity** from FREEBASE; $\psi(a)$ is then a 1-of-N encoded vector. * Answer represented as a **path** from the question entity to the answer. The paper considers only one- or two-hop paths, resulting in 3-of-N or 4-of-N encoded vectors (intermediate entities are not recorded). * A **subgraph representation** which encodes both the path and the entire subgraph of entities connected to the answer entity. Two embedding representations are used to differentiate between entities in the path and entities in the subgraph. * The subgraph approach is based on the hypothesis that including more information about the answers would improve results. #### Training and Loss Function * Minimize a margin-based ranking loss to learn the matrix $W$. * Stochastic Gradient Descent, multi-threaded with Hogwild. #### Multitask Training of Embeddings * To account for the large number of synthetically generated questions, the paper also multi-tasks the training of the model with paraphrase prediction. * Scoring function $S_{prp}(q_1, q_2) = f(q_1)^{T} f(q_2)$, where $f$ uses the same weight matrix $W$ as before. * A high score is assigned if $q_1$ and $q_2$ belong to the same paraphrase cluster. * Additionally, the model multitasks the task of mapping embeddings of FREEBASE entities (mids) to the actual words. #### Inference * For each question, a candidate set is generated. 
* The answer (from the candidate set) with the highest score is reported as the correct answer. * Candidate set generation strategies: * $C_1$ - All KB triplets containing the KB entity from the question form the candidate set. Answers are limited to 1-hop paths. * $C_2$ - Rank all relation types, keep the top 10 types, and add only those 2-hop candidates where the selected relations appear in the path. #### Results * The $C_2$ strategy outperforms the $C_1$ approach, supporting the hypothesis that a richer representation for answers can store more information. * The proposed approach outperforms the baseline methods but is outperformed by an ensemble of the proposed approach with a semantic parsing via paraphrasing model. |
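The scoring function is straightforward to sketch; the toy vocabulary, dimensions, and random $W$ below are illustrative assumptions layered on the definitions above.

```python
# Numpy sketch of S(q, a) = f(q)^T g(a) with f(q) = W * phi(q), g(a) = W * psi(a),
# where phi/psi are sparse bag-of-words / bag-of-symbols count vectors.
import numpy as np

def bag_of_symbols(tokens, vocab):
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            v[vocab[t]] += 1.0
    return v

vocab = {"who": 0, "directed": 1, "film_x": 2, "director_of": 3, "person_y": 4}
K, N = 8, len(vocab)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, N))                 # shared embedding matrix

def score(question_tokens, answer_symbols):
    f_q = W @ bag_of_symbols(question_tokens, vocab)
    g_a = W @ bag_of_symbols(answer_symbols, vocab)
    return float(f_q @ g_a)

print(score(["who", "directed", "film_x"], ["film_x", "director_of", "person_y"]))
```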
[link]
#### Introduction The [paper](http://arxiv.org/pdf/1502.05698v10) presents a framework and a set of synthetic toy tasks (classified into skill sets) for analyzing the performance of different machine learning algorithms. #### Tasks * **Single/Two/Three Supporting Facts**: Questions where a single(or multiple) supporting facts provide the answer. More is the number of supporting facts, tougher is the task. * **Two/Three Supporting Facts**: Requires differentiation between objects and subjects. * **Yes/No Questions**: True/False questions. * **Counting/List/Set Questions**: Requires ability to count or list objects having a certain property. * **Simple Negation and Indefinite Knowledge**: Tests the ability to handle negation constructs and model sentences that describe a possibility and not a certainty. * **Basic Coreference, Conjunctions, and Compound Coreference**: Requires ability to handle different levels of coreference. * **Time Reasoning**: Requires understanding the use of time expressions in sentences. * **Basic Deduction and Induction**: Tests basic deduction and induction via inheritance of properties. * **Position and Size Reasoning** * **Path Finding**: Find path between locations. * **Agent's Motivation**: Why an agent performs an action ie what is the state of the agent. #### Dataset * The dataset is available [here](https://research.facebook.com/research/-babi/) and the source code to generate the tasks is available [here](https://github.com/facebook/bAbI-tasks). * The different tasks are independent of each other. * For supervised training, the set of relevant statements is provided along with questions and answers. * The tasks are available in English, Hindi and shuffled English words. #### Data Simulation * Simulated world consists of entities of various types (locations, objects, persons etc) and of various actions that operate on these entities. * These entities have their internal state and follow certain rules as to how they interact with other entities. * Basic simulations are of the form: <actor> <action> <object> eg Bob go school. * To add variations, synonyms are used for entities and actions. #### Experiments ##### Methods * N-gram classifier baseline * LSTMs * Memory Networks (MemNNs) * Structured SVM incorporating externally labeled data ##### Extensions to Memory Networks * **Adaptive Memories** - learn the number of hops to be performed instead of using the fixed value of 2 hops. * **N-grams** - Use a bag of 3-grams instead of a bag-of-words. * **Nonlinearity** - Apply 2-layer neural network with *tanh* nonlinearity in the matching function. ##### Structured SVM * Uses coreference resolution and semantic role labeling (SRL) which are themselves trained on a large amount of data. * First train with strong supervision to find supporting statements and then use a similar SVM to find the response. ##### Results * Standard MemNN outperform N-gram and LSTM but still fail on a number of tasks. * MemNNs with Adaptive Memory improve the performance for multiple supporting facts task and basic induction task. * MemNNs with N-gram modeling improves results when word order matters. * MemNNs with Nonlinearity performs well on Yes/No tasks and indefinite knowledge tasks. * Structured SVM outperforms vanilla MemNNs but not as good as MemNNs with modifications. * Structured SVM performs very well on path finding task due to its non-greedy search approach. |
[link]
#### Introduction * LargeVis - a technique to visualize large-scale and high-dimensional data in low-dimensional space. * Problem relates to both information visualization and machine learning (and data mining) domain. * [Link to the paper](https://arxiv.org/abs/1602.00370) #### t-SNE * State of the art method for visualization problem. * Start by constructing a similarity structure from the data and then project the structure to 2/3 dimensional space. * An accelerated version proposed that uses a K-nearest Neighbor (KNN) graph as the similarity structure. * Limitations * Computational cost of constructing KNN graph for high-dimensional data. * Efficiency of graph visualization techniques breaks down for large data ($O(NlogN)$ complexity). * Parameters are sensitive to the dataset. #### LargeVis * Constructs KNN graph (in a more efficient manner as compared to t-SNE). * Uses a principled probabilistic model for graph visualization. * Let us say the input is in the form of $N$ datapoints in $d$ dimensional space. ##### KNN Graph Construction * Random projection tree used for nearest-neighborhood search in the high-dimensional space with euclidean distance as metric of distance. * Tree is constructed by partitioning the entire space and the nodes in the tree correspond to subspaces. * For every non-leaf node of the tree, select a random hyperplane that splits the subspace (corresponding to the leaf node) into two children. * This is done till the number of nodes in the subspace reaches a threshold. * Once the tree is constructed, each data point gets assigned a leaf node and points in the subspace of the leaf node are the candidate nearest neighbors of that datapoint. * For high accuracy, a large number of trees should be constructed (which increases the computational cost). * The paper counters this bottleneck by using the idea "a neighbor of my neighbor is also my neighbor" to increase the accuracy of the constructed graph. * Basically construct an approximate KNN graph using random projection trees and then for each node, search its neighbor's neighbors as well. * Edges are assigned weights just like in t-SNE. ##### Probabilistic Model For Graph Visualization * Given a pair of vertices, the probability of observing an edge between them is given as a function of the distance between the projection of the pair of vertices in the lower dimensional space. * The probability of observing an edge with weight $w$ is given as $w_{th}$ power of probability of observing an edge. * Given a weighted graph, $G$, the likelihood of the graph can be modeled as the likelihood of observed edges plus the likelihood of negative edges (vertex pairs without edges). * To simplify the objective function, some negative edges are sampled corresponding to each vertex using a noisy distribution. * The edges are sampled with probability proportional to their weight and then treated as binary edges. * The resulting objective function can be optimized using asynchronous stochastic gradient descent (very effective on sparse graphs). * The overall complexity is $O(sMN)$, $s$ is the dimension of lower dimensional space, $M$ is the number of negative samples and $N$ is the number of vertices. #### Experiments * Data is first preprocessed with embedding learning techniques like SkipGram and LINE and brought down to 100-dimensional space. * One limitation is that the data is preprocessed to reduce the number of dimensions to 100. This seems to go against the claim of scaling for hundreds of dimensions. 
* KNN construction is fastest for LargeVis followed by random projection trees, NN Descent, and vantage point trees (used in t-SNE). * LargeVis requires very few iterations to create highly accurate KNN graphs. * KNN Classifier is used to measure the accuracy of graph visualization quantitatively. * Compared to t-SNE, LargeVis is much more stable with respect to learning rate across datasets. * LargeVis benefits from its linear complexity (as compared to t-SNE's log linear complexity) and for default learning rate, outperforms t-SNE for larger datasets. |
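The "neighbour of my neighbour" refinement can be sketched directly; the numpy code below takes an approximate neighbour list and improves it in one pass. The toy points and deliberately poor initial graph are assumptions for illustration.

```python
# Sketch of the KNN-graph refinement heuristic: also consider each point's
# neighbours-of-neighbours as candidates and keep the k closest.
import numpy as np

def refine_knn(points, approx_neighbors, k):
    refined = {}
    for i, neighbors in approx_neighbors.items():
        candidates = set(neighbors)
        for j in neighbors:                    # "a neighbour of my neighbour ..."
            candidates.update(approx_neighbors[j])
        candidates.discard(i)
        ranked = sorted(candidates,
                        key=lambda j: np.linalg.norm(points[i] - points[j]))
        refined[i] = ranked[:k]
    return refined

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]])
approx = {0: [3], 1: [2], 2: [1], 3: [2]}      # deliberately poor initial graph
print(refine_knn(pts, approx, k=1))
# point 0's neighbour improves from the far-away point 3 to the nearby point 2
```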
[link]
#### Introduction * The paper demonstrates how simple CNNs, built on top of word embeddings, can be used for sentence classification tasks. * [Link to the paper](https://arxiv.org/abs/1408.5882) * [Implementation](https://github.com/shagunsodhani/CNN-Sentence-Classifier) #### Architecture * Pad input sentences so that they are of the same length. * Map words in the padded sentence using word embeddings (which may be either initialized as zero vectors or initialized as word2vec embeddings) to obtain a matrix corresponding to the sentence. * Apply convolution layer with multiple filter widths and feature maps. * Apply max-over-time pooling operation over the feature map. * Concatenate the pooling results from different layers and feed to a fully-connected layer with softmax activation. * Softmax outputs probabilistic distribution over the labels. * Use dropout for regularisation. #### Hyperparameters * RELU activation for convolution layers * Filter window of 3, 4, 5 with 100 feature maps each. * Dropout - 0.5 * Gradient clipping at 3 * Batch size - 50 * Adadelta update rule. #### Variants * CNN-rand * Randomly initialized word vectors. * CNN-static * Uses pre-trained vectors from word2vec and does not update the word vectors. * CNN-non-static * Same as CNN-static but updates word vectors during training. * CNN-multichannel * Uses two set of word vectors (channels). * One set is updated and other is not updated. #### Datasets * Sentiment analysis datasets for Movie Reviews, Customer Reviews etc. * Classification data for questions. * Maximum number of classes for any dataset - 6 #### Strengths * Good results on benchmarks despite being a simple architecture. * Word vectors obtained by non-static channel have more meaningful representation. #### Weakness * Small data with few labels. * Results are not very detailed or exhaustive. |
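A compact PyTorch-style sketch of the described architecture follows: parallel convolutions with filter widths 3, 4, 5 and 100 feature maps each, max-over-time pooling, dropout, and a linear softmax output. Vocabulary size, embedding dimension, and number of classes are toy values, and this is a hedged illustration rather than the paper's reference code.

```python
# Sketch of the CNN sentence classifier (assumes PyTorch).
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, num_classes=2,
                 widths=(3, 4, 5), maps=100, p_drop=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, maps, w) for w in widths])
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(maps * len(widths), num_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)      # (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        feats = self.drop(torch.cat(pooled, dim=1))   # concat pooled feature maps
        return self.out(feats)                    # logits; softmax applied in the loss

logits = TextCNN(vocab_size=5000)(torch.randint(0, 5000, (4, 20)))
print(logits.shape)                               # torch.Size([4, 2])
```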
[link]
#### Introduction * Method to visualize high-dimensional data points in 2/3 dimensional space. * Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation. * Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold. * t-SNE approach converts data into a matrix of pairwise similarities and visualizes this matrix. * Based on SNE (Stochastic Neighbor Embedding) * [Link to paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) #### SNE * Given a set of datapoints $x_1, x_2, ...x_n, p_{i|j}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. Calculation of $\sigma_i$ is described later. * Similarly, define $q_{i|j}$ as conditional probability corresponding to low-dimensional representations of $y_i$ and $y_j$ (corresponding to $x_i$ and $x_j$). The variance of Gaussian in this case is set to be $1/\sqrt{2}$ * Argument is that if $y_i$ and $y_j$ captures the similarity between $x_i$ and $x_j$, $p_{i|j}$ and $q_{i|j}$ should be equal. So objective function to be minimized is Kullback-Leibler (KL) Divergence measure for $P_i$ and $Q_i$, where $P_i$ ($Q_i$) represent conditional probability distribution given $x_i$ ($y_i$) * Since KL Divergence is not symmetric, the objective function focuses on retaining the local structure. * Users specify a value called perplexity (measure of effective number of neighbors). A binary search is performed to find $\sigma_i$ which produces the $P_i$ having same perplexity as specified by the user. * Gradient Descent (with momentum) is used to minimize objective function and Gaussian noise is added in early stages to perform simulated annealing. #### t-SNE (t-Distributed SNE) ##### Symmetric SNE * A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized. * Symmetric because $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$ * More robust to outliers and has a simpler gradient expression. ##### Crowding Problem * When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from moderately distant datapoints and gaps can not form between natural clusters. * One way around this problem is to use UNI-SNE but optimization of the cost function, in that case, is difficult. ##### t-SNE * Instead of Gaussian, use a heavy-tailed distribution (like Student-t distribution) to convert distances into probability scores in low dimensions. This way moderate distance in high-dimensional space can be modeled by larger distance in low-dimensional space. * Student-t distribution is an infinite mixture of Gaussians and density for a point under this distribution can be computed very fast. * The cost function is easy to optimize. ##### Optimization Tricks ###### Early Compression * At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another. * Implemented an L2-penalty term proportional to the sum of the squared distance of datapoints from the origin. 
###### Early Exaggeration * Scale all the $p_{i|j}$'s so that large $q_{i|j}$'s are obtained, with the effect that natural clusters in the data form tight, widely separated clusters, as a lot of empty space is created in the low-dimensional space. ##### t-SNE on large datasets * Space and time complexity is quadratic in the number of datapoints, making it infeasible to apply to large datasets. * Select a random subset of points (called landmark points) to display. * For each landmark point, define a random walk starting at that landmark point and terminating at any other landmark point. * $p_{i|j}$ is defined as the fraction of random walks starting at $x_i$ and finishing at $x_j$ (both these points are landmark points). This way, $p_{i|j}$ is not sensitive to "short-circuits" in the graph (due to noisy data points). #### Advantages of t-SNE * The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data. * Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates. * The local neighborhood size of each datapoint is determined on the basis of the local density of the data. * The random walk version of t-SNE takes care of the "short-circuit" problem. #### Limitations of t-SNE * It is unclear how t-SNE would perform on general **Dimensionality Reduction** for more than 3 dimensions. For such higher (than 3) dimensions, a Student-t distribution with more degrees of freedom should be more appropriate. * t-SNE reduces the dimensionality of data mainly based on local properties of the data, which means it would fail on data which has an intrinsically high-dimensional structure (**curse of dimensionality**). * The cost function for t-SNE is not convex, requiring several optimization parameters to be chosen, which affect the constructed solution. |
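The heavy-tailed low-dimensional similarity is simple enough to sketch; the numpy code below computes the Student-t (one degree of freedom) affinities $q_{ij}$ for a toy 2-D map.

```python
# Numpy sketch of the heavy-tailed similarity used by t-SNE in the low-dimensional
# map: q_ij is proportional to (1 + ||y_i - y_j||^2)^(-1), normalized over all pairs.
import numpy as np

def student_t_affinities(Y):
    sq_dists = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)   # ||y_i - y_j||^2
    num = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(num, 0.0)            # q_ii is defined to be 0
    return num / num.sum()

Y = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
Q = student_t_affinities(Y)
print(Q.round(3))   # the two nearby points share most of the probability mass
```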
[link]
### Introduction * *Curriculum Learning* - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks. * Motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum like a strategy. * [Link](http://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf) to the paper. ### Contributions of the paper * Explore cases that show that curriculum learning benefits machine learning. * Offer hypothesis around when and why does it happen. * Explore relation of curriculum learning with other machine learning approaches. ### Experiments with convex criteria * Training perceptron where some input data is irrelevant(not predictive of the target class). * Difficulty can be defined in terms of the number of irrelevant samples or margin from the separating hyperplane. * Curriculum learning model outperforms no-curriculum based approach. * Surprisingly, in the case of difficulty defined in terms of the number of irrelevant examples, the anti-curriculum strategy also outperforms no-curriculum strategy. ### Experiments on shape recognition with datasets having different variability in shapes * Standard(target) dataset - Images of rectangles, ellipses, and triangles. * Easy dataset - Images of squares, circles, and equilateral triangles. * Start performing gradient descent on easy dataset and switch to target data set at a particular epoch (called *switch epoch*). * For no-curriculum learning, the first epoch is the *switch epoch*. * As *switch epoch* increases, the classification error comes down with the best performance when *switch epoch* is half the total number of epochs. * Paper does not report results for higher values of *switch epoch*. ### Experiments on language modeling * Standard data set is the set of all possible windows of the text of size 5 from Wikipedia where all words in the window appear in 20000 most frequent words. * Easy dataset considers only those windows where all words appear in 5000 most frequent words in vocabulary. * Each word from the vocabulary is embedded into a *d* dimensional feature space using a matrix **W** (to be learnt). * The model predicts the score of next word, given a window of words. * Expected value of ranking loss function is minimized to learn **W**. * Curriculum Learning-based model overtakes the other model soon after switching to the target vocabulary, indicating that curriculum-based model quickly learns new words. ### Curriculum as a continuation method * Continuation methods start with a smoothed objective function and gradually move to less smoothed function. * Useful in the case where the objective function in non-convex. * Consider a family of cost functions $C_\lambda (\theta)$ such that $C_0(\theta)$ can be easily optimized and $C_1(\theta)$ is the actual objective function. * Start with $C_0 (\theta)$ and increase $\lambda$, keeping $\theta$ at a local minimum of $C_\lambda (\theta)$. * Idea is to move $\theta$ towards a dominant (if not global) minima of $C_1(\theta)$. * Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective. * The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting anyone training example at any step). 
### Advantages of Curriculum Learning * Faster training in the online setting as learner does not try to learn difficult examples when it is not ready. * Guiding training towards better local minima in parameter space, specifically useful for non-convex methods. ### Relation to other machine learning approaches * **Unsupervised preprocessing** - Both have a regularizing effect and lower the generalization error for the same training error. * **Active learning** - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy. * **Boosting Algorithms** - Difficult examples are gradually emphasised though the curriculum starts with a focus on easier examples and the training criteria do not change. * **Transfer learning** and **Life-long learning** - Initial tasks are used to guide the optimisation problem. ### Criticism * Curriculum Learning is not well understood, making it difficult to define the curriculum. * In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse. |
[link]
## Introduction * Neural Network with a recurrent attention model over a large external memory. * Continuous form of the Memory Network but with end-to-end training, so it can be applied to more domains. * Extension of RNNSearch and can perform multiple hops (computational steps) over the memory per output symbol. * [Link to the paper](http://arxiv.org/pdf/1503.08895v5.pdf). * [Link to the implementation](https://github.com/facebook/MemNN). ## Approach * Model takes as input $x_1,...,x_n$ (to store in memory), query $q$ and outputs answer $a$. ### Single Layer * Input set ($x_i$) embedded in D-dimensional space using embedding matrix $A$ to obtain memory vectors ($m_i$). * Query is also embedded, using matrix $B$, to obtain the internal state $u$. * Compute the match between each memory $m_i$ and $u$ in the embedding space, followed by a softmax operation, to obtain a probability vector $p$ over the inputs. * Each $x_i$ maps to an output vector $c_i$ (using embedding matrix $C$). * Output $o$ = sum of the transformed inputs $c_i$, weighted by $p_i$. * The sum of the output vector $o$ and the internal state $u$ is passed through a weight matrix $W$ followed by a softmax to produce the output. * $A$, $B$, $C$ and $W$ are learnt by minimizing a cross-entropy loss. ### Multiple Layers * For layers above the first layer, the input is $u^{k+1} = u^k + o^k$. * Each layer has its own $A^k$ and $C^k$ - with constraints. * At the final layer, the prediction is $\hat{a} = \text{softmax}(W(o^K + u^K))$ ### Constraints On Embedding Vectors * Adjacent * Output embedding for one layer is the input embedding for the next, i.e., $A^{k+1} = C^k$ * $W = C^K$ * $B = A^1$ * Layer-wise (RNN-like) * Same input and output embeddings across layers, i.e., $A^1 = A^2 = ... = A^K$ and $C^1 = C^2 = ... = C^K$. * A linear mapping $H$ is added to the update of $u$ between hops. * $u^{k+1} = Hu^k + o^k$. * $H$ is also learnt. * Think of this as a traditional RNN with 2 outputs * Internal output - used for memory consideration * External output - the predicted result * $u$ becomes the hidden state. * $p$ is an internal output which, combined with $C$, is used to update the hidden state. ## Related Architectures * RNN - Memory stored as the state of the network; unusable in long temporal contexts. * LSTM - Locks the network state using local memory cells. Fails over longer temporal contexts. * Memory Networks - Use global memory. * Bidirectional RNN - Uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states but, unlike MemNN, performs only a single pass over the memory. ## Sentence Representation for Question Answering Task * Bag-of-words representation * Input sentences and questions are embedded as a bag of words. * Cannot capture the order of the words. * Position Encoding * Takes into account the order of words. * Temporal Encoding * Temporal information encoded by a matrix $T_A$; memory vectors are modified as $m_i = \sum_j Ax_{ij} + T_A(i)$ * Random Noise * Dummy memories (empty memories) are added at training time to regularize $T_A$. * Linear Start (LS) training * Removes the softmax layers when starting training and inserts them when the validation loss stops decreasing. ## Observations * Best MemN2N models are close to supervised models in performance. * Position Encoding improves over the bag-of-words approach. * Linear Start helps to avoid local minima. * Random Noise gives a small yet consistent boost in performance. * More computational hops lead to improved performance. 
* For Language Modelling Task, some hops concentrate on recent words while other hops have more broad attention span over all memory locations. |
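A single memory hop can be sketched in a few lines of numpy; the bag-of-words inputs and random embedding matrices below are toy stand-ins, and only one hop (without the final $W$ plus softmax) is shown.

```python
# Numpy sketch of one memory hop: embed the inputs with A and C, the query with B,
# attend with a softmax over inner products, and combine.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(bow_inputs, bow_query, A, B, C):
    m = bow_inputs @ A.T          # memory vectors m_i = A * x_i
    c = bow_inputs @ C.T          # output vectors  c_i = C * x_i
    u = B @ bow_query             # internal state  u = B * q
    p = softmax(m @ u)            # attention over memories
    o = p @ c                     # weighted sum of output vectors
    return o + u                  # input to the next hop (or to W + softmax)

V, d, n = 50, 20, 4               # vocab size, embedding dim, number of memories
rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(d, V)) for _ in range(3))
x = rng.integers(0, 2, size=(n, V)).astype(float)   # bag-of-words memories
q = rng.integers(0, 2, size=V).astype(float)
print(memory_hop(x, q, A, B, C).shape)               # (20,)
```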
[link]
#### Introduction * Knowledge Bases (KBs) are effective tools for Question Answering (QA) but are often too restrictive (due to a fixed schema) and too sparse (due to limitations of Information Extraction (IE) systems). * The paper proposes Key-Value Memory Networks, a neural network architecture based on [Memory Networks](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e) that can leverage both KBs and raw data for QA. * The paper also introduces MOVIEQA, a new QA dataset that can be answered by a perfect KB, by Wikipedia pages, and by an imperfect KB obtained using IE techniques, thereby allowing a comparison between systems using any of the three sources. * [Link to the paper](https://arxiv.org/abs/1606.03126). ### Related Work * TRECQA and WIKIQA are two benchmarks where systems need to select the sentence containing the correct answer, given a question and a list of candidate sentences. * These datasets are small and make it difficult to compare systems using different sources. * Best results on these benchmarks are reported by CNNs and RNNs with an attention mechanism. ### Key-Value Memory Networks * Extension of the [Memory Networks Model](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e). * Generalises the way context is stored in memory. * Comprises a memory made of slots in the form of pairs of vectors $(k_{1}, v_{1}), ..., (k_{m}, v_{m})$ to encode long-term and short-term context. #### Reading the Memory * **Key Hashing** - The question *x* is used to preselect a subset $(k_{h_1}, v_{h_1}), ..., (k_{h_N}, v_{h_N})$ where the key shares at least one word with *x* whose frequency is less than 1000. * **Key Addressing** - Each candidate memory is assigned a relevance probability: * $p_{h_i} = \text{softmax}(A\phi_X(x) \cdot A\phi_K(k_{h_i}))$ * $\phi$ is a feature map of dimension $D$ and $A$ is a $d \times D$ matrix. * **Value Reading** - The values of the memories are read by taking their weighted sum using the addressing probabilities, and a vector $o$ is returned. * $o = \sum_i p_{h_i} A\phi_V(v_{h_i})$ * The memory access process is conducted by a "controller" neural network using $q = A\phi_X(x)$ as the query. * The query is updated using $q_2 = R_1(q + o)$ * The addressing and reading steps are repeated using new $R_i$ matrices to retrieve more pertinent information in subsequent accesses. * After a fixed number of hops, H, the resulting state of the controller is used to compute a final prediction. * $\hat{a} = \text{argmax}_i(\text{softmax}(q_{H+1}^{T} B\phi_Y(y_i)))$ where $y_i$ are the possible candidate outputs and $B$ is a $d \times D$ matrix. * The network is trained end-to-end using a cross-entropy loss, backpropagation and stochastic gradient descent. * End-to-End Memory Networks can be viewed as a special case of Key-Value Memory Networks by setting the key and the value to be the same for all memories. #### Variants of Key-Value Memories * $\phi_X$ and $\phi_Y$ - the feature maps corresponding to the query and the answer are fixed as bag-of-words representations. ##### KB Triple * Triplets of the form "subject relation object" can be represented in Key-Value Memory Networks with the subject and relation as the key and the object as the value. * In standard Memory Networks, the whole triplet would have to be embedded in the same memory slot. * The reversed relations "object is_related_to subject" can also be stored. ##### Sentence Level * A document can be split into sentences, with each sentence encoded as a bag-of-words in both the key and the value of a memory slot. 
##### Window Level * Split the document into windows of W words and represent each window as a bag-of-words. * The window becomes the key and the central word becomes the value. ##### Window + Centre Encoding * Instead of mixing the window centre with the rest of the words, double the size of the dictionary and encode the centre of the window and the value using the second dictionary. ##### Window + Title * Since the title of the document could contain useful information, the word window can be encoded as the key and the document title as the value. * The key could be augmented with features like "_window_" and "_title_" to distinguish between the different cases. ### MOVIEQA Benchmark #### Knowledge Representation * Doc - Raw documents (from Wikipedia) related to movies. * KB - Graph-based KB made of entities and relations. * IE - Performing Information Extraction on Wikipedia to create a KB. * The QA pairs should be answerable by both the raw documents and the KB so that the three approaches can be compared and the gap between the three solutions can be closed. * The dataset has more than 100,000 QA pairs, making it much larger than most existing datasets. ### Experiments #### MOVIEQA ##### Systems Compared * [Bordes et al.'s QA system](TBD) * [Supervised Embeddings](TBD) (without KB) * [Memory Networks](TBD) * Key-Value Memory Networks ##### Observations * Key-Value Memory Networks outperform all other methods on all data sources. * KB > Doc > IE * The best memory representation for directly reading documents uses "Window Level + Centre Encoding + Title". ##### KB vs Synthetic Document Analysis * Given KB triplets, construct synthetic "Wikipedia" articles using templates, conjunctions and coreferences to determine the causes of the gap in performance when using the KB vs the documents. * The loss on one-template sentences is due to the difficulty of extracting the subject, relation and object from the artificial docs. * Using multiple templates does not deteriorate performance much, but conjunctions and coreferences cause a dip in performance. #### WIKIQA * Given a question, select the sentence (from a Wikipedia document) that best answers the question. * Key-Value Memory Networks outperform all other solutions, though they are only marginally better than LDC and attentive models based on CNNs and RNNs. |
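To make the key addressing and value reading steps concrete, here is a minimal numpy sketch of a single memory hop. It is an illustrative toy rather than the authors' implementation: the bag-of-words feature maps, the random matrices `A` and `R`, and all the sizes are made-up stand-ins that only mirror the notation used above ($\phi$ of dimension $D$, $A$ of size $d \times D$).

```
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: D = bag-of-words dimension, d = embedding dimension.
D, d, n_memories = 1000, 50, 8
rng = np.random.default_rng(0)

A = rng.normal(scale=0.1, size=(d, D))   # shared embedding matrix
R = rng.normal(scale=0.1, size=(d, d))   # hop update matrix R_1

phi_x = rng.integers(0, 2, size=D).astype(float)                 # query bag-of-words
phi_K = rng.integers(0, 2, size=(n_memories, D)).astype(float)   # key features
phi_V = rng.integers(0, 2, size=(n_memories, D)).astype(float)   # value features

q = A @ phi_x                                          # controller state q = A phi_X(x)
p = softmax(np.array([q @ (A @ k) for k in phi_K]))    # key addressing
o = sum(p_i * (A @ v) for p_i, v in zip(p, phi_V))     # value reading
q2 = R @ (q + o)                                       # updated query for the next hop

print(p.round(3), q2.shape)
```

In the full model, $A$, $B$ and the $R_i$ matrices would of course be learned end-to-end with the cross-entropy loss described above, rather than sampled at random.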
[link]
## Problem Statement * Evaluating if LSTMs can express and learn short, simple programs (linear time, constant memory) in the sequence-to-sequence framework. * [Link to paper](http://arxiv.org/pdf/1410.4615v3.pdf) ## Approach * Formulate the program evaluation task as a sequence-to-sequence learning problem using RNNs. * Train on short programs that can be evaluated in linear time and constant memory - the RNN can perform only a single pass over the data and its memory is limited. * Two parameters control the difficulty of the program: * `length` : Number of digits in the integers that appear in the program. * `nesting` : Number of times operations can be combined with each other. * The LSTM reads the input program, one character at a time, and produces the output, one character at a time. ### Additional Learning Tasks * **Addition Task** - Given two numbers, the model learns to add them. This task becomes the baseline for comparing performance on other tasks. * **Memorization Task** - Given a random number, the model memorizes it and outputs it. The following techniques enhance the accuracy of the model: * **Input reversing** - Reversing the order of the input, while keeping the output fixed, introduces many short-term dependencies that help the LSTM learn the task. * **Input doubling** - Presenting the same input to the network twice enhances performance as the model gets to look at the input twice. ## Curriculum Learning Gradually increase the difficulty of the programs fed to the system. * **No Curriculum (baseline)** - Fixed `length` and fixed `nesting` programs are fed to the system. * **Naive Curriculum** - Start with `length` = 1 and `nesting` = 1 and keep increasing the values iteratively. * **Mix Strategy** - Randomly choose `length` and `nesting` to generate a mix of easy and difficult examples. * **Combined Strategy** - Each training example is obtained either by the naive curriculum strategy or the mix strategy. ## Network Architecture * 2 layers, unrolled for 50 steps. * 400 cells per layer. * Parameters initialized uniformly in [-0.08, 0.08]. * Minibatch size of 100. * Norm of the gradient clipped to be at most 5. * Start with a learning rate of 0.5, decreased by a factor of 0.8 after reaching a target accuracy of 95%. ## Observations The teacher forcing technique is used for computing accuracy, i.e., when predicting the $i^{th}$ digit, the correct first $i-1$ digits of the output are provided as input to the LSTM. The general trend is (combined, mix) > (naive, baseline). In certain cases of program evaluation, the baseline performs better than the naive curriculum strategy. Intuitively, the model would use all its memory to store patterns for inputs of a given size. When a larger input is provided, the model would have to restructure its memory patterns to learn the output for this new class of inputs. This process of memory restructuring may be causing the degraded performance of the naive strategy. The combined strategy combines the naive and mix strategies and hence reduces the need to restructure the memory patterns. While LSTMs can learn to map the character-level representation of simple programs to their correct output, the idea cannot extend to arbitrary programs due to the runtime limitations of conventional RNNs and LSTMs. Moreover, while curriculum learning clearly helps, the optimal curriculum strategy needs to be understood further. |
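As a concrete illustration of the kind of training pairs involved, the sketch below builds a toy addition example controlled by a `length` parameter and applies the input-reversing and input-doubling tricks to the character sequence. The exact program/output formatting is a guess for illustration, not the paper's generator.

```
import random

def make_addition_example(length):
    """Toy 'program': add two random integers with `length` digits each."""
    a = random.randint(10 ** (length - 1), 10 ** length - 1)
    b = random.randint(10 ** (length - 1), 10 ** length - 1)
    src = f"{a}+{b}."          # input, fed to the LSTM one character at a time
    tgt = f"{a + b}."          # expected output, emitted one character at a time
    return src, tgt

def reverse_input(src):
    # Input reversing: reverse the source characters, keep the target fixed.
    return src[::-1]

def double_input(src):
    # Input doubling: present the same input twice.
    return src + src

random.seed(0)
src, tgt = make_addition_example(length=4)
print("plain   :", src, "->", tgt)
print("reversed:", reverse_input(src), "->", tgt)
print("doubled :", double_input(src), "->", tgt)
```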
[link]
#### Introduction * The GraphLab abstraction exposes an asynchronous, dynamic, graph-parallel computation model in the shared-memory setting. * This paper extends the abstraction to the distributed setting. * [Link](http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf) to the paper. #### Characteristics of MLDM (Machine Learning and Data Mining) * Graph Structured Computation * Sometimes computation requires modeling dependencies between data. * eg modeling dependencies between similar users for the recommendation use case. * Asynchronous Iterative Computation * In many cases, asynchronous procedures outperform synchronous ones. * eg linear systems, belief propagation, stochastic optimization etc. * Dynamic Computation * Iterative computation converges asymmetrically. * Convergence can be accelerated by dynamic scheduling. * eg do not update parameters that have already converged. * Serializability * Ensuring that all parallel executions have an equivalent serial execution is desirable for both correctness and faster convergence. ### GraphLab Abstraction #### Data Graph * Store program state as a directed graph. * **G = (V, E, D)** where D is the user-defined data (model parameters, algorithm state, statistical data etc). * The graph data **D** is mutable but the structure of the graph **(V, E)** is immutable. #### Update Function * Stateless procedure that modifies the data within the scope of a vertex and schedules the execution of the *update* function on other vertices (a toy sketch appears later in this summary). * **Scope** of a vertex ($S_v$) - data corresponding to the vertex, its edges and its adjacent vertices. * **update**: $f(v, S_v) \rightarrow (S_v, T)$ where T is the set of vertices where the *update* function is scheduled to be invoked. * Scheduling of computation is decoupled from the movement of data and no message passing is required between vertices. #### Execution Model * Input to the model is G and T, the initial set of vertices to be updated. * During each step, a vertex is extracted from T, updated, and a set of vertices is added to T (for future computation). * Vertices in T can be executed in any order, with the only constraint that all vertices be eventually executed. #### Sync Operation * The sync operation runs in the background to maintain global aggregates concurrently. * These global values are read by the *update* function and written by the sync operation. #### Consistency Models * Full consistency * Full read/write access in the *scope*. * Scopes of concurrently updating vertices cannot overlap. * Edge consistency * Read/write access on the vertex and the adjacent edges but only read access to adjacent vertices. * Slightly overlapping scopes. * Vertex consistency * Write access to the vertex and read access to adjacent edges and vertices. * All vertices can run the update function simultaneously. ### Distributed Data Graph * Two-phase partitioning process for load balancing the graph on arbitrary cluster sizes. * In the first phase, partition the graph into k parts (k >> number of machines). * Each part, called an **atom**, is a file of graph-generating commands. * An atom also stores information about **ghosts** (the set of vertices and edges adjacent to the partition boundary). * The atom index file contains the connectivity structure and file locations for the k atoms as a meta-graph. * In the second phase, this meta-graph is partitioned over the physical machines. 
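The toy sketch below mimics the update-function abstraction ($f(v, S_v) \rightarrow (S_v, T)$) with a single-machine, PageRank-style computation and a simple dynamic-scheduling rule; consistency models, ghosts and distribution are deliberately ignored, so it only illustrates the programming model, not GraphLab itself.

```
# Toy illustration of the GraphLab update abstraction: f(v, S_v) -> (S_v, T).
# The "scope" of v is approximated here by the ranks of its in-neighbours.
from collections import deque

# Assumed toy graph: edges 0->2, 1->0, 1->2, 2->0, 2->1.
in_nbrs = {0: [1, 2], 1: [2], 2: [0, 1]}
out_nbrs = {0: [2], 1: [0, 2], 2: [0, 1]}
rank = {v: 1.0 for v in in_nbrs}

def update(v):
    """Recompute the rank of v from its scope; return the vertices to reschedule."""
    new_rank = 0.15 + 0.85 * sum(rank[u] / len(out_nbrs[u]) for u in in_nbrs[v])
    changed = abs(new_rank - rank[v]) > 1e-3       # dynamic scheduling criterion
    rank[v] = new_rank
    return out_nbrs[v] if changed else []          # schedule neighbours only if needed

T = deque(in_nbrs)      # initial set of vertices to update
while T:
    v = T.popleft()
    for u in update(v):
        if u not in T:
            T.append(u)
print(rank)
```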
### Distributed GraphLab Engines #### Chromatic Engine * A vertex coloring (no adjacent vertices have the same color) is constructed to serialize the parallel execution of dependent tasks (in our case, vertices in the graph). * For the edge consistency model, execute all vertices of the same color before moving to the next color and run the sync operation between color steps. * Changes to ghost vertices and edges are communicated asynchronously as they are made. * Vertex consistency is trivial - assign the same color to all vertices. * For full consistency, construct a second-order vertex coloring (no vertex shares the same color as any of its distance-two neighbors). #### Distributed Locking Engine * Associate reader-writer locks with each vertex. * Each machine can update only its local vertices. * Optimisations * The ghosting system uses caching to eliminate waits on remote, unchanged data. * Lock requests and synchronization are pipelined to hide network latency. * Each machine maintains a pipeline of vertices for which locks have been requested but not granted. * A vertex is executed once lock acquisition and data synchronization are complete. * Nonblocking reader-writer locks, which work through callback functions, are used. ### Fault Tolerance * Distributed checkpointing via two modes: * Synchronous checkpointing * Suspends computation to save all modified data since the last checkpoint. * Asynchronous checkpointing based on the Chandy-Lamport snapshot algorithm. * The snapshot step becomes an *update* function in the GraphLab abstraction. * Better than synchronous checkpointing. ### System Design * One instance of GraphLab runs on each machine. * These processes are symmetric and communicate via RPC. * The first process additionally acts as the master and computes the placement of atoms based on the atom index. * Each process maintains a local scheduler (for its vertices) and a cache to access remote data. * A distributed consensus algorithm decides when all the schedulers are empty. ### Observations * The biggest strength of the paper is its extensive experiments. * GraphLab benefits from the use of background asynchronous communication and pipelined locking, but its communication layer is not as efficient as MPI's communication layer. |
[link]
## Introduction * In machine learning, accuracy tends to increase with an increase in the number of training examples and the number of model parameters. * For large data, training becomes slow even on a GPU (due to increased CPU-GPU data transfer). * Solution: Distributed training and inference - DistBelief. * [Link to paper](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf) ## DistBelief * Framework for parallel and distributed computation in neural networks. * Exploits both model parallelism and data parallelism. * Two methods implemented over DistBelief - *Downpour SGD* and *Sandblaster L-BFGS*. ## Previous work in this domain * Distributed gradient computation using relaxed synchronization requirements and delayed gradient descent. * Lock-less asynchronous stochastic gradient descent in the case of sparse gradients. * Deep Learning * Most focus has been on training small models on a single machine. * Train multiple small models on a GPU farm and combine their predictions. * An alternate approach is to modify standard deep networks to make them more parallelizable. * Distributed computational tools * MapReduce - Not suited for iterative computations. * GraphLab - Cannot exploit the computing efficiencies available in deep networks. ## Model Parallelism * The user specifies the computation at each node in each layer of the neural network and the messages to be exchanged between different nodes. * Model parallelism is supported using multithreading (within a machine) and message passing (across machines). * The model can be partitioned across machines with DistBelief taking care of all communication/synchronization tasks. * Diminishing returns in terms of speed gain if too many machines/partitions are used, as partitioning helps only as long as the network overhead is small. ## Data Parallelism * Distributed training across multiple model instances (or replicas) to optimize a single function. ## Distributed Optimization Algorithms ### Downpour SGD * Asynchronous stochastic gradient descent (a toy simulation of the workflow appears at the end of this summary). * Divide the training data into subsets and run a copy of the model on each subset. * Centralized sharded parameter server * Keeps the current state of all the parameters of the model. * Used by model instances to share parameters. * Asynchronous approach, as both model replicas and parameter server shards run independently. * Workflow * A model replica requests an updated copy of the model parameters from the parameter server. * It processes mini-batches and sends the gradients to the server. * The server updates the parameters. * Communication can be reduced by limiting the rate at which parameters are requested and updates are pushed. * Robust to machine failure since multiple instances are running. * Relaxing consistency requirements works very well in practice. * Warm starting - Start training with a single replica and add other replicas later. ### Sandblaster L-BFGS * Sandblaster is the distributed batch optimization framework. * Provides a distributed implementation of L-BFGS. * A single 'coordinator' process interacts with the replicas and the parameter server to coordinate the batch optimization process. * Workflow * The coordinator issues commands like dot products to the parameter server shards. * The shards execute the operation, store the results locally and maintain additional information like a history cache. * Parameters are fetched at the beginning of each batch and the gradients are sent every few completed cycles. * Saves the overhead of sending all parameters and gradients to a single server. 
* Robust to machine failure, just like Downpour SGD. * The coordinator uses a load balancing scheme similar to "backup tasks" in MapReduce. ## Observation Downpour SGD (with the Adagrad adaptive learning rate) outperforms Downpour SGD (with a fixed learning rate) and Sandblaster L-BFGS. Moreover, Adagrad can be easily implemented locally within each parameter shard. It is surprising that Adagrad works so well with Downpour SGD, as it was not originally designed to be used with asynchronous SGD. The paper conjectures that "*Adagrad automatically stabilizes volatile parameters in the face of the flurry of asynchronous updates, and naturally adjusts learning rates to the demands of different layers in the deep network.*" ## Beyond DistBelief - TensorFlow * In November 2015, [Google open sourced TensorFlow](http://googleresearch.blogspot.in/2015/11/tensorflow-googles-latest-machine_9.html) as its second-generation machine learning system. DistBelief had limitations like tight coupling with Google's infrastructure and was difficult to configure. TensorFlow takes care of these issues and is twice as fast as DistBelief. Google also published a [white paper](http://download.tensorflow.org/paper/whitepaper2015.pdf) on the topic and the implementation can be found [here](http://www.tensorflow.org/). |
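As a cartoon of the Downpour SGD data flow described above (fetch parameters, compute a gradient on a mini-batch, push the update to a sharded parameter server), here is a single-process simulation on a toy linear-regression problem. The split into two shards, the learning rate and the absence of any real asynchrony are all simplifications for illustration, not the paper's system.

```
import numpy as np

rng = np.random.default_rng(0)

# "Parameter server": parameters split across two shards.
shards = [rng.normal(size=3), rng.normal(size=3)]

def fetch_params():
    return np.concatenate(shards)

def push_gradient(grad, lr=0.1):
    shards[0] -= lr * grad[:3]
    shards[1] -= lr * grad[3:]

# Synthetic linear-regression data; each "replica" sees its own subset.
w_true = np.arange(6, dtype=float)
X = rng.normal(size=(200, 6))
y = X @ w_true
replica_data = np.array_split(np.arange(200), 4)

for step in range(300):
    for idx in replica_data:                 # each "replica" takes its turn (no real asynchrony here)
        batch = rng.choice(idx, size=10)
        w = fetch_params()                   # 1. request current parameters
        grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
        push_gradient(grad)                  # 2. send the gradient to the server

print(np.round(fetch_params(), 2))           # should approach w_true
```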
[link]
## Introduction * Introduces techniques to learn word vectors from large text datasets. * Can be used to find similar words (semantically, syntactically, etc). * [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf) * [Link to open source implementation](https://code.google.com/archive/p/word2vec/) ## Model Architecture * Computational complexity is defined in terms of the number of parameters accessed during model training. * Proportional to $E*T*Q$ * *E* - Number of training epochs * *T* - Number of words in the training set * *Q* - depends on the model ### Feedforward Neural Net Language Model (NNLM) * Probabilistic model with input, projection, hidden and output layers. * The input layer encodes the N previous words using 1-of-V encoding (V is the vocabulary size). * The input layer is projected to a projection layer P with dimensionality *N\*D*. * The hidden layer (of size *H*) computes the probability distribution over all words. * Complexity per training example: $Q = N*D + N*D*H + H*V$ * *Q* can be reduced by using hierarchical softmax and a Huffman binary tree (for storing the vocabulary). ### Recurrent Neural Net Language Model (RNNLM) * Similar to NNLM minus the projection layer. * Complexity per training example: $Q = H*H + H*V$ * Hierarchical softmax and Huffman trees can be used here as well. ## Log-Linear Models * The nonlinear hidden layer causes most of the complexity. * NNLMs can be successfully trained in two steps: * Learn continuous word vectors using simple models. * Train an N-gram NNLM over the word vectors. ### Continuous Bag-of-Words Model * Similar to the feedforward NNLM. * No nonlinear hidden layer. * The projection layer is shared for all words and the order of words does not influence the projection. * A log-linear classifier uses a window of words to predict the middle word. * $Q = N*D + D*\log_2 V$ ### Continuous Skip-gram Model * Similar to Continuous Bag-of-Words but uses the middle word of the window to predict the remaining words in the window. * Distant words are given less weight by sampling fewer distant words. * $Q = C*(D + D*\log_2 V)$ where *C* is the max distance of the word from the middle word. * Given *C* and the training data, a random *R* is chosen in the range *1 to C*. * For each training word, *R* words from the history (previous words) and *R* words from the future (next words) are marked as target outputs and the model is trained. ## Results * Skip-gram beats all other models on semantic accuracy tasks (eg - relating Athens with Greece). * The Continuous Bag-of-Words Model outperforms other models on syntactic accuracy tasks (eg great with greater) - with skip-gram just behind in performance. * The skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) on the Microsoft Research Sentence Completion Challenge. * The model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"). |
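The dynamic window idea (sample R uniformly from 1 to C for each position, then use the R previous and R next words as targets) can be sketched in a few lines; the toy corpus and tokenization are placeholders, and this covers only pair generation, not the actual training.

```
import random

def skipgram_pairs(tokens, C=5):
    """Yield (centre_word, context_word) pairs with a dynamic window R in [1, C]."""
    pairs = []
    for i, centre in enumerate(tokens):
        R = random.randint(1, C)              # sample the window size for this position
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

random.seed(0)
corpus = "the king spoke to the queen about the kingdom".split()
for centre, context in skipgram_pairs(corpus, C=2)[:8]:
    print(f"{centre:>8} -> {context}")
```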
[link]
## Introduction to elastic net * Regularization and variable selection method. * Sparse representation. * Exhibits a grouping effect. * Particularly useful when the number of predictors (*p*) >> the number of observations (*n*). * LARS-EN algorithm to compute the elastic net regularization path. ## Lasso * Least squares method with an L1-penalty on the regression coefficients. * Does continuous shrinkage and automatic variable selection. ### Limitations * If *p >> n*, lasso can select at most *n* variables. * In the case of a group of variables exhibiting high pairwise correlation, lasso doesn't care about which variable is selected. * If *n > p* and there is high correlation between predictors, ridge regression outperforms lasso. ## Naive elastic net * Least squares method. * The penalty on the regression coefficients is a convex combination of the lasso and ridge penalties. * *penalty = (1−α)\*|β| + α\*|β|<sup>2</sup>* where *β* refers to the coefficient vector. * *α = 0* => lasso penalty * *α = 1* => ridge penalty * Naive elastic net can be solved by transforming it to a lasso problem on augmented data. * Can be viewed as ridge-type shrinkage followed by lasso-type thresholding. ### Limitations * The two-stage procedure incurs a double amount of shrinkage and introduces extra bias without reducing variance. ## Bridge Regression * Generalization of lasso and ridge regression. * Cannot produce sparse solutions. ## Elastic net * Rescales the naive elastic net coefficients to undo the extra shrinkage. * Retains the good properties of the naive elastic net. ## Justification for scaling * The elastic net becomes minimax optimal. * Scaling reverses the shrinkage introduced by the ridge component. ## LARS-EN * Based on LARS (used to solve lasso). * The elastic net can be transformed to a lasso problem on augmented data, so pieces of the LARS algorithm can be reused. * Uses sparseness to save on computation. ## Conclusion The elastic net often outperforms the lasso. |
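The augmented-data reduction mentioned above (naive elastic net = lasso on augmented data) can be sketched as follows. The generic ISTA solver stands in for a lasso solver and is not the LARS-EN algorithm from the paper, and the scaling constants follow the usual statement of the augmentation lemma with penalties $\lambda_1|\beta|_1 + \lambda_2|\beta|_2^2$, so treat the details as an illustration to be checked against the paper.

```
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimise ||y - X b||^2 + lam * ||b||_1 with proximal gradient (ISTA)."""
    step = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X).max())
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(b - step * 2 * X.T @ (X @ b - y), step * lam)
    return b

def naive_elastic_net(X, y, lam1, lam2):
    n, p = X.shape
    # Augmented data: a lasso fit on (X_aug, y_aug) corresponds to the naive elastic net.
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
    y_aug = np.concatenate([y, np.zeros(p)])
    b_star = lasso_ista(X_aug, y_aug, lam1 / np.sqrt(1 + lam2))
    return b_star / np.sqrt(1 + lam2)         # undo the augmentation scaling

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta = np.array([3, 3, 3, 0, 0, 0, 0, 0, 0, 0], dtype=float)
y = X @ beta + rng.normal(scale=0.5, size=50)
print(np.round(naive_elastic_net(X, y, lam1=5.0, lam2=1.0), 2))
```

The (corrected) elastic net estimate would then rescale this naive solution to undo the double shrinkage discussed above.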
[link]
# TAO * Geographically distributed, read-optimized, graph data store. * Favors availability and efficiency over consistency. * Developed by and used within Facebook (social graph). * Link to [paper](https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf). ## Before TAO * Facebook's servers directly accessed MySQL to read/write the social graph. * Memcache was used as a look-aside cache. * This had several issues: * **Inefficient edge lists** - A key-value store is not a good design for storing a list of edges. * **Distributed Control Logic** - In a look-aside cache architecture, the control logic runs on the clients, which increases the number of failure modes. * **Expensive Read-After-Write Consistency** - Facebook used asynchronous master-slave replication for MySQL, which introduced a time lag before the latest data would reflect in the local replicas. ## TAO Data Model * **Objects** * Typed nodes (type is denoted by `otype`). * Identified by 64-bit integers. * Contain data in the form of key-value pairs. * Model users and repeatable actions (eg `comments`). * **Associations** * Typed directed edges between objects (type is denoted by `atype`). * Identified by the source object `id1`, `atype` and the destination object `id2`. * Contain data in the form of key-value pairs. * Contain a 32-bit time field. * Model actions that happen at most once or record state transitions (eg `like`). * Often the `inverse association` is also meaningful (eg `like` and `liked by`). ## Query API * Support to create, retrieve, update and delete objects and associations. * Support to get all associations (`assoc_get`) or their count (`assoc_count`) based on the starting node, time, index and limit parameters (a toy sketch of these operations appears at the end of this summary). ## TAO Architecture ### Storage Layer * Objects and associations are stored in MySQL. * The TAO API is mapped to SQL queries. * Data is divided into logical shards. * Objects are bound to their shard for their lifetime (`shard_id` is embedded in `id`). * Associations are stored on the shard of their `id1` (for faster association queries). ### Caching Layer * Consists of multiple cache servers (which together form a `tier`). * An in-memory LRU cache stores objects, association lists, and association counts. * A write operation on an association list with an inverse involves writing to 2 shards (for `id1` and `id2`). * The client sends the query to the cache layer, which issues an inverse write query to shard 2 and, once that is completed, a write query to shard 1. * A write failure leads to hanging associations, which are repaired by an asynchronous job. ### Leaders and Followers * A single, large tier is prone to hot spots and quadratic growth in terms of all-to-all connections. * The cache is split into 2 levels - one **leader tier** and multiple **follower tiers**. * Clients communicate only with the followers. * In the case of a read miss/write, followers forward the request to the leader, which connects to the storage layer. * Eventual consistency is maintained by serving cache maintenance messages from leaders to followers. * An object update in the leader results in an `invalidation message` to the followers. * The leader sends a `refill message` to notify followers about an association write. * Leaders also serialize concurrent writes and mediate thundering herds. ## Scaling Geographically * Since the workload is read-intensive, read misses are serviced locally at the expense of data freshness. * In the multi-region configuration, there are master-slave regions for each shard and each region has its own followers, leader, and database. 
* The database in the local region is a replica of the database in the master region. * In the case of a read miss, the leader always queries the region's database (irrespective of whether it is the master or slave database). * In the case of a write, the leader in the local region forwards the request to the database in the master region. ## Optimisations ### Caching Server * RAM is partitioned into `arenas` to extend the lifetime of important data types. * For small, fixed-size items (eg association counts), a direct 8-way associative cache is maintained to avoid the use of pointers. * Each `atype` is mapped to a 16-bit value to reduce the memory footprint. ### Cache Sharding and Hot Spots * Load is balanced among followers through `shard cloning` (reads to a shard are served by multiple followers in a tier). * The response to a query includes the object's access rate and version number. If the access rate is too high, the object is cached by the client itself. The next time the query comes, the data is omitted from the reply if it has not changed since the previous version. ### High Degree Objects * In the case of `assoc_count`, the edge direction is chosen on the basis of which node (source or destination) has the lower degree (to optimize reading the association list). * For an `assoc_get` query, only those associations are searched where time > the object's creation time. ## Failure Detection and Handling * Aggressive network timeouts to detect (potentially) failed nodes. ### Database Failure * In the case of a master failure, one of the slaves takes over automatically. * In the case of a slave failure, cache misses are redirected to the TAO leader in the region hosting the database master. ### Leader Failure * When a leader cache server fails, followers route read misses directly to the database and writes to a replacement leader (chosen randomly from the leader tier). ### Refill and Invalidation Failures * Refill and invalidation messages are sent asynchronously. * If a follower is unavailable, the message is stored on the leader's disk. * These messages will be lost in the case of a leader failure. * To maintain consistency, all the shards mapping to a failed leader are invalidated. ### Follower Failure * Each TAO client is configured with a primary and a backup follower tier. * In normal mode, requests are made to the primary tier and, in the case of its failure, requests go to the backup tier. * Read-after-write consistency may be violated when failing over between different tiers (the read reaches the failover target before the writer's `refill` or `invalidate`). |
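To make the objects-and-associations data model concrete, here is a toy, in-memory sketch with `assoc_add`/`assoc_get`/`assoc_count`-style operations keyed on `(id1, atype)` and ordered by time. The class, method signatures and storage layout are invented for illustration; they only mirror the ideas above and are not TAO's actual API.

```
from collections import defaultdict
import time

class ToyGraphStore:
    """Illustrative stand-in for an objects + associations store (not TAO's API)."""
    def __init__(self):
        self.objects = {}                    # id -> (otype, data dict)
        self.assocs = defaultdict(list)      # (id1, atype) -> [(time, id2, data)]

    def obj_add(self, oid, otype, **data):
        self.objects[oid] = (otype, data)

    def assoc_add(self, id1, atype, id2, **data):
        self.assocs[(id1, atype)].append((time.time(), id2, data))

    def assoc_count(self, id1, atype):
        return len(self.assocs[(id1, atype)])

    def assoc_get(self, id1, atype, limit=10):
        # Newest-first, matching the "recent items are hot" access pattern.
        records = sorted(self.assocs[(id1, atype)], key=lambda r: r[0], reverse=True)
        return records[:limit]

store = ToyGraphStore()
store.obj_add(1, "user", name="alice")
store.obj_add(2, "post", text="hello")
store.assoc_add(1, "likes", 2)
store.assoc_add(2, "liked_by", 1)            # the inverse association
print(store.assoc_count(1, "likes"), store.assoc_get(1, "likes"))
```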
[link]
The [Batch Normalization paper](http://arxiv.org/pdf/1502.03167.pdf) describes a method to address the various issues related to training of Deep Neural Networks. It makes normalization a part of the architecture itself and reports significant improvements in terms of the number of iterations required to train the network. ## Issues With Training Deep Neural Networks ### Internal Covariate shift Covariate shift refers to the change in the input distribution to a learning system. In the case of deep networks, the input to each layer is affected by the parameters in all the preceding layers. So even small changes to the network get amplified down the network. This leads to change in the input distribution to internal layers of the deep network and is known as internal covariate shift. It is well established that networks converge faster if the inputs have been whitened (ie zero mean, unit variances) and are uncorrelated, and internal covariate shift leads to just the opposite. ### Vanishing Gradient Saturating nonlinearities (like tanh or sigmoid) can not be used for deep networks as they tend to get stuck in the saturation region as the network grows deeper. Some ways around this are to use: * Nonlinearities like ReLU which do not saturate * Smaller learning rates * Careful initializations ## Normalization Let us say that the layer we want to normalize has *d* dimensions **x** $= (x_1, ... x_d)$. Then, we can normalize the $k^{th}$ dimension as follows: $\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$ We also need to scale and shift the normalized values otherwise just normalizing a layer would limit the layer in terms of what it can represent. For example, if we normalize the inputs to a sigmoid function, then the output would be bound to the linear region only. So the normalized input $\hat{x}^{(k)}$ is transformed to: $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$ where $\gamma$ and $\beta$ are parameters to be learned. Moreover, just like we use mini-batches in Stochastic Gradient Descent (SGD), we can use mini-batches with normalization to estimate the mean and variance for each activation. The transformation from $x$ to $y$ as described above is called the **Batch Normalizing Transform**. This BN transform is differentiable and ensures that as the model is training, the layers can learn on input distributions that exhibit less internal covariate shift and can hence accelerate the training (a small sketch of the transform appears at the end of this summary). At training time, a subset of activations is specified and the BN transform is applied to all of them. During test time, the normalization is done using the population statistics instead of mini-batch statistics to ensure that the output deterministically depends on the input. ## Batch Normalized Convolutional Networks Let us say that $x = g(Wu+b)$ is the operation performed by the layer where $W$ and $b$ are the parameters to be learned, $g$ is a nonlinearity and $u$ is the input from the previous layer. The BN transform is added just before the nonlinearity, by normalizing $x = Wu+b$. An alternative would have been to normalize $u$ itself but constraining just the first and the second moment would not eliminate the covariate shift from $u$. When normalizing $Wu+b$, we can ignore the $b$ term as it would be canceled during the normalization step (*b*'s role is subsumed by $\beta$) and we have $z = g(BN(Wu))$. For convolutional layers, normalization should follow the convolution property as well - ie different elements of the same feature map, at different locations, are normalized in the same way. 
So all the activations in a mini-batch are jointly normalized over all the locations, and the parameters (*γ* and *β*) are learnt per feature map instead of per activation. ## Advantages Of Batch Normalization 1. Reduces internal covariate shift. 2. Reduces the dependence of gradients on the scale of the parameters or their initial values. 3. Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques. 4. Allows use of saturating nonlinearities and higher learning rates. Batch Normalization was applied to models trained for MNIST and to the Inception Network for ImageNet. All the above-mentioned advantages were validated in the experiments. Interestingly, Batch Normalization with sigmoid achieved an accuracy of 69.8% (the overall best, using any nonlinearity, was 74.8%) while the Inception model (sigmoid nonlinearity), without Batch Normalisation, worked only as well as a random guess. ## Future Work While the BN Transform does enhance the overall deep network training task, its precise effect on gradient propagation is still not well understood. A future extension of Batch Normalisation would be in the domain of Recurrent Neural Networks where internal covariate shift and vanishing gradients are more severe. It also remains to be explored whether it can help with domain adaptation by easily generalizing to new data distributions.
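A compact numpy sketch of the batch normalizing transform for a fully connected layer, using mini-batch statistics at training time and running (population) estimates at test time. The running-average bookkeeping is a common implementation convenience rather than something prescribed by the paper, so treat this as a simplified illustration.

```
import numpy as np

class BatchNorm:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)        # learned scale (gamma)
        self.beta = np.zeros(dim)        # learned shift (beta)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)        # mini-batch statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var  # population estimates
        x_hat = (x - mean) / np.sqrt(var + self.eps)         # normalize
        return self.gamma * x_hat + self.beta                # scale and shift

bn = BatchNorm(dim=4)
batch = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(32, 4))
out = bn(batch, training=True)
print(out.mean(axis=0).round(3), out.var(axis=0).round(3))   # ~0 mean, ~1 variance
```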
|
[link]
The [paper](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) presents Twitter's logging infrastructure, how it evolved from application-specific logging to a unified logging infrastructure, and how session sequences are used as a common-case optimization for a large class of queries. ## Messaging Infrastructure Twitter uses **Scribe** as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Scribe itself uses **Zookeeper** to discover the hostname of the aggregator. Each aggregator registers itself with Zookeeper. The Scribe daemon consults Zookeeper to find a live aggregator to which it can send the data. Colocated with the aggregators is the staging Hadoop cluster which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and are deposited in per-category, per-hour directories (eg /logs/category/YYYY/MM/DD/HH). Within each directory, the messages are bundled in a small number of large files and are partially ordered by time. Twitter uses **Thrift** as its data serialization framework, as it supports nested structures, and was already being used elsewhere within Twitter. A system called **Elephant Bird** is used to generate Hadoop record readers and writers for arbitrary Thrift messages. Production jobs are written in **Pig(Latin)** and scheduled using **Oink**. ## Application Specific Logging Initially, all applications defined their own custom formats for logging messages. While this made it easy to develop application logging, it had many downsides as well. * Inconsistent naming conventions: eg uid vs userId vs user_Id * Inconsistent semantics associated with each category name, causing resource discovery problems. * Inconsistent format of log messages. All these issues make it difficult to reconstruct user session activity. ## Client Events This is an effort within Twitter to develop a unified logging framework to get rid of all the issues discussed previously. A hierarchical, 6-level schema is imposed on all the events (as described in the table below). | Component | Description | Example | |-----------|------------------------------------|----------------------------------------------| | client | client application | web, iPhone, android | | page | page or functional grouping | home, profile, who_to_follow | | section | tab or stream on a page | home, mentions, retweets, searches, suggestions | | component | component object or objects | search_box, tweet | | element | UI element within the component | button, avatar | | action | actual user or application action | impression, click, hover | **Table 1: Hierarchical decomposition of client event names.** For example, the event `web:home:mentions:stream:avatar:profile_click` is logged whenever there is a profile-image click on the avatar of a tweet in the mentions timeline for a user on twitter.com (read from right to left). The alternate design was a tree-based model for logging client events. That model allowed for an arbitrarily deep event namespace with as fine-grained logging as required. But the client events model was chosen to make the top-level aggregate queries easier. A client event is a Thrift structure that contains the components given in the table below. 
| Field | Description | |-----------------|---------------------------------| | event initiator | {client, server} × {user, app} | | event_name | event name | | user_id | user id | | session_id | session id | | ip | user’s IP address | | timestamp | timestamp | | event_details | event details | **Table 2: Definition of a client event.** The logging infrastructure is unified in two senses: * All log messages share a common format with clear semantics. * All log messages are stored in a single place. ## Session Sequences A session sequence is a sequence of symbols *S = {s<sub>0</sub>, s<sub>1</sub>, s<sub>2</sub>...s<sub>n</sub>}* such that each symbol is drawn from a finite alphabet *Σ*. A bijective mapping is defined between *Σ* and the universe of event names. Each symbol in *Σ* is represented by a valid Unicode code point (frequent events are assigned shorter code points) and each session sequence becomes a valid Unicode string (a toy sketch of this encoding appears at the end of this summary). Once all logs have been imported to the main database, a histogram of event counts is created and is used to map event names to Unicode code points. The counts and samples of each event type are stored in a known location in HDFS. Session sequences are reconstructed from the raw client event logs via a *group-by* on *user_id* and *session_id*. Session sequences are materialized because it is difficult to work with raw client event logs for the following reasons: * A lot of brute force scans. * Large group-by operations are needed to reconstruct user sessions. #### Alternate Designs Considered * Reorganize complete Thrift messages by reconstructing user sessions - This solves the second problem but not the first. * Use a columnar storage format - This addresses the first issue but it just reduces the time taken by the mappers and not the number of mappers itself. The materialized session sequences are much smaller than the raw client event logs (around 50 times smaller) and address both issues. ## Client Event Catalog To enhance the accessibility of the client event logs, an automatically generated event catalog is used along with a browsing interface that allows users to browse, search and access sample entries for the various client events. (These sample entries are the same entries that were mentioned in the previous section.) The catalog is rebuilt every day and is always up to date. ## Applications Client event logs and session sequences are used in the following applications: * Summary Statistics - Session sequences are used to compute various statistics about sessions. * Event Counting - Used to understand how many users take advantage of a particular feature. * Funnel Analytics - Used to focus on user attention in a multi-step process like the signup process. * User Modeling - Used to identify "interesting" user behavior. N-gram models (from the NLP domain) can be extended to measure how important temporal signals are by modeling user behavior on the basis of the last n actions. The paper also mentions the possibility of extracting "activity collocations" based on the notion of collocations. ## Possible Extensions Session sequences are limited in the sense that they capture only the event name and exclude other details. The solution adopted by Twitter is to use a generic indexing infrastructure that integrates with Hadoop at the level of InputFormats. The indexes reside with the data, making it easier to reindex the data. An alternative would have been to use **Trojan layouts**, which embed indexes in the HDFS block header, but this means that indexing would require the data to be rewritten. 
Another possible extension would be to leverage more analogies from the field of Natural Language Processing. This would include the use of automatic grammar induction techniques to learn hierarchical decomposition of user activity. Another area of exploration is around leveraging advanced visualization techniques for exploring sessions and mapping interesting behavioral patterns into distinct visual patterns that can be easily recognized. |
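The session-sequence encoding can be illustrated with a small sketch: count event names, assign lower Unicode code points to more frequent events, then turn each (user_id, session_id) group of events into a compact Unicode string ordered by timestamp. The log records, the starting code point and the in-process grouping (instead of a Hadoop group-by) are all illustrative choices, not Twitter's actual scheme.

```
from collections import Counter, defaultdict

# Hypothetical client event log: (user_id, session_id, timestamp, event_name).
log = [
    (1, "s1", 3, "web:home:mentions:stream:avatar:profile_click"),
    (1, "s1", 1, "web:home:home:stream:tweet:impression"),
    (1, "s1", 2, "web:home:home:stream:tweet:click"),
    (2, "s9", 1, "web:home:home:stream:tweet:impression"),
]

# Map event names to Unicode code points; more frequent events get lower code points.
counts = Counter(name for *_, name in log)
BASE = 0x4E00   # arbitrary starting point in a printable Unicode range
symbol = {name: chr(BASE + i)
          for i, (name, _) in enumerate(counts.most_common())}

# "Group by" (user_id, session_id), order by timestamp, emit one string per session.
sessions = defaultdict(list)
for user, session, ts, name in log:
    sessions[(user, session)].append((ts, name))

session_sequences = {
    key: "".join(symbol[name] for _, name in sorted(events))
    for key, events in sessions.items()
}
print(session_sequences)
```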
[link]
The [paper](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks. ### 1. Learning = Representation + Evaluation + Optimization All machine learning algorithms have three components: * **Representation** for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the *hypothesis space*. If a function is not in the hypothesis space, it can not be learnt. * **Evaluation** function tells how good the machine learning model is. * **Optimisation** is the method used to search for the best-scoring model. ### 2. It's Generalization That Counts The fundamental goal of machine learning is to generalize beyond the training set. The data used to evaluate the model must be kept separate from the data used to learn the model. When we use generalization as a goal, we do not have access to a function that we can optimize. So we have to use training error as a proxy for test error. ### 3. Data Alone Is Not Enough Since our ultimate goal is generalization (see point 2), there is no such thing as **"enough"** data. Some knowledge beyond the data is needed to generalize beyond the data. Another way to put it is "No learner can beat random guessing over all possible functions." But instead of hard-coding assumptions, learners should allow assumptions to be explicitly stated, varied and incorporated automatically into the model. ### 4. Overfitting Has Many Faces One way to interpret overfitting is to break down generalization error into two components: bias and variance. **Bias** is the tendency of the learner to constantly learn the same wrong thing (in the image, a high bias would mean more distance from the centre). **Variance** is the tendency to learn random things irrespective of the signal (in the image, a high variance would mean more scattered points). ![Bias Variance Diagram](https://dl.dropboxusercontent.com/u/56860240/A-Paper-A-Week/BiasVarianceDiagram.png) A more powerful learner (one that can learn many models) need not be better than a less powerful one as it can have a high variance. While noise is not the only reason for overfitting, it can indeed aggravate the problem. Some tools against overfitting are **cross-validation**, **regularization**, **statistical significance testing**, etc. (a small cross-validation example appears at the end of this summary). ### 5. Intuition Fails In High Dimensions Generalizing correctly becomes exponentially harder as the dimensionality (number of features) becomes large. Machine learning algorithms depend on similarity-based reasoning which breaks down in high dimensions, as a fixed-size training set covers only a small fraction of the large input space. Moreover, our intuitions from three-dimensional space often do not apply to higher-dimensional spaces. So the **curse of dimensionality** may outweigh the benefits of having more features. Though, in most cases, learners benefit from the **blessing of non-uniformity** as data points are concentrated in lower-dimensional manifolds. Learners can implicitly take advantage of this lower effective dimension or use dimensionality reduction techniques. ### 6. Theoretical Guarantees Are Not What They Seem A common type of bound when dealing with machine learning algorithms relates to the number of samples needed to ensure good generalization. But these bounds are very loose in nature. 
Moreover, the bound says that, given a large enough training dataset, our learner would with high probability either return a good hypothesis or fail to find a consistent hypothesis. It does not tell us anything about how to select a good hypothesis space. Another common type of bound is the asymptotic bound which says "given infinite data, the learner is guaranteed to output the correct classifier". But in practice we never have infinite data, and data alone is not enough (see point 3). So theoretical guarantees should be used to understand and drive algorithm design and not as the sole criterion for selecting an algorithm. ### 7. Feature Engineering Is The Key Machine learning is an iterative process where we train the learner, analyze the results, modify the learner/data and repeat. Feature engineering is a crucial step in this pipeline. Having the right kind of features (independent features that correlate well with the class) makes learning easier. But feature engineering is also difficult because it requires domain-specific knowledge which extends beyond just the data at hand (see point 3). ### 8. More Data Beats A Clever Algorithm As a rule of thumb, a dumb algorithm with lots of data beats a clever algorithm with a modest amount of data. But more data means more scalability issues. Fixed-size learners (parametric learners) can take advantage of data only to an extent, beyond which adding more data does not improve the results. Variable-size learners (non-parametric learners) can, in theory, learn any function given a sufficient amount of data. Of course, even non-parametric learners are bound by the limitations of memory and computational power. ### 9. Learn Many Models, Not Just One In the early days of machine learning, the model/learner to be trained was pre-determined and the focus was on tuning it for optimal performance. Then the focus shifted to trying many variants of different learners. Now the focus is on combining the various variants of different algorithms to generate the most optimal results. Such model ensembling techniques include *bagging*, *boosting* and *stacking*. ### 10. Simplicity Does Not Imply Accuracy Though Occam's razor suggests that machine learning models should be kept simple, there is no necessary connection between the number of parameters of a model and its tendency to overfit. The complexity of a model can be related to the size of the hypothesis space, as smaller spaces allow the hypotheses to be generated by smaller, simpler codes. But there is another side to this picture - a learner with a larger hypothesis space that tries fewer hypotheses is less likely to overfit than one that tries more hypotheses from a smaller space. So hypothesis space size is just a rough guide towards accuracy. Domingos concludes in his [other paper](http://homes.cs.washington.edu/~pedrod/papers/dmkd99.pdf) that "simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy." ### 11. Representable Does Not Imply Learnable Just because a function can be represented does not mean that it can actually be learnt. Restrictions imposed by data, time and memory limit the functions that can actually be learnt in a feasible manner. For example, decision tree learners can not learn trees with more leaves than the number of training data points. The right question to ask is "whether a function can be learnt" and not "whether a function can be represented". ### 12. 
Correlation Does Not Imply Causation Correlation may hint towards a possible cause-and-effect relationship but that needs to be investigated and validated. On the face of it, correlation cannot be taken as proof of causation. |
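Cross-validation, mentioned in point 4 as one of the basic tools against overfitting, is a one-liner in common libraries; the synthetic dataset and the decision-tree models below are arbitrary choices used only to contrast training accuracy with cross-validated accuracy for a more and a less flexible learner.

```
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

for depth in [None, 3]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = model.fit(X, y).score(X, y)              # accuracy on the training data
    cv_acc = cross_val_score(model, X, y, cv=5).mean()   # 5-fold cross-validated accuracy
    print(f"max_depth={depth}: train accuracy={train_acc:.2f}, "
          f"cross-validated accuracy={cv_acc:.2f}")
```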
[link]
[Hive](http://infolab.stanford.edu/~ragho/hive-icde2010.pdf) is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-like query language called HiveQL. These queries are compiled into MapReduce jobs that are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the effort that goes into writing and maintaining MapReduce jobs. # Data Model Hive supports database concepts like tables, columns, rows and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported. Moreover, these types can be composed to support structures of arbitrary complexity. Tables are serialized/deserialized using default serializers/deserializers. Any new data format and type can be supported by implementing the SerDe and ObjectInspector Java interfaces. # Query Language The Hive query language (HiveQL) consists of a subset of SQL along with some extensions. The language is very SQL-like and supports features like subqueries, joins, cartesian products, group by, aggregation, describe and more. MapReduce programs can also be used in Hive queries. A sample query using MapReduce would look like this: ``` FROM ( MAP inputdata USING 'python mapper.py' AS (word, count) FROM inputtable CLUSTER BY word ) REDUCE word, count USING 'python reduce.py'; ``` This query uses `mapper.py` for transforming `inputdata` into `(word, count)` pairs, distributes data to the reducers by hashing on the `word` column (given by `CLUSTER`) and aggregates using `reduce.py`. Notice that Hive allows the order of `FROM` and `SELECT/MAP/REDUCE` to be changed within a sub-query. This allows insertion of different transformation results into different tables/partitions/hdfs/local directories as part of the same query and reduces the number of scans on the input data. ### Limitations * Only equi-joins are supported. * Data cannot be inserted into existing tables/partitions and all inserts overwrite the data. `INSERT INTO`, `UPDATE`, and `DELETE` are not supported, which makes it easier to handle reader and writer concurrency. # Data Storage While a table is the logical data unit in Hive, the data is actually stored in hdfs directories. A **table** is stored as a directory in hdfs, a **partition** of a table as a subdirectory within that directory and a **bucket** as a file within the table/partition directory. Partitions can be created either when creating tables or by using `INSERT/ALTER` statements. The partitioning information is used to prune data when running queries. For example, suppose we create a partition for `day=monday` using the query ``` ALTER TABLE dummytable ADD PARTITION (day='monday') ``` Next, we run the query - ``` SELECT * FROM dummytable WHERE day='monday' ``` Suppose the data in dummytable is stored in the `/user/hive/data/dummytable` directory. This query will only scan the subdirectory `/user/hive/data/dummytable/day=monday` within the `dummytable` directory. A **bucket** is a file within the leaf directory of a table or a partition. It can be used to prune data when the user runs a `SAMPLE` query. Any data stored in hdfs can be queried using the `EXTERNAL TABLE` clause by specifying its location with the `LOCATION` clause. When dropping an external table, only its metadata is deleted and not the data itself. # Serialization/Deserialization Hive implements `LazySerDe` as the default SerDe. It deserializes rows into internal objects lazily so that the cost of deserialization of a column is incurred only when it is needed. 
Hive also provides a `RegexSerDe` which allows the use of regular expressions to parse columns out from a row. Hive also supports various formats like `TextInputFormat`, `SequenceFileInputFormat` and `RCFileInputFormat`. Other formats can also be implemented and specified in the query itself. For example, ``` CREATE TABLE dummytable(key INT, value STRING) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat' ``` # System Architecture and Components ### Components * **Metastore** - Stores the system catalog and metadata about tables, columns and partitions. * **Driver** - Manages the life cycle of a HiveQL statement as it moves through Hive. * **Query Compiler** - Compiles the query into a directed acyclic graph of MapReduce tasks. * **Execution Engine** - Executes the tasks produced by the compiler in proper dependency order. * **Hive Server** - Provides a Thrift interface and a JDBC/ODBC server. * **Client components** - CLI, web UI, and JDBC/ODBC driver. * **Extensibility Interfaces** - Interfaces for SerDe, ObjectInspector, UDF (User-Defined Function) and UDAF (User-Defined Aggregate Function). ### Life Cycle of a query The query is submitted via the CLI/web UI/any other interface. It goes to the compiler and undergoes parse, type-check and semantic analysis phases using the metadata from the Metastore. The compiler generates a logical plan which is optimized by the rule-based optimizer, and an optimized plan in the form of a DAG of MapReduce and hdfs tasks is generated. The execution engine then executes these tasks in the correct order using Hadoop. ### Metastore It stores all information about the tables, their partitions, schemas, columns and their types, etc. The Metastore runs on a traditional RDBMS (so that the latency for metadata queries is very small) and uses an open source ORM layer called DataNucleus. The Metastore is backed up regularly. To make sure that the system scales with the number of queries, no metadata queries are made from the mappers/reducers of a job. Any metadata needed by the mapper or the reducer is passed through XML plan files that are generated by the compiler. ### Query Compiler The Hive query compiler works similarly to traditional database compilers. * Antlr is used to generate the Abstract Syntax Tree (AST) of the query. * A logical plan is created using information from the Metastore. An intermediate representation called the query block (QB) tree is used when transforming the AST to an operator DAG. Nested queries define the parent-child relationship in the QB tree. * The optimization logic consists of a chain of transformation operations such that the output from one operation is the input to the next operation. Each transformation comprises a **walk** on the operator DAG. Each visited **node** in the DAG is tested against different **rules**. If any rule is satisfied, its corresponding **processor** is invoked. The **Dispatcher** maintains a mapping of the different rules and their processors and does rule matching. The **GraphWalker** manages the overall traversal process. * The logical plan generated in the previous step is split into multiple MapReduce and hdfs tasks. Nodes in the plan correspond to physical operators and edges represent the flow of data between operators. ### Optimisations * **Column Pruning** - Only the columns needed in the query processing are projected. * **Predicate Pushdown** - Predicates are pushed down to the scan so that rows are filtered as early as possible. 
* **Partition Pruning** - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate. * **Map-Side Joins** - In case the tables involved in the join are very small, the small tables are replicated in all the mappers and the join is performed in the map phase. * **Join Reordering** - Larger tables are streamed and not materialized in-memory in the reducer to reduce memory requirements. Some optimizations are not enabled by default but can be activated by setting certain flags. These include: * Repartitioning data to handle skew in `GROUP BY` processing. This is achieved by performing `GROUP BY` in two MapReduce stages - in the first, data is distributed randomly to the reducers and partial aggregations are performed. In the second stage, these partial aggregations are distributed on the `GROUP BY` columns to different reducers. * Hash-based partial aggregations in the mappers to reduce the data that is sent by the mappers to the reducers, which helps reduce the time spent in sorting and merging the resulting data. ### Execution Engine The execution engine executes the tasks in order of their dependencies. A MapReduce task first serializes its part of the plan into a plan.xml file. This file is then added to the job cache and mappers and reducers are spawned to execute the relevant sections of the operator DAG. The final results are stored in a temporary location and then moved to the final destination (in the case of, say, an `INSERT INTO` query). # Future Work The paper mentions the following areas for improvement: * HiveQL should subsume SQL syntax. * Implementing a cost-based optimizer and using adaptive optimization techniques. * Columnar storage to improve scan performance. * Enhancing JDBC/ODBC drivers for Hive to integrate with other BI tools. * Multi-query optimization and generic n-way joins in a single MapReduce job. |
[link]
## Introduction A variable $x$ is said to obey a power law if it is drawn from a probability distribution function (pdf) of the form $p(x) = Cx^{-\alpha}$ where $C$ is called the **normalization constant** and $\alpha$ is called the **scaling parameter** or exponent. Often, the power law applies only for values greater than some minimum $x$, called $x_\text{min}$. The paper describes various statistical techniques to test whether a given distribution follows a power law or not. Power-law distributions come in both continuous and discrete flavors, with the discrete case being more involved than the continuous one. So the discrete power-law behavior is often approximated by continuous power-law behavior for the sake of convenience. One reliable approximation is to assume that the discrete values of $x$ are generated from a continuous power law and then rounded to the nearest integer. Sometimes the complementary cumulative distribution function (CDF), $P(x) = \Pr(X \ge x)$, is considered instead. ## Fitting power laws to empirical data A power-law distribution makes a straight line on a log-log plot, and the slope can be calculated using least-squares linear regression. But simple line fitting does not guarantee that the data follows a power-law distribution. Moreover, the assumption of independent, Gaussian noise, which is a prerequisite for linear regression, does not hold in this case. ### Estimating the scaling parameter Assuming that we know the value of $x_\text{min}$, the value of $\alpha$ can be obtained by the *method of maximum likelihood*. The maximum likelihood estimator (MLE) for the continuous case is $\hat{\alpha} = 1 + n\left[\sum_{i=1}^{n} \ln \frac{x_i}{x_\text{min}}\right]^{-1}$ and for the discrete case it is approximately $\hat{\alpha} \simeq 1 + n\left[\sum_{i=1}^{n} \ln \frac{x_i}{x_\text{min} - \frac{1}{2}}\right]^{-1}$. The equation for the discrete case is only an approximation, as there is no exact closed-form MLE for the discrete case. The MLE method outperforms several linear-regression based approaches like line fitting on the log-log plot, line fitting after performing logarithmic binning (done to reduce fluctuations in the tail of the distribution), line fitting to the CDF with constant-size bins and line fitting to the CDF without any bins. For any finite sample size $n$ and any choice of $x_\text{min}$ there is a bias present, but it decays as $O(1/n)$ and can be ignored for $n \ge 50$. ### Estimating $x_\text{min}$ If we choose a value of $x_\text{min}$ smaller than the true value, we will get a biased estimate of $\alpha$, as we will be fitting a power law to the non-power-law region as well. If we choose a value larger than the true value, we will be losing legitimate data points (leading to larger statistical errors). It is therefore more acceptable to over-estimate $x_\text{min}$ than to under-estimate it. One approach is to plot the PDF or CDF on a log-log plot and mark the point beyond which the distribution becomes roughly straight, or to plot $\alpha$ as a function of $x_\text{min}$ and mark the point beyond which the value appears relatively stable. But these approaches are not objective, as "roughly straight" and "relatively stable" are not quantified. The approach proposed by Clauset et al. [Clauset, A., M. Young, and K. S. Gleditsch, 2007, Journal of Conflict Resolution 51, 58] is as follows: choose the value of $x_\text{min}$ such that the probability distribution of the measured data and the best-fit power-law model are as similar as possible. 
Similarity between distributions can be measured using the Kolmogorov-Smirnov (KS) statistic, which is defined as: ![KS statistic](https://github.com/shagunsodhani/powerlaw/raw/master/paper/assets/KSStatistics.png) where $S(x)$ is the CDF of the given data with values greater than or equal to $x_\text{min}$ and $P(x)$ is the CDF of the power-law model that best fits the data in the region $x ≥ x_\text{min}$. This method works well for both continuous and discrete data and is recommended for the general case. Other measures, like the weighted KS or Kuiper statistics, can also be used in place of the KS statistic. ## Testing the power-law hypothesis MLE and other approaches do not tell us whether a power-law is a plausible fit to the given data - all they do is find the best-fit values of $x_\text{min}$ and $\alpha$ assuming the data comes from a power-law distribution. A basic approach would be to calculate the values of $x_\text{min}$ and $\alpha$ and use them to hypothesize a power-law distribution from which the data is drawn. We then check the validity of this hypothesis using goodness-of-fit tests. ### Goodness-of-fit tests A large number of synthetic data sets are generated from the hypothesized power-law distribution. Each of these synthetic datasets is then fitted to its own power-law model individually, and the KS statistic is calculated for each of them. The *p-* value is defined to be the fraction of synthetic datasets where the distance (KS statistic value) is greater than the distance for the given dataset. A large value of *p* (close to 1) means that the difference between the given data and the hypothesized model could be due to statistical fluctuations alone, while a small value of *p* (close to 0) means that the model is not a plausible fit to the distribution. ### Dataset generation The generated dataset needs to be such that it has a distribution similar to the given data below $x_\text{min}$ and follows the fitted power-law above $x_\text{min}$. Suppose the given data has *n<sub>tail</sub>* observations (where $x ≥ x_\text{min}$) and *n* observations in total. With a probability of $n_\text{tail}/n$, a random number $x_i$ is generated from the hypothesized power-law distribution. With a probability of $1 - n_\text{tail}/n$, a number is picked randomly from the given dataset with $x < x_\text{min}$. This way, the generated dataset of $n$ elements is expected to follow the power-law above $x_\text{min}$ and the same distribution as the given data below $x_\text{min}$. If we want the *p-* values to be accurate to within about *ε* of the true value, then we should generate at least *1/4 ε<sup>-2</sup>* synthetic data sets. The power law is ruled out if *p ≤ 0.1*. A large *p-* value does not mean that the power-law is the correct distribution for the data. There can be other distributions that fit the data equally well or even better. Moreover, for small values of $n$, it is possible that the given distribution will follow a power law closely, and hence that the p-value will be large, even when the power law is the wrong model for the data. ## Alternate distributions The *p-* value test can only be used to reject the power-law hypothesis and not accept it. So even if *p-value > 0.1*, we can only say that the power-law hypothesis is not rejected. It could be the case that some other distribution fits the data equally well or even better. To eliminate this possibility, we calculate a *p-* value for a fit to the alternate distribution and compare it with the *p-* value for the power-law.
If the *p-* value for power-law is high and the *p-* value for the other distribution is low, we can say that data is more likely to be drawn from the power-law distribution (though we still can not be sure that it is **definitely** drawn from the power-law distribution). ### Likelihood Ratio Test This test can be used to directly compare two distributions against one another to see which is a better fit for the given data. The idea is to compute the likelihood of the given data under the two competing distributions. The one with the higher likelihood is taken to be the better fit. Alternatively, the ratio of the two likelihoods, or the logarithm *R* of the ratio can be used. If *R* is close enough to zero, then it could go to either side of zero, depending on statistical fluctuations. So *R* value needs to be sufficiently far from zero. To check for this, Vuong's method [Vuong, Q. H., 1989, Econometrica 57, 307] is used which gives a *p-* value that can tell if the conclusion from the value of *R* is statistically significant. If this *p-* value is small (*p < 0.1*), the result is significant. Otherwise, the result is not taken to be reliable and the test does not favor either distribution. Other than the likelihood ratio, several other tests like minimum description length (MDL) or cross-validation can also be performed. |
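As a concrete illustration of the fitting procedure described in this summary, here is a small Python sketch (my own illustration, not code from the paper): it computes the continuous MLE for $\alpha$ and chooses $x_\text{min}$ by minimizing the KS distance over candidate values, as Clauset et al. suggest.

```python
import math

def fit_alpha(data, xmin):
    """Continuous MLE: alpha = 1 + n / sum(ln(x / xmin)) over the tail x >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def ks_distance(data, xmin, alpha):
    """Max gap between the empirical CDF of the tail and the fitted power-law CDF."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    gaps = (abs((i + 1) / n - (1.0 - (x / xmin) ** (1.0 - alpha)))
            for i, x in enumerate(tail))
    return max(gaps)

def fit_power_law(data):
    """Try each observed value as xmin and keep the one with the smallest KS distance."""
    best = None
    for xmin in sorted(set(data))[:-1]:
        alpha = fit_alpha(data, xmin)
        d = ks_distance(data, xmin, alpha)
        if best is None or d < best[0]:
            best = (d, xmin, alpha)
    return best   # (KS distance, xmin, alpha)
```

In practice one would reach for a dedicated library (for example the `powerlaw` Python package), but the sketch shows the two moving parts: the MLE formula applied to the tail and the KS-based choice of $x_\text{min}$.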
[link]
The [Pregel paper](https://kowshik.github.io/JPregel/pregel_paper.pdf) introduces a vertex-centric, large-scale graph computational model. Interestingly, the name Pregel comes from the name of the river which the Seven Bridges of Königsberg spanned. ### Computational Model The system takes as input a directed graph with properties assigned to both vertices and edges. The computation consists of a sequence of iterations, called supersteps. In each superstep, a user-defined function is invoked on each vertex in parallel. This function essentially implements the algorithm by specifying the behaviour of a single vertex V during a single superstep S. The function can read messages sent to the vertex V during the previous superstep (S-1), change the state of the vertex or its out-going edges, mutate the graph topology by adding/removing vertices or edges, and send messages to other vertices that will be received in the next superstep (S+1). Since all computation during a superstep is performed locally, the model is well suited for distributed computing and synchronization is needed only between supersteps. The computation terminates when every vertex is in the deactivated state. When computation starts, all vertices are in the active state. A vertex deactivates itself by voting to halt and, once deactivated, it does not take part in subsequent supersteps. But any time a deactivated vertex receives a message, it becomes active again and takes part in subsequent supersteps. The resulting state machine is shown below: ![Vertex State Machine](https://db.tt/Pdm1Rl6A) The output of the computation is the set of values produced by the vertices. Pregel adopts a pure message passing model that eliminates the need for shared memory and remote reads. Messages can be delivered asynchronously, thereby reducing the latency. Graph algorithms can also be expressed as a sequence of MapReduce jobs, but that requires passing the entire state of the graph from one stage to another. It also requires coordinating the various steps of chained MapReduce. In contrast, Pregel keeps vertices and out-going edges on the machine performing the computation, and only messages are transferred across machines. Though Pregel is similar in concept to MapReduce, it comes with a natural graph API and efficient support for running iterative algorithms over the graph. ### API 1. Pregel programs are written by subclassing the vertex class. 2. The programmer overrides the compute() method. This is the user-defined function that is invoked on each vertex (a minimal sketch of such a compute() method appears at the end of this summary). 3. Within compute(), the programmer can modify the value of the vertex or its out-going edges' properties and these changes are reflected immediately. These are the only per-vertex state that persists across supersteps. 4. All messages sent to vertex V during superstep S are available via an iterator in superstep S+1. While it is guaranteed that each message would be delivered exactly once, there is no guarantee on the order of delivery. 5. User defined handlers are invoked when a destination vertex does not exist. 6. Combiners are used to reduce the number of messages to be transferred by aggregating messages from different vertices (all of which are on the same machine) which have the same destination vertex. Since there is no guarantee about which messages will be aggregated and in what order, only commutative and associative operations can be used. 7.
Aggregators can be used for capturing the global state of the graph where each vertex provides a value to the aggregator during each superstep and these values are combined by the system using a reduce operation (which should be associative and commutative). The aggregated value is then available to all the vertices in the next superstep. ### Topology Mutation Since the compute() method allows the graph topology to be modified, conflicting requests can be made in the same superstep. Two mechanisms are used to handle the conflicts: * Within a superstep, the following order is used when resolving conflicting operations: * Edge removal * Vertex removal * Vertex addition * Edge addition. * User defined handlers are used for conflicts that can not be resolved by partial ordering alone. This would include scenarios where there are multiple requests to create a vertex with different initial values. In these cases, the user defines how to resolve the conflict. The coordination mechanism is lazy in the sense that global mutations do not require coordination until the point they are applied. ### Execution Steps * Several instances of the user program start executing on a cluster of machines. One of the instances becomes the master and the rest are the workers. * The master partitions the graph based on vertex id, with each partition consisting of a set of vertices and their out-going edges. * These partitions are assigned to workers. Each worker executes the compute method on all its vertices and manages messages to and from other vertices. * The master instructs each worker to perform a superstep. The workers run the compute method on each vertex and tell the master how many vertices will be active in the next superstep. This continues as long as even one vertex is active. * Once all the vertices become deactivated, the master may ask the workers to save their portion of the graph. ### Fault Tolerance Fault tolerance is achieved through checkpointing where the master instructs the workers to save the state of computation to persistent storage. The master issues regular "ping" messages to workers and if a worker does not receive a message from the master in a specified time interval, the worker terminates itself. If the master does not hear back from a worker, the worker is marked as failed. In this case, the graph partitions assigned to the failed worker are reassigned to the active workers. All the active workers then load the computation state from the last checkpoint and may repeat some supersteps. An alternative to this would be confined recovery where, along with basic checkpointing, the workers log out-going messages from their assigned partitions during graph loading and subsequent supersteps. This way, lost partitions can be recomputed from the logged messages and the entire system does not have to perform a rollback. ### Worker Implementation A worker contains a mapping of vertex id to vertex state for its portion of the complete graph. The vertex state comprises its current value, its out-going edges, the queue containing incoming messages and a flag marking whether it is in the active state. Two copies of the queue and flag are maintained, one for the current superstep and one for the next superstep. While sending a message to another vertex, the worker checks if the destination vertex is on the same machine. If yes, it places the message directly on the receiver's queue instead of sending it via the network.
In case the vertex lies on a remote machine, the messages are buffered and sent to the destination worker as a single network message. If a combiner is specified, it is applied both to the messages being added to the outgoing message queue and to the messages received at the incoming message queue. ### Master Implementation The master coordinates the workers by maintaining a list of currently alive workers, their addressing information and the information on the portion of the graph assigned to them. The size of the master's data structures depends on the number of partitions, so a single master can be used for a very large graph. The master sends the computation task to workers and waits for a response. If any worker fails, the master enters recovery mode as discussed in the section on fault tolerance. Otherwise, it proceeds to the next superstep. It also runs an internal HTTP server to serve statistics about the graph and the state of the computation. ### Aggregators Workers combine all the values supplied to an aggregator, by all the vertices in a superstep, into a single local value. At the end of the superstep, the workers perform a tree-based reduction on the local values and deliver the global value to the master. The tree-based reduction is better than pipelining with a chain of workers as it allows for more parallelization. ### Applications/Experiments The paper has described how to implement the PageRank, ShortestPath, Bipartite Matching and Semi-Clustering algorithms on top of Pregel. The emphasis is on showing how to think of these algorithms in a vertex-centric manner and not on how to implement them on Pregel in the best possible way. The experiments were conducted with the single-source shortest paths algorithm with binary trees and log-normal graphs as input. The default partitioning strategy and naive implementations of the algorithms were used to show that satisfactory performance can be achieved with little coding effort. The runtime increases approximately linearly in the graph size. ### Limitations One obvious limitation is that the entire computation state resides in main memory. Secondly, Pregel is designed around sparse graphs and performance will take a hit in the case of dense graphs where a lot of communication takes place between vertices. The paper counters this by arguing that realistic dense graphs and algorithms with dense computation are rare. Moreover, communication in such dense networks can be reduced by using aggregators and combiners. An add-on would be to support dynamic partitioning of the graph based on message traffic to minimize communication over the network. Pregel's open-source implementation, called [Giraph](http://giraph.apache.org/), adds several features beyond the basic Pregel model, including out-of-core computation and edge-oriented input, which take away some of the original limitations. Facebook is using Giraph to analyze its social network and has [scaled it to a trillion edges](https://code.facebook.com/posts/509727595776839/scaling-apache-giraph-to-a-trillion-edges/), showing the scalability of the Pregel model itself. |
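To give a flavour of the vertex-centric compute() API described above, here is a minimal single-source shortest paths sketch in Python (my own toy pseudo-framework, not Google's C++ API). Each vertex keeps its best-known distance, relaxes it using incoming messages, sends improved distances along its out-going edges, and then votes to halt; an incoming message in a later superstep reactivates it.

```python
INF = float("inf")

class ShortestPathVertex:
    def __init__(self, vertex_id, out_edges, is_source=False):
        self.id = vertex_id
        self.out_edges = out_edges   # list of (target_id, edge_weight)
        self.is_source = is_source
        self.value = INF             # best distance from the source seen so far
        self.active = True

    def compute(self, superstep, messages, send):
        # Candidate distance: 0 for the source in superstep 0,
        # otherwise the smallest distance received from neighbours.
        candidate = 0 if (superstep == 0 and self.is_source) else INF
        if messages:
            candidate = min(candidate, min(messages))
        if candidate < self.value:
            self.value = candidate
            for target, weight in self.out_edges:
                send(target, candidate + weight)   # delivered in superstep S+1
        self.active = False   # vote to halt; a new message reactivates the vertex
```

The surrounding superstep loop (message delivery, reactivation, and termination when no vertex is active) is omitted; that is the part the framework provides.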
[link]
This week I read about [GraphX](https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf), a distributed graph computation framework that unifies graph-parallel and data-parallel computation. Graph-parallel systems efficiently express iterative algorithms (by exploiting the static graph structure) but do not perform well on operations that require a more general view of the graph, like operations that move data out of the graph. Data-parallel systems perform well on such tasks but directly implementing graph algorithms on data-parallel systems is inefficient due to complex joins and excessive data movement. This is the gap that GraphX fills by allowing the same data to be viewed and operated upon both as a graph and as a table. ### Preliminaries Let $G = (V, E)$ be a graph where $V = \{1, ..., n\}$ is the set of vertices and $E$ is the set of $m$ directed edges. Each directed edge is a tuple of the form $(i, j) \in E$ where $i \in V$ is the source vertex and $j \in V$ is the target vertex. The vertex properties are represented as $P_V(i)$ where $i \in V$ and edge properties as $P_E (i, j)$ for edge $(i, j) \in E$. The collection of all the properties is $P = (P_V, P_E)$. The combination of graph structure and properties defines a property graph $G(P) = (V, E, P)$. Graph-parallel systems consist of a property graph $G = (V, E, P)$ and a vertex-program $Q$ that is instantiated simultaneously on all the vertices. The execution on vertex $v$, called $Q(v)$, interacts with executions on the adjacent vertices by message passing or shared state and can read/modify properties on the vertex, edges and adjacent vertices. $Q$ can run in two different modes: * **bulk-synchronous mode** - all vertex programs run concurrently in a sequence of super-steps. * **asynchronous mode** - vertex programs run as and when resources are available and impose constraints on whether neighbouring vertex-programs can run concurrently. The **Gather-Apply-Scatter (GAS)** decomposition model breaks down a vertex-program into purely edge-parallel and vertex-parallel stages. The associative *gather* function collects the inbound messages on a vertex, the *apply* function operates only on the vertex and updates its value, and the *scatter* function computes the message to be sent along each edge and can be safely executed in parallel. GraphX uses the bulk-synchronous model and adopts the GAS decomposition model (see the sketch below). ### GraphX Data Model The GraphX Data Model consists of immutable collections and property graphs. Collections consist of unordered tuples (key-value pairs) and are used to represent unstructured data. The property graph combines the structural information (in the form of collections of vertices and edges) with the properties describing this structure. Properties are just collections of the form $(i, P_V (i))$ and $((i, j), P_E (i, j))$. The collections of vertices and edges are represented using RDDs (Resilient Distributed Datasets). Edges can be partitioned as per a user-defined function. Within a partition, edges are clustered by source vertex id and there is an unclustered index on the target vertex id. The vertices are hash-partitioned by id and stored in a hash index within a partition. Each vertex partition contains a bitmask which allows for set intersection and filtering. It also contains a routing table that logically maps a vertex id to the set of edge partitions containing the adjacent edges. This table is used when constructing triplets and is stored as a compressed bitmap.
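Here is a minimal Python sketch (my own illustration, not GraphX code) of one GAS-style iteration for a PageRank-like computation on a toy graph: scatter emits a contribution along every out-edge, gather sums the inbound contributions (an associative operation), and apply updates each vertex value.

```python
# Toy graph: vertex -> list of out-neighbours
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {v: 1.0 for v in edges}

def gas_iteration(edges, rank, damping=0.85):
    # Scatter: each vertex sends rank / out_degree along every out-edge.
    inbox = {v: [] for v in edges}
    for src, targets in edges.items():
        contribution = rank[src] / len(targets)
        for dst in targets:
            inbox[dst].append(contribution)
    # Gather (associative sum) + Apply (update each vertex value).
    return {v: (1 - damping) + damping * sum(msgs) for v, msgs in inbox.items()}

for _ in range(10):
    rank = gas_iteration(edges, rank)
print(rank)
```

In GraphX the same pattern is expressed with the join and `mrTriplets` operators described in the next section, rather than with explicit per-vertex message lists.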
### Operators Other than standard data-parallel operators like `filter`, `map`, `leftJoin`, and `reduceByKey`, GraphX supports the following graph-parallel operators: * `graph` - constructs a property graph given a collection of edges and vertices. * `vertices`, `edges` - decompose the graph into a collection of vertices or edges by extracting the vertex or edge RDDs. * `mapV`, `mapE` - transform the vertex or edge collection. * `triplets` - returns a collection of the form $((i, j), (P_V (i), P_E (i, j), P_V (j)))$. The operator essentially requires a multiway join between the vertex and edge RDDs. This operation is optimized by shifting the site of the joins to the edges, using the routing table, so that only vertex data needs to be shuffled. * `leftJoin` - given a collection of vertices and a graph, returns a new graph which incorporates the properties of matching vertices from the given collection into the given graph without changing the underlying graph structure. * `subgraph` - returns a subgraph of the original graph by applying predicates on edges and vertices. * `mrTriplets` (MapReduce triplets) - logical composition of triplets followed by map and reduceByKey. It is the building block of graph-parallel algorithms. All these operators can be expressed in terms of relational operators and can be composed together to express different graph-parallel abstractions. The paper shows how these operators can be used to construct an enhanced version of Pregel based on GAS. It also shows how to express the connected components algorithm and a `coarsen` operator. ### Structural Index Reuse Collections and graphs, being immutable, share the structural indexes associated with each vertex and edge partition to both reduce memory overhead and accelerate local graph operations. Most of the operators preserve the structural indexes so as to reuse them. For operators like subgraph which restrict the graph, the bitmask is used to construct the restricted view. ### Distributed Join Optimization ##### Incremental View Maintenance The number of vertices that change between different steps of iterative graph algorithms decreases as the computation converges. After each operation, GraphX tracks which vertices have changed by maintaining a bitmask. When materializing a vertex view, it uses values from the previous view for vertices which have not changed and ships only those vertices which have changed. This also allows for another optimization when using the `mrTriplets` operation: `mrTriplets` supports an optional argument called *skipStale*. When this option is enabled, the `mrTriplets` function is not applied on edges originating from vertices that have not changed since the last iteration. This optimization uses the same bitmask that the incremental views use. ##### Automatic Join Elimination GraphX has implemented a JVM bytecode analyzer that determines whether source/target vertex attributes are referenced in an mrTriplets UDF (for map) or not. Since edges already contain the vertex ids, a 3-way join can be brought down to a 2-way join if only the source/target vertex attributes are needed (as in the PageRank algorithm), or the join can be completely eliminated if none of the vertex attributes are referenced. ### Sequential Scan vs Index Scan Reusing structural indices reduces computation cost in iterative algorithms but prevents the physical data from shrinking. To counter this issue, GraphX switches from sequential scan to bitmap index scan when the fraction of active vertices drops below 0.8.
Since edges are clustered by source vertex id, the bitmap index scan can efficiently join edges and vertices together. ### Other Optimizations * Though GraphX uses Spark's shuffle mechanism, it materializes shuffled data in memory itself, unlike Spark which materializes shuffle data on disk and relies on the OS buffer cache to cache the data. The rationale behind this modification is that graph algorithms tend to be communication-intensive and the inability to control when buffers are flushed can lead to additional overhead. * When implementing the join step, vertices routed to the same target are batched, converted from row-orientation to column-orientation, compressed using the LZF algorithm, and then sent to their destination. * During shuffling, integers are encoded using a variable-length encoding scheme where, for each byte, the first 7 bits encode the value and the highest-order bit indicates if another byte is needed for encoding the value. So smaller integers can be encoded with fewer bytes and since, in most cases, vertex ids are smaller than 64 bits, the technique helps to reduce the amount of data to be moved (see the sketch at the end of this summary). ### System Evaluation GraphX was evaluated against graph algorithms implemented over Spark 0.8.1, Giraph 1.0 and GraphLab 2.2 for both graph-parallel computation tasks and end-to-end graph analytic pipelines. Key observations: * GraphLab benefits from its native runtime and performs best among all the implementations for both the PageRank and Connected Components algorithms. * For the connected components algorithm, Giraph benefits from using edge cuts but suffers from Hadoop overhead. * GraphX outperforms the idiomatic implementation of PageRank on Spark, benefitting from the various optimizations discussed earlier. * As more machines are added, GraphX does not scale linearly but it still outperforms the speedup achieved by GraphLab (for PageRank). * GraphX outperforms Giraph and GraphLab for a multi-step, end-to-end graph analytics pipeline that parses Wikipedia articles to make a link graph, runs PageRank on the link graph and joins the top 20 articles with their text. GraphX provides a small set of core graph-processing operators, implemented on top of relational operators, by efficiently encoding graphs as a collection of edges and vertices with two indexing data structures. While it does lag behind specialised systems like Giraph and GraphLab in terms of graph-parallel computation tasks, GraphX does not aim at speeding up such tasks. It instead aims to provide an efficient workflow in an end-to-end graph analytics system by combining data-parallel and graph-parallel computations in the same framework. Given that it does outperform all the specialised systems in terms of end-to-end runtime for graph pipelines and makes the development process easier by eliminating the need to learn and maintain multiple systems, it does seem to be a promising candidate for the use case it is attempting to solve. |
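The variable-length integer encoding used during shuffling (described above) can be sketched in a few lines of Python; this is just an illustration of the scheme, not GraphX's actual implementation.

```python
def encode_varint(value):
    """Encode a non-negative integer, 7 value bits per byte;
    the high bit of a byte is set when more bytes follow."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def decode_varint(data):
    value, shift = 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return value

assert len(encode_varint(5)) == 1            # a small vertex id fits in one byte
assert len(encode_varint(300)) == 2          # larger values grow byte by byte
assert decode_varint(encode_varint(300)) == 300
```

A 64-bit id that is numerically small therefore travels as one or two bytes instead of eight, which is where the savings during the shuffle come from.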
[link]
This paper, by Yahoo, describes a new language called Pig Latin which is intended to provide a middle ground between a declarative, SQL-style language (which many developers find unnatural) and the procedural map-reduce model (which is very low-level and rigid). It also introduces a novel, interactive debugging environment called Pig Pen. #### Overview A Pig Latin program is a sequence of steps — each step carrying out a single high-level transformation. Effectively the program specifies the query execution plan itself. The program then compiles into map-reduce jobs which are run on Hadoop (though other backends can also be plugged in). Pig Latin is more expressive than map-reduce, which is essentially limited to a one-input, two-stage data flow model. Moreover, since the map and reduce functions are opaque to each other, optimizations are hard to bake into the system itself. This limitation is also overcome with Pig Latin, which allows the programmer to order the execution of individual steps by specifying the execution plan. Unlike a traditional DBMS, Pig does not require data to be imported into system-managed tables as it is meant for offline, ad-hoc, scan-centric, read-only workloads. Pig supports User Defined Functions (UDFs) out of the box. Since it targets only parallel processing, there is no inbuilt support for operations like non-equi joins or correlated subqueries. These operations can still be performed using UDFs. #### Nested Data Model Pig supports a flexible, nested data model with 4 supported types: 1. Atom: Simple values like string or integer, e.g. ‘medium’ 2. Tuple: Sequence of ‘fields’ which can be of any type, e.g. (‘medium’, 12) 3. Bag: Collection of tuples, e.g. {(‘medium’, 12), ((‘github’, ‘twitter’), 12)} 4. Map: Collection of key-value pairs where keys are atoms and values can be of any type, e.g. [‘key’ -> ‘value’, ‘anotherKey’ -> (‘another’, ‘value’)] #### Inbuilt functions 1. FOREACH — Specifies how each tuple is to be processed. The semantics require that there should be no dependence between the processing of different input tuples, to allow parallel processing. 2. COGROUP — Suppose we have two datasets: results = (query, url) // a query and its result url, and revenue = (query, amount) // a query and the revenue generated by the query. Cogrouping these two datasets, we get grouped_data = COGROUP results BY query, revenue BY query. grouped_data would contain one tuple for each group. The first field of the tuple is the group identifier and the other fields are bags — one for each dataset being cogrouped. So the first bag would correspond to results and the other to revenue (a small sketch of these semantics appears at the end of this summary). A sample dataset has been shown here. COGROUP is one level lower than JOIN as it only groups together tuples into nested bags. It can be followed by an aggregate operation or a cross-product operation (leading to the result expected from a JOIN operation). GROUP is the same as COGROUP on a single dataset. 3. Nested Operations, where commands can be nested within the FOREACH command to process bags within tuples. Other functions include LOAD, STORE, FILTER, JOIN, CROSS, DISTINCT, UNION and ORDER, and are similar in operation to the equivalent SQL operations. One may ask how Pig Latin differs from an SQL-style query language when it seems to use similar operations. Compare the following queries, which generate the same result.
The first one is written in SQL (declarative) and the other in Pig Latin (procedural):

    SELECT category, AVG(pagerank)
    FROM urls
    WHERE pagerank > 0.2
    GROUP BY category
    HAVING COUNT(*) > 1000000

    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

#### Implementation The Pig interpreter parses the commands and verifies that the referred input files and bags are valid. It then builds a logical plan for every bag that is being defined, using the logical plans for the input bags and the current command. These plans are evaluated lazily to allow for in-memory pipelining and filter reordering. The parsing and logical plan construction are independent of the execution platform, while the compilation of the logical plan into a physical plan depends on the execution platform. For each COGROUP command, the compiler generates a map-reduce job where the map function assigns keys for grouping and the reduce function is initially a no-op. The commands intervening between two subsequent COGROUP commands A and B are pushed into the reduce function of A to reduce the amount of data to be materialized between different jobs. The operations before the very first COGROUP operation are pushed into the map function of A. The ORDER command compiles into two map-reduce jobs. The first job samples the input data to determine quantiles of the sort key. The second job range-partitions the input data according to the quantiles to provide equal-sized partitions, followed by local sorting in the reduce phase, resulting in a globally sorted file. The inflexibility of the map-reduce primitive results in some overheads while compiling Pig Latin into map-reduce jobs. For example, data must be materialized and replicated on the distributed file system between successive map-reduce jobs. When dealing with multiple data sets, an additional field must be inserted in every tuple to indicate which data set it came from. However, the Hadoop map-reduce implementation does provide many desired properties such as parallelism, load balancing, and fault-tolerance. Given the productivity gains to be had through Pig Latin, the associated overhead is often acceptable. Besides, there is the possibility of plugging in a different execution platform that can implement Pig Latin operations without such overheads. Since the files reside in the Hadoop distributed file system, the LOAD operation can run in parallel. Similarly, parallelism is achieved for the FILTER and FOREACH operations as any map-reduce job runs several map and reduce instances in parallel. The COGROUP operation runs in parallel by re-partitioning the output from multiple map instances to multiple reduce instances. To achieve efficiency when working with nested bags, Pig uses Hadoop’s combiner function to achieve a two-tier tree evaluation of algebraic functions. So all UDFs of an algebraic nature benefit from this optimization. Of course, non-algebraic functions can not take advantage of this. #### Pig Pen It is the interactive debugging environment that also helps to construct Pig Latin programs. Typically the programmer would write a program and run it on a dataset, or a subset of the dataset (if running over the entire dataset is too expensive), to check for correctness and change the program accordingly.
Pig Pen can dynamically create a side data set (a subset of the complete dataset), called the sandbox dataset, that can be used to test the program being constructed. This dataset is intended to be real (i.e. a subset of the actual data), concise, and complete (illustrating the key semantics of each command). While the paper does not go into the details of how this dataset is created, it does mention that it starts by taking small, random samples of the base data and synthesizes additional data tuples to improve completeness. Within Yahoo, Pig Latin has been used in a variety of scenarios like computing rollup aggregates and performing temporal and session analysis. While Pig Latin does provide a powerful nested data model and supports UDFs, making it easier to write and debug map-reduce jobs, it does not deal with issues like materializing and replicating data between successive map-reduce jobs. The paper argues that this overhead is acceptable given the numerous advantages Hadoop offers. Pig has come a long way since the paper was written. A lot of new functions have been added and it now comes with an interactive shell called Grunt. Moreover, UDFs can now be written in various scripting languages and not just Java. All these changes have made Pig more powerful and accessible than before. |
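The COGROUP semantics described earlier can be illustrated with a small Python sketch (hypothetical data, purely for intuition): each output row carries the group key plus one bag per input dataset, with no flattening.

```python
from collections import defaultdict

results = [("lakers", "nba.com"), ("lakers", "espn.com"), ("kings", "nhl.com")]
revenue = [("lakers", 50), ("kings", 30), ("kings", 10)]

def cogroup(*datasets):
    """Group several (key, value) datasets by key, keeping one bag per dataset."""
    grouped = defaultdict(lambda: [[] for _ in datasets])
    for i, dataset in enumerate(datasets):
        for key, value in dataset:
            grouped[key][i].append((key, value))
    return [(key, *bags) for key, bags in grouped.items()]

for key, result_bag, revenue_bag in cogroup(results, revenue):
    print(key, result_bag, revenue_bag)
# lakers [('lakers', 'nba.com'), ('lakers', 'espn.com')] [('lakers', 50)]
# kings  [('kings', 'nhl.com')] [('kings', 30), ('kings', 10)]
```

A JOIN would go one step further and take the cross product of the two bags within each group, which is why COGROUP is described as sitting one level below JOIN.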
[link]
The paper describes the architecture of RDDs (Resilient Distributed Datasets), what problems they can be used to solve, how they perform on different benchmarks and how they are different from existing solutions. Many generalized cluster computing frameworks, like MapReduce and Dryad, lack in two areas: 1. Iterative algorithms where intermediate results are used across multiple computations. 2. Interactive data analysis where users run ad-hoc queries on the data. One way around these problems is to use specialized frameworks like Pregel. But this leads to loss of generality. This is the problem that RDDs intend to solve — by providing a general-purpose, fault-tolerant, distributed memory abstraction. #### RDD Overview RDDs are immutable partitioned collections that are created through deterministic operations on data in stable storage or on other RDDs. They keep enough information about how they are derived from other sources (this information is called lineage). This lineage ensures that RDDs can be easily reconstructed in case of failures without having to perform explicit checkpointing. In fact, a program can not reference an RDD that it can not reconstruct after a failure. RDDs are lazy and ephemeral. They are constructed on demand and discarded after use. This allows for pipelining of many operations. For example:

    rawData = spark.textfile(filepath) // read data from file
    dataAfterApplyingFirstFilter = rawData.filter(condition1)
    dataAfterApplyingSecondFilter = dataAfterApplyingFirstFilter.filter(condition2)
    dataAfterApplyingSecondFilter.save()

The execution will take place only at line 4 (the save call), and the two filter conditions can be merged into a single condition to avoid multiple passes over the data. #### RDD Model RDDs provide an interface based on fine-grained reads and coarse-grained updates. This means transformations (functions) are applied to all data items. These transformations can be logged to build the lineage graph so as to provide fault tolerance. But this coarse-grained update nature makes RDDs unsuitable for applications like an incremental web crawler that needs asynchronous fine-grained updates to a shared state. In such cases, DSM (Distributed Shared Memory) would be a better choice as it provides fine-grained reads and writes. RDDs, however, offer many advantages over DSM. First, unlike DSM, RDDs do not need to incur checkpointing overhead. Second, RDDs, being immutable, can mitigate stragglers (slow nodes) by running backup tasks, just like MapReduce. Third, since only bulk writes are supported, the runtime can schedule tasks based on data locality to enhance performance. Lastly, even if RDDs choose to take checkpoints (in cases where the lineage graph grows very big), consistency is not a concern because of the immutable nature of RDDs. RDDs have been implemented in Spark to provide a language-integrated API. Details about this implementation have been discussed here separately. #### Representing RDDs The paper proposes a graph-based representation of RDDs where an RDD is expressed through a common interface that exposes five functions: 1. partition — represents atomic pieces of the dataset. 2. dependencies — list of dependencies that an RDD has on its parent RDDs or data sources. 3. iterator — a function that computes an RDD based on its parents. 4. partitioner — whether data is range/hash partitioned. 5. preferredLocation — nodes where a partition can be accessed faster due to data locality. The most interesting aspect of this representation is how dependencies are expressed.
Dependencies belong to one of two classes: 1. Narrow Dependencies — where each partition of the parent RDD is used by at most one child partition. For example, map and filter operations. 2. Wide Dependencies — where multiple child partitions use a single parent partition. Narrow dependencies support pipelined execution on one cluster node, while wide dependencies require data from all parent partitions to be available and to be shuffled across nodes. Recovery is easier with narrow dependencies, while in the case of wide dependencies, the failure of a single partition may require a complete re-execution. The figure below shows some examples of narrow and wide dependencies (see also the code sketch at the end of this summary). Note that the join operation defines a narrow dependency when the parents are hash-partitioned and a wide dependency in other cases. ![Figure 1: Example of narrow and wide dependencies](https://cdn-images-1.medium.com/max/800/1*9I0CaywrdzUpg7RKbGlMow.png) #### Job Scheduler Whenever an “action” is executed, the scheduler builds a DAG (Directed Acyclic Graph) of stages based on the lineage graph. Each stage contains pipelined transformations with narrow dependencies. The boundaries between different stages are the shuffle operations required by wide dependencies. Some of these stages may be precomputed (due to the persistence of previous computations). For the remaining tasks, the scheduler uses delay scheduling to assign tasks to machines based on data locality. For wide dependencies, intermediate records are materialized on the nodes holding the parent partitions. ### Evaluation Spark outperforms Hadoop and HadoopBinMem for the following reasons: 1. The minimum overhead of the Hadoop stack, as Hadoop incurs around 25 seconds of overhead to complete the minimal requirements of job setup, starting tasks, and cleaning up. 2. The overhead of HDFS while serving data, as HDFS performs multiple memory copies and a checksum to serve each block. 3. The deserialization cost of converting binary data to in-memory Java objects. Note that HadoopBinMem converts input data to a low-overhead binary format and stores it in an in-memory HDFS instance. Case studies also show that Spark performs well for interactive data analysis and other user applications. One limitation of the experiments is that in all the cases comparing the 3 systems, the cluster had sufficient RAM to keep all the data in memory. It would have been interesting to compare the performance of the three systems in the case where the cluster does not have sufficient RAM to keep the entire data in main memory. #### Comparison with existing systems RDDs and Spark learn from and improve upon existing systems in many ways. 1. Data flow models like MapReduce share data through stable storage but have to incur the cost of data replication, I/O and serialization. 2. DryadLINQ and FlumeJava provide language-integrated APIs and pipeline data across operators in the same query. But unlike Spark, they can not share data across multiple queries. 3. Piccolo and DSM do not provide a high-level programming interface like RDDs. Moreover, they use checkpointing and rollback, which are more expensive than the lineage-based approach. 4. Nectar, Ciel and FlumeJava do not provide in-memory caching. 5. MapReduce and Dryad use lineage-based recovery within a computation, but this information is lost after a job ends. In contrast, RDDs persist lineage information across computations. RDDs can be used to express many existing models like MapReduce, DryadLINQ, Pregel, Batched Stream Processing, etc.
This seems surprising given that RDDs offer only a limited interface due to their immutable nature and coarse-grained transformations. But these limitations have a negligible impact on many parallel applications. For example, many parallel programs prefer to apply the same operation to many records to keep the program simple. Similarly, multiple RDDs can be created to represent different versions of the same data. The paper also offers an interesting insight on the question of why previous frameworks could not offer the same level of generality. It says previous frameworks did not observe that “the common cause of these problems was a lack of data sharing abstractions”. |
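As an illustration of narrow versus wide dependencies, here is a small PySpark sketch (my own example, assuming a local Spark installation and a hypothetical `access.log` file): `filter` and `map` create narrow dependencies and are pipelined into a single stage, while `reduceByKey` introduces a wide dependency and therefore a shuffle boundary.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "dependency-demo")

lines = sc.textFile("access.log")                 # hypothetical input file
errors = lines.filter(lambda l: "ERROR" in l)     # narrow dependency
pairs = errors.map(lambda l: (l.split()[0], 1))   # narrow dependency, same stage
counts = pairs.reduceByKey(lambda a, b: a + b)    # wide dependency -> shuffle boundary

print(counts.toDebugString())  # lineage, with stage/shuffle boundaries visible
counts.collect()               # the action that actually triggers execution
sc.stop()
```

Nothing runs until `collect()` is called; at that point the scheduler builds the DAG of stages from this lineage, exactly as described above.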
[link]
The paper introduced the famous MapReduce paradigm and created a huge impact in the BigData world. Many systems, including Hadoop MapReduce and Spark, were developed along the lines of the paradigm described in the paper. MapReduce was conceptualized to handle cases where the computation to be performed was straightforward but parallelizing the computation and taking care of other aspects of distributed computing was a pain. A MapReduce computation takes as input a set of key/value pairs and outputs another set of key/value pairs. The computation consists of 2 parts: 1. Map — A function to process input key/value pairs to generate a set of intermediate key/value pairs. All the values corresponding to each intermediate key are grouped together and sent over to the Reduce function. 2. Reduce — A function that merges all the intermediate values associated with the same intermediate key (a small word-count sketch appears at the end of this summary). The Map/Reduce primitives are inspired by similar primitives defined in Lisp and other functional programming languages. A program written in MapReduce is automatically parallelized without the programmer having to care about the underlying details of partitioning the input data, scheduling the computations or handling failures. The paper mentions many interesting applications, like distributed grep, inverted index and distributed sort, which can be expressed with MapReduce logic. #### Execution Overview When the MapReduce function is invoked, the following steps take place: 1. The input data is partitioned into a set of M splits of equal size. 2. One of the nodes in the cluster becomes the master and the rest become the workers. There are M Map and R Reduce tasks. 3. The Map invocations are distributed across multiple machines containing the input partitions. 4. Each worker reads the content of its partition and applies the Map function to it. Intermediate results are buffered in memory and periodically written back to local disk. 5. The locations are passed on to the master, which passes them on to the reduce workers. These workers read the intermediate data and sort it. 6. Each reduce worker iterates over the sorted data and, for each unique intermediate key, passes the key and the intermediate values to the Reduce function. The output is appended to an output file. 7. Once all map and reduce tasks are over, the control is returned to the user program. As we saw, the map phase is divided into M tasks and the reduce phase into R tasks. Keeping M and R much larger than the number of nodes helps to improve dynamic load balancing and speeds up failure recovery. But since the master has to make O(M+R) scheduling decisions and keep O(M*R) states in memory, the values of M and R can not be arbitrarily large. #### Master Node The master maintains the state of each map-reduce task and the identity of each worker machine. The location of the intermediate files also moves between the map and reduce operations via the master. The master pings each worker periodically. In case it does not hear back, it marks the worker as failed and assigns its task to some other worker. If a map task fails, all the reduce workers are notified about the newly assigned worker. Master failure results in computation termination, in which case the client may choose to restart the computation. #### Locality MapReduce takes advantage of the fact that the input data is stored on the machines performing the computations. The master tries to schedule a map job on a machine that contains the corresponding input data. Thus, most of the data is read locally and network IO is saved.
This locality optimization is inspired by active disks where computation is pushed to processing elements close to the disk. #### Backup Tasks Stragglers are machines that take an unusually long time to complete one of the last few Map or Reduce tasks and can increase the overall execution time of the program. To account for these, when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. This may lead to some redundant computation but can help to reduce the overall execution time. #### Refinements MapReduce supports a variety of refinements including: 1. Users can specify an optional combiner function that performs a partial merge over the data before sending it over the network. This combiner operation is performed after the Map function and reduces the amount of data to be sent over the network. 2. MapReduce can be configured to detect if certain records fail deterministically and skip those records. It is very useful in scenarios where a few missing records can be tolerated. 3. Within a given partition, the intermediate key/value pairs are guaranteed to be processed in increasing key order. 4. Side-effects are supported in the sense that MapReduce tasks can produce auxiliary files as additional outputs from their operation. 5. MapReduce can be run in sequential mode on the local machine to facilitate debugging and profiling. 6. The master runs an internal HTTP server that exports status pages showing the metadata of the computation and links to output and error files. 7. MapReduce provides a counter facility to keep track of the occurrence of various events, like the number of documents processed. Counters like the number of key-value pairs processed are tracked by default. Other refinements include a custom partitioning function and support for different input/output types. MapReduce was profiled for two computations (grep and sort) running on a large cluster of commodity machines. The results were very solid. The grep program scanned ten billion records in 150 seconds and the sort program could sort ten billion records in 891 seconds (including the startup overhead). Moreover, the profiling showed the benefit of backup tasks and that the system is resilient to node failures. Other than performance, MapReduce offers many more benefits. The user code tends to be small and understandable as the code taking care of the distributed aspects is now abstracted away. So the user does not have to worry about issues like machine failure and networking errors. Moreover, the system can be easily scaled horizontally. Lastly, the performance is good enough that conceptually unrelated computations can be maintained separately instead of having to mix everything together in the name of saving that extra pass over the data. |
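Here is a minimal Python sketch of the paradigm for the canonical word-count example (a plain single-process simulation for intuition, not Google's implementation): the map function emits (word, 1) pairs, the framework groups values by intermediate key, and the reduce function sums them.

```python
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word, 1                    # intermediate key/value pair

def reduce_fn(word, counts):
    return word, sum(counts)             # merge all values for one key

def mapreduce(documents):
    grouped = defaultdict(list)          # the "shuffle": group values by key
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In the real system the map calls, the shuffle, and the reduce calls each run in parallel across many machines, with the master coordinating as described above.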
[link]
Bigtable is the distributed, scalable, highly available and high-performing database that powers multiple applications across Google — each having a different demand in terms of the size of data to be stored and the latency with which results are expected. Though Bigtable is not open source itself, the paper was crucial in the development of powerful open source databases like Cassandra (which borrows Bigtable’s data model), HBase and LevelDB. The paper Bigtable: A Distributed Storage System for Structured Data describes the story of how Bigtable is designed and how it works. #### Data Model Bigtable is a sorted, multidimensional key-value store indexed by a row key, a column key, and a timestamp (a small sketch of this data model appears at the end of this summary). For example, in the context of storing web pages, the page URL could be the row key, various attributes of the page could be the column keys and the times when the page was fetched could be the timestamps. Rows are maintained in lexicographic order and the row range is dynamically partitioned, with each row range being referred to as a tablet. Reads over short row ranges are very fast due to locality, and all read/write operations under a single row are atomic. Column keys are grouped into sets called column families. These sets are created before any data is stored. A table should have a small number of distinct column families. A column key has the following syntax — family:qualifier. For example, for the family language, the keys could be language:english or language:hindi. A row in a Bigtable can contain multiple versions of the same data, indexed by timestamps and stored in decreasing order of timestamps. Old versions can be garbage collected; the client can specify that only the last n entries are to be kept, or that only entries from the last n days (hours/weeks, etc.) are to be kept. #### Building Blocks Bigtable uses Google File System (GFS) for storing logs and data in the SSTable file format. An SSTable provides an immutable, ordered (key, value) map. Immutability provides certain advantages which will be discussed later. An SSTable consists of a sequence of blocks and a block index to locate the blocks. When the SSTable is opened, this index is loaded into main memory for fast lookup. Bigtable also uses a highly available and persistent distributed lock service called Chubby for handling synchronization issues. Chubby provides a namespace consisting of directories and small files which can be used as locks. Read and write access to a file is atomic. Clients maintain sessions with the Chubby service which need to be renewed regularly. Bigtable depends on Chubby so much that if Chubby is unavailable for an extended period of time, Bigtable will also be unavailable. #### Implementation There are 3 major components — a library linked into the client, one master server and multiple tablet servers. The master server assigns tablets to tablet servers, load-balances these servers and garbage collects files in GFS. But the client data does not move through the master; clients directly contact tablet servers for read and write operations. Tablet servers manage a set of tablets — they handle the read/write operations directed to these tablets and also split very large tablets. A 3-level hierarchy is maintained to store tablet location information. The first level is a file stored in Chubby that has the location of the root tablet. The root tablet contains the location of all the tablets of a special METADATA table, and each METADATA tablet contains the locations of a set of user tablets.
These tablet locations are cached in the client library and can also be fetched from the above scheme by recursively moving up the hierarchy. Each tablet is assigned to one tablet server at a time. The master server keeps track of which servers are alive, which tablet is assigned to which server and which tablets are unassigned. When a tablet server starts, it creates and acquires an exclusive lock on a unique file in the servers directory (a Chubby directory), which is monitored by the master server. If the tablet server loses its exclusive lock, it will stop serving its tablets, though it will attempt to reacquire the lock as long as the file exists in the servers directory. In case the file is deleted, the tablet server kills itself. The master tracks tablets that are not being served and reassigns them. To avoid network partition issues, the master kills itself if its Chubby session expires, though its tablets are not reassigned. A master performs the following tasks at startup time: 1. Grab the unique master lock to prevent concurrent master instantiation. 2. Find live servers by scanning the servers directory. 3. Find what tablets are assigned to each server by messaging them. 4. Find the set of all tablets by scanning the METADATA table. Tablet creation, deletion or merging is initiated by the master server, while tablet splitting is handled by the tablet servers, which notify the master. In case the notification is lost, the master would be notified when it asks for the split tablet to be loaded. Memtables are in-memory buffers that store recently committed updates to Bigtable. These updates are later written to GFS for persistent storage. Tablets can be recovered by reading the metadata from the METADATA table, which contains a set of redo points and a list of SSTables comprising the tablet. There are 3 kinds of compactions in action: 1. Minor compaction — As the memtable grows in size and reaches a certain threshold, it is frozen, a new memtable is created and the frozen memtable is written to GFS. 2. Merging compaction — Minor compactions keep increasing the count of SSTables. A merging compaction reads the contents of a few SSTables and the memtable and writes them out to a new SSTable. 3. Major compaction — In a major compaction, all the SSTables are written into a single SSTable. #### Refinements Along with the described implementation, several refinements are required to make the system reliable and high performing. Clients can club together column families into locality groups for which separate SSTables are generated. This allows for more efficient reads. If SSTables are small enough, they can be kept in main memory as well. SSTables can be compressed in various formats. The compression ratio gets even better when multiple versions of the same data are stored. **Caching** — Tablet servers employ 2 levels of caching. 1. Scan Cache — It caches the (key, value) pairs returned by the SSTable and is useful for applications that read the same data multiple times. 2. Block Cache — It caches SSTable blocks read from GFS and is useful when the application exploits locality of reference. **BloomFilters** are created for SSTables (particularly for the locality groups). They help to reduce the number of disk accesses by predicting if an SSTable may contain data corresponding to a particular row/column pair. **Commit Logs** are maintained at the tablet server level instead of the tablet level to keep the number of log files small.
Each tablet server maintains 2 log-writing threads, each writing to its own, separate log file, and only one of the threads is active at a time. If one of the threads is performing poorly (say, due to network congestion), the writing switches to the other thread. Log entries have sequence numbers so that the recovery process can skip the entries duplicated by this switching. We earlier saw that SSTables are immutable. The advantage is that no synchronization is needed during read operations. This also makes it easier to split tablets. Permanently removing deleted data is taken care of by the garbage collector, which removes obsolete SSTables. |
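Here is a tiny Python sketch (purely illustrative, not Bigtable's API) of the data model described at the top of this summary: cells are addressed by (row key, column key, timestamp), reads return the newest versions first, and older versions beyond a configured count are candidates for garbage collection.

```python
from collections import defaultdict

class TinyTable:
    def __init__(self, versions_to_keep=3):
        self.cells = defaultdict(dict)      # (row, "family:qualifier") -> {timestamp: value}
        self.versions_to_keep = versions_to_keep

    def put(self, row, column, timestamp, value):
        self.cells[(row, column)][timestamp] = value

    def get(self, row, column):
        """Return up to n versions, newest first (older ones could be garbage collected)."""
        versions = self.cells[(row, column)]
        newest = sorted(versions, reverse=True)[: self.versions_to_keep]
        return [(ts, versions[ts]) for ts in newest]

table = TinyTable()
table.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
table.put("com.cnn.www", "contents:", 5, "<html>v5</html>")
print(table.get("com.cnn.www", "contents:"))
# [(5, '<html>v5</html>'), (3, '<html>v3</html>')]
```

The real system adds the sorted row space, tablets, column families, and locality groups on top of this basic (row, column, timestamp) to value mapping.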
[link]
This paper talks about how Spark SQL intends to integrate relational processing with Spark itself. It builds on the experience of previous efforts like Shark and introduces two major additions to bridge the gap between relational and procedural processing: 1. A DataFrame API that provides a tight integration between relational and procedural processing by allowing both relational and procedural operations on multiple data sources. 2. Catalyst, a highly extensible optimizer which makes it easy to add new data sources and algorithms. #### Programming Interface Spark SQL uses a nested data model based on Hive and supports all major SQL data types along with complex (e.g. array, map, etc) and user-defined data types. It ships with a schema inference algorithm for JSON and other semistructured data. This algorithm is also used for inferring the schemas of RDDs (Resilient Distributed Datasets) of Python objects. The algorithm attempts to infer a static tree structure of STRUCT types (which in turn may contain basic types, arrays etc) in one pass over the data. The algorithm starts by finding the most specific Spark SQL type for each record and then merges them using an associative most specific supertype function that generalizes the types of each field. A DataFrame is a distributed collection of rows with the same schema. It is equivalent to a table in an RDBMS. DataFrames are similar to the native RDDs of Spark as they are evaluated lazily, but unlike RDDs, they have a schema. A DataFrame represents a logical plan, and a physical plan is built only when an output function like save is called. Deferring the execution in this way leaves more room for optimizations. Moreover, DataFrames are analyzed eagerly to check whether the column names and data types are valid. DataFrames support querying using both SQL and a DSL which includes all common relational operators like select, where, join and groupBy (see the sketch at the end of this summary). All these operators build up an abstract syntax tree (AST) of the expression (think of an expression as a column in a table), which is then optimized by Catalyst. Spark SQL can cache data in memory using columnar storage, which is more efficient than Spark’s native cache that simply stores data as JVM objects. The DataFrame API supports user-defined functions (UDFs) which can use the full Spark API internally and can be registered easily. To query native datasets, Spark SQL creates a logical data scan operator (pointing to the RDD) which is compiled into a physical operator that accesses fields of the native objects in-place, extracting only the fields needed for a query. This is better than traditional object-relational mapping (ORM) which translates an entire object into a different format. Spark MLlib implemented a new API based on the pipeline concept (think of a pipeline as a graph of transformations on the data) and chose DataFrames as the format to exchange data between pipeline stages. This makes it much easier to expose MLlib’s algorithms in Spark SQL. #### Catalyst Catalyst is an extensible optimizer based on Scala’s standard features. It supports both rule-based and cost-based optimizations and makes it easy to add new optimization techniques and data sources to Spark SQL. At its core, Catalyst is powered by a general library that represents and manipulates trees by applying rules to them. The tree is the main data type in Catalyst and is composed of node objects, where a node has a type and zero or more children. Rules are functions that transform one tree into another.
Trees offer a transform method that applies a pattern matching function recursively on all the nodes of the tree, transforming only the matching nodes. Rules can match multiple patterns in the same transform call and can contain arbitrary Scala code which removes the restriction of using a Domain Specific Language (DSL) only. Catalyst groups rules into batches and executes each batch until it reaches a fixed point (ie the tree stops changing). This means that each rule can be simple and self-contained while producing a global effect on the tree. Since both nodes and trees are immutable, optimizations can be easily performed in parallel as well. Spark SQL uses Catalyst in four phases: **Logical Plan Analysis** which requires resolving attribute references (one for which we do not know the type or which have not been matched to an input table). It uses a Catalog object to track the tables in all data sources to resolve references. **Logical Optimization** phase applies standard rule-based optimizations to the logical plan which include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, etc. In **Physical Planning** phase, Spark SQL generates multiple physical plans corresponding to a single logical plan and then selects one of the plans using a cost model. It also performs some rule-based physical optimizations like the previous case. In **Code Generation** phase, Catalyst uses quassiquotes (provided by Scala) to construct an AST that can be fed to Scala Compiler and bytecode can be generated at runtime. Extensions can be added even without understanding how Catalyst works. For example, to add a data source, one needs to implement a createRelation function that takes a set of key-value parameters and returns a BaseRelation object if successfully loaded. This BaseRelation can then implement interfaces to allow Spark SQL access to data. Similarly to add user-define types (UDTs), the UDTs are mapped to Catalyst’s inbuilt types. So one needs to provide a mapping from an object of UDT to a Catalyst row of built in types and an inverse mapping back. #### Future Work Some recent work has also shown that it is quite easy to add special planning rules to optimize for specific use cases. For example, researchers in ADAM Project added a new rule to use interval trees instead of using normal joins for a computational genomics problem. Similarly, other works have used Catalyst to improve generality of online aggregation to support nested aggregate queries. These examples show that Spark and Spark SQL is quite easy to adapt to new use cases as well. As I mentioned previously, I am experimenting with Spark SQL and it does look promising. I have implemented some operators and it is indeed quite easy to extend. I am now looking forward to developing more concrete thing on top of it. |
[link]
The paper introduces a new framework called Spark which focuses on use cases where MapReudce fails. While a lot of applications fit the MapReduce’s acyclic data flow model, there are use cases requiring iterative jobs where MapReduce is not greatly efficient. Many machine learning algorithms fall into this category. Secondly with MapReduce, each query incurs a significant overhead because it is effectively a different job each time. So MapReduce is not the ideal solution for doing interactive analysis. This is the space which Spark intends to fill in. Spark is implemented in Scala and now supports API in Java, Python, and R #### Programming Model The model supports RDDs, parallel operations and 2 type of shared variables. A driver program implements the high-level control flow and launches different operations in parallel. Resilient Distributed Datasets (RDDs) are a read-only collection of objects, partitioned across a set of machines in a cluster. A handle to RDD contains enough information to compute the RDD from data in case of partition failure. RDDs can be constructed: 1. From a file in a Hadoop supported file system. 2. By “parallelizing” a Scala collection. 3. By transforming an existing RDD using operations like flatMap, map, and filter. 4. By changing persistence of an existing RDD. flatMap, map and filter are standard operations as supported by various functional programming languages. For example map will take as input a list of items and return a new list of items after applying a function to each item of the original list. RDDs are lazy and ephemeral. They are constructed on demand and discarded after use. The persistence can be changed to cache which means they are still lazy but are kept in memory (or disk if they can not fit memory) or to save where they are saved to disk only. Some supported parallel operations(at time of writing the paper) include: 1. reduce: Combines dataset elements using an associative function to produce a result at the driver program. 2. collect: Sends all elements of the dataset to the driver program. 3. foreach: Passes each element through a user-provided function Since the paper was authored, Spark has come a long way and support much more transformations (sample, union, distinct, groupByKey, reduceByKey, sortByKey, join, cartesian, etc), parallel operations (shuffle, count, first, take, takeSample etc), and persistence options (memory only, memory and disk, disk only, memory only serialized, etc). Spark supports 2 kinds of shared variables: Broadcast variables — These are read-only variables that are distributed to worker nodes for once and can be used multiple times (for reading). A use case of this variable would be training data which can be sent to all the worker nodes once and can be used for learning different models instead of sending the same data with each model. Accumulators — These variables are also shared with workers, the different being that the driver program can only read them and workers can perform only associative operations on them. A use case could be when we want to count the total number of entries in a data set, each worker fills up its count accumulator and sends it to the driver which adds up all the received values. #### Implementation The core of Spark is the implementation of RDDs. Suppose we start by reading a file, then filtering the lines to get lines with the word “ERROR” in them, then we cache the results and then count the number of such lines using the standard map-reduce trick. 
RDDs will be formed corresponding to each of these steps and these RDDs will be stored as a link-list to capture the lineage of each RDD. https://cdn-images-1.medium.com/max/800/1*6I7aiD2bPrsw32U76Q5q2g.png Lineage chain for distributed dataset objects Each RDD contains a pointer to its parent and information about how it was transformed. This lineage information is sufficient to reconstruct any lost partition and checkpointing of any kind is not required. There is no overhead if no node fails and even if some nodes fails, only select RDDs need to be reconstructed. Internally an RDD object implements a simple interface consisting of three operations: 1. getPartitions, which returns a list of partition IDs. 2. getIterator(partition), which iterates over a partition. 3. getPreferredLocations(partition), which is used for task scheduling to achieve data locality. Spark is similar to MapReduce — it sends computation to data instead of the other way round. This requires shipping closures to workers — closures to define and process a distributed dataset. This is easy given Scala uses Java serialization. However unlike MapReduce, operations are performed on RDDs that can persist across operations. Shared variables are implemented using classes with custom serialization formats. When a broadcast variable b is created with a value v, v is saved to a file in the shared file system. The serialized form of b is a path to this file. When b’s value is queried, Spark checks if v is in the local cache. If not, it is read from the file system. For accumulators, each accumulator is given a unique ID upon creation and its serialized form contains its ID and the “zero” value. On the workers, a separate copy of the accumulator is created for each thread and is reset to “zero” value. Once the task finishes, the updated value is sent to the driver program. #### Future Work The paper describes how an early stage implementation performs on Logistic Regression, Alternating Least Square, and interactive mode. The results seem to outperform MapReduce largely because of caching the results of previous computations. This makes Spark a good alternative for use cases where same data is read into memory again and again (iterative jobs fit the category.) Spark has come a long way since the paper was written. It now supports libraries for handling SQL-like queries (SparkSQL), streaming data (Spark streaming), graphs (GraphX) and machine learning (MLlib) along with more transformations and parallel operations. I came across Spark while working at Adobe Analytics and have been reading about it to learn more. The cool thing about Spark is that it supports interactive analysis and has APIs in Python, R and Java thus making it easy to adopt. While I have not done some much work around Spark, I am looking forward to making something on top of it. |
[link]
Twitter’s Real-Time Related Query Suggestion Architecture. The paper tells the story behind architecture to support Twitter’s real-time related query suggestion, why the architecture had to be designed twice and what lessons can be learned from this exercise. It does not talk much about the algorithms, rather it talks about the different design decisions that lead to the current architecture — the focus is not on how things were done but why they were done a certain way. Twitter has an interesting use case — search assistance — which boils down to things like a user searching for “Obama” and getting results for related queries like “White House” as well. Spelling correction is also a part of search assistance. The problem is quite well researched from volume perspective of data but in Twitter’s context, velocity mattered as much as volume. The results had to adapt to rapidly evolving global conversations in real-time — where real-time loosely translates to a target latency of 10 minutes. The real-time sense is important in Twitter’s context where “relevance” has a temporal aspect as well. For example, after the Nepal Earthquake in 2015, the query “Nepal” led to results related to “earthquake” as well. Then when the new constitution was passed in Nepal, the query “Nepal” led to results related to “constitution”. The time frame in which suggestions have maximum impact is very narrow. A suggestion made too early would seem irrelevant while a suggestion made too late would seem obvious and hence less impactful. These fast moving signals have to be mixed with slow moving ones for insightful results. Sometimes the query volume is very low in which case longer observation period is needed before suggestions can be made. Twitter noticed that 17% of top 1000 query terms churn over an hourly basis which means they are no longer in top 1000 after an hour. Similarly, around 13% of top 1000 query terms are churned out every day. This suggested that a fine-grained tracking of search terms was needed. Twitter started with a basic idea: if two queries are seen in the same context, then they are related. Now they had a large open design space. For example, context can be defined by user’s search session or tweet or both. Measures like log likelihood, chi-square test etc can be used to quantify how often 2 queries appear together. To consider the temporal effect, counts are decayed time. Finally, Twitter has to combine these factors, and some more factors, together to come up with a ranking mechanism. This paper does not focus on what algorithms were chosen for these tasks, it focuses on how an end-to-end system was created. #### Hadoop Solution Twitter has a powerful petabyte-scale Hadoop-based analytics platform. Both real-time and batch processes write data to the Hadoop Distributed File System (HDFS). These include bulk exports from databases, application logs, and many other sources. Contents are serialized using either Protocol Buffers or Thrift, and LZOcompressed. There is a work-flow manager, called Oink, which schedules recurring jobs and handles dataflow dependencies between jobs. For example, if job B requires data generated by job A, A will be scheduled first. Twitter wanted to take advantage of this stack and the first version was deployed in form of a Pig script that aggregated user search sessions to compute term and cooccurrence statistics and ranked related queries on top of the existing stack. 
While the results were pretty good, the latency was too high and results were not available until several hours. #### Bottleneck #1 Log Imports Twitter uses Scribe to aggregate streaming log data in an efficient manner. These logs are rich with user interaction and are used by the search assistant. A Scribe daemon is running on each production server where it collects and sends local log data (consisting of category and message) to a cluster of aggregators which are co-located with a staging Hadoop cluster. This cluster merges per-category streams from the server daemons and writes the results to HDFS of the staging cluster. These logs are then transformed and moved to the main Hadoop data warehouse in chunks of data for an hour. These log messages are put in per-category, per-hour directories and are bundled in a small number of large files. Only now can the search assistant start its computations. The hierarchical aggregation is required to “roll up” data into few, large files as HDFS is not good at handling large numbers of small files. As a result, there is a huge delay from when the logs are generated to when they are available for processing. Twitter estimated that they could bring down the latency to tens of minutes by re-engineering their stack though even that would be too high. #### Bottleneck #2 Hadoop: Hadoop is not meant for latency sensitive jobs. For example, a large job could take tens of seconds to just startup — irrespective of the amount of data crunched. Moreover, the Hadoop cluster was a shared resource across Twitter. Using a scheduler (in this case, FairScheduler) is not the ideal solution as the focus is on predictable end-to-end latency bound and not resource allocation. Lastly, the job completion time depending on stragglers. For some scenarios, a simple hash partitioning scheme created chunks of “work” with varying size. This lead to large varying running times for different map-reduce jobs. For scripts that chain together Hadoop jobs, the slowest task becomes the bottleneck. Just like with log imports, Twitter estimated the best case scenario for computing query suggestions to be of the order of ten minutes. Starting with the Hadoop stack had many advantages like a working prototype was built quickly and ad hoc analysis could be easily done. This also helped them to understand the query churn and make some important observations about factors to use in search assistant. For example, Twitter discovered that only 2 sources of context — search sessions and tweets — were good enough for an initial implementation.But due to high latency, Twitter had to restrict this solution to the experimental stage itself. #### Current Architecture Firehose is the streaming API that provides access to all tweets in real time and the frontend, called Blender, brokers all requests and provides a streaming API for queries — also called query hose. These two streams are used by EarlyBird, the inverted indexing engine, and search assistant engine. Now client logs are not needed as Blender has all search sessions. Twitter search assistance is an in-memory processing engine comprising of two decoupled components: 1. Frontend Nodes — These are lightweight in-memory caches which periodically read fresh results from HDFS. They are implemented as a Thrift service, and can be scaled out to handle increased query load. 2. Backend Nodes — These nodes perform the real computations. The backend processing engine is replicated but not sharded. 
Every five minutes, computed results are persisted to HDFS and every minute, the frontend caches poll a known HDFS location for updated results. Request routing to the replicas is handled by a ServerSet, which provides client-side load-balanced access to a replicated service, coordinated by ZooKeeper for automatic resource discovery and robust failover. Each backend instance is a multi-threaded application that consisting of: 1. Stats collector: Reads the firehose and query hose 2. In-memory stores: Hold the most up-to-date statistics 3. Rankers: Periodically execute one or more ranking algorithm by getting raw features from the in-memory stores. There are three separate in-memory stores to keep track of relevant statistics: 1. Sessions store: Keeps track of (anonymized) user sessions observed in the query hose, and for each session, the history of the queries issued in a linked list. Sessions older than a threshold are discarded. Metadata is also tracked separately. 2. Query statistics store: Retains up-to-date statistics, like session count, about individual queries. These also include a weighted count based on a custom scoring function. This function captures things like association is more between 2 consecutively typed queries vs 2 consecutively clicked hash-tags. These weights are periodically decayed to reflect decreasing importance over time. It also keeps additional metadata about the query like its language. 3. Query cooccurrence statistics store: Holds data about pairs of co-occurring queries. Weighting and decaying are applied like in the case of query statistics store. **Query Flow** — As a user query flows through the query hose, query statistics are updated in the query statistics store, it is added to the sessions store and some old queries may be removed. For each previous query in the session, a query cooccurrence is formed with the new query and statistics in the query cooccurrence statistics store are also updated. **Tweet Flow** — As a tweet flows through the firehose, its n-grams are checked to determine whether they are query-like or not. All matching n-grams are processed just like the query above except that the “session” is the tweet itself. **Decay/Prune cycles** — Periodically, all weights are decayed and queries or co-occurrences with scores below predefined thresholds are removed to control the overall memory footprint of the service. Even user sessions with no recent activity are pruned. **Ranking cycles** — Rankers are triggered periodically to generates suggestions for each query based on the various accumulated statistics. Top results are then persisted to HDFS. #### Scalability 1. Since there is no sharding, each instance of the backend processing engine must consume the entire firehose and query hose to keep up with the upcoming data. 2. The memory footprint for retaining various statistics, without any pruning, is very large. But if the footprint is reduced, by say pruning, the quality and coverage of results may be affected. Another approach could be to store less session history and decay the weights more aggressively though it may again affect the quality of the results. #### Lessons Learnt Twitter managed to solve the problem of fast moving big data but their solution is far from ideal. It works well but only for scenario it is fine-tuned for. What is needed is a unified data platform to process for big and fast moving data with varying latency requirements. 
Twitter’s current implementation is an in-memory engine which mostly uses tweets and search sessions to build the context. Rich parameters like clicks, impressions etc are left out for now to keep the latency under check. Twitter described it as “a patchwork of different processing paradigms”. Though not-so-complete, it is still an important step in the direction of unifying big data and fast data processing systems. A lot of systems exists which solve the problem is pieces eg message queues like Kafka for moving data in real time and Facebook’s ptail for running Hadoop operations in real time but there is no end-to-end general data platform which can adapt itself to perform analytics on both short and long term data and combine their results as per latency bound in different contexts. |
[link]
The paper captures the story of how Facebook created a high-performing, distributed key-value store on top of Memcache (which was back then a simple, in-memory cache) and scaled it to support the world’s largest social network. In Facebook’s context, users consume much more content than they create. So the workload is read intensive and caching help to reduce the workload.MORE ON IT. Facebook uses Memcache Clusters where each Memcache instance is a demand-filled, look-aside cache. This means if a client requests data from Memcache and data is not available, the client would fetch the data from the database and would populate the cache for further requests. Memcache, not being a loading cache, does not have to worry about the logic of retrieving data from the database and can be easily plugged with multiple databases. In case of write requests, the client issues an update command to the database and a delete command is sent to Memcache. Deletion, being idempotent, is preferred over updation. While the overview seems pretty simple, there are more details in actual implementation. Facebook considers and optimizes for 3 deployments scales — a single cluster of servers, data replication between different clusters and spreading clusters across the world. #### A single cluster of servers These are characterized by a highly read intensive workload with requests having a wide fan out. Around 44% of requests contact more than 20 Memcache servers. For popular pages, this number spans well over 100 distinct servers. One reason for this is that each request returns a very small amount of data. In case of get requests, UDP performs better than TCP and get errors are treated as cache miss though insertion is not performed. This design choice seems practical as only .25% of requests fail due to late/ dropped or out of order packets. Though the response size is very small, the variation is quite large with mean = 954 bytes and median = 135 bytes. Set and delete operations are still performed over TCP (for reliability) though the connections are coalesced to improve efficiency. Within a cluster, data is distributed across hundreds of servers through consistent hashing. A very high request rate combined with large fanout leads to an all to all communication between Memcache servers and clients and even a single server can become a bottleneck for many requests. Clients construct a DAG representing the dependency between data so that more independent requests are fired concurrently. Facebook also provides a standalone proxy called mcrouter that acts as an interface to Memcache server interface and routes the requests/replies to/from other servers. Along with these, flow control mechanisms in the form of sliding window mechanism are provided to limit incast congestion. #### Lease Leases are used to address stale sets (when web server writes a stale value in the cache) and thundering herds (when a key undergoes heavy read and write operations). When a client experiences a cache-miss, Memcache gives it a lease (a 64-bit token bound to the requested key). This lease is verified by Memcache when client tries to set the value. If Memcache receives a delete request for the key, the lease is invalidated and the value can not be set. To mitigate thundering herds, Memcache returns a token only once every 10 seconds per key. If a read request comes within 10 seconds of a token issue, the client is notified to retry after a short time, by which the updated value is expected to be set in the cache. 
In situations where returning stale data is not much problem, the client does not have to wait and stale data (at most 10 second old) is returned. #### Memcache Pools Since different workloads can have different access patterns and requirements, Memcache clusters are partitioned into pools. The default pool is called wildcard and then there are separate pools for keys that can not reside in the wildcard pool. A small pool can be provisioned for keys for which cache miss is not expensive. Data is replicated within a pool when request rate is quite high and data can easily fit into a few machines. In Facebook’s context, replication seems to work better than sharding though it has the additional overhead of maintaining consistency. In case a few Memcache servers fail, a small set of machines, called gutters, take over. In case of more widespread failures, requests are diverted to alternate clusters. Regions Multiple front end clusters (web and Memcache clusters) along with a storage cluster (database) defines a region. The storage cluster has the authoritative copy of the data which is replicated across the frontend clusters. To handle data modifications, invalidation daemons called mcsqueal are deployed on each database which parse the queries, extract and group delete statements and broadcast them to all the front end cluster in the region. The batched delete operations are sent to mcrouter instances in each frontend cluster, which then unpack the individual deletes and route them to the concerned Memcache server. As an optimisation, a web server which modifies its data also sends invalidations to its own cluster to ensure read-after-write semantics for a single user request. #### Cold cluster Warmup When a new cluster is brought online, it takes time to get populated and initially the cache hit rate is very low. So a technique called Cold Cluster Warmup is employed where a new cold cluster is populated with data from a warm cluster instead of the database cluster. This way the cold cluster comes to full capacity within few hours. But additional care is needed to account for race conditions. One example could be: a client in cold cluster makes an update and before this update reaches the warm cluster, another request for the same key is made by the cold cluster then the item in the cold cache would be indefinitely inconsistent. To avoid this, Memcache rejects add operations for 2 seconds (called holdoff time)once a delete operation is taken place. So if a value is updated in a cold cluster and a subsequent request is made within 2 seconds, the add operation would be rejected indicating that the data has been updated. 2 seconds is chosen as most updates seems to propagate within this time. #### Across Region consistency Clusters comprising a storage cluster and several front end clusters are deployed in different regions. Only one of the regions contains the master database and rest act as replicas. This adds the challenge of synchronisation. Writes from a master region send invalidations only within the master region. Sending invalidations outside may lead to a race situation where deletes reach before data is replicated. Facebook uses mcsequal daemon helps to avoid that at the cost of serving stale data for some time. 'Writes from a non-master region are handled differently. Suppose a user updates his data from a non-master region with a large replication lag. A cache refill from a replica database is allowed only after the replication stream has caught up. 
A remote marker mechanism is used to minimise the probability if reading stale data. The presence of a marker indicates that data in the replica is stale and the query is redirected to the master region. When a webserver updates a key k, it sets a remote marker rk in the region, performs the write to the master database having key k and deletes k in the local cluster. When it tries to read k next time, it will experience a cache miss, will check if rk exists and will redirect its query to the master cluster. Had rk not been set, the query would have gone to the local cluster itself. Here latency is introduced to make sure most updated data is read. #### Single Server Improvements Facebook introduced many improvements for Memcache servers running as single instances as well. This includes extending Memcache to support automatic expansion of the hash table without the look-up time drifting to $O(n)$, making the server multi-threaded using a global lock and giving each thread its own UDP port to reduce contention. Memcache uses an Adaptive Slab Allocator which organizes memory into slab classes — pre-allocated, uniformly sized chunks of memory. Items are stored in smallest possible slab class which can fit the data. Slab sizes start at 64 bytes and reach up to 1 Mb. Each slab class maintains a free-list of available chunks and requests more memory in 1MB in case the free-list is empty. If no more free memory can be allocated, eviction takes place in Least Recently Used (LR) fashion. The allocator periodically rebalances slab assignments to keep up with the workload. Memory is freed in less used slabs and given to more heavily used slabs. Most of the items are evicted lazily from the memory though some are evicted as soon as they are expired. Short lived items are placed into a circular buffer of linked lists (indexed by seconds until expiration) — called Transient Item Cache — based on the expiration time of the item. Every second, all of the items in the bucket at the head of the buffer are evicted and the head advances by one. By adding a short expiration time to heavily used set of keys whose items with short useful lifespans, the proportion of Memcache pool used by this key family was reduced from 6% to 0.3% without affecting the hit rate. #### Lessons Learnt Memcache’s design decisions are driven by data and not just intuition. Facebooks seems to have experimented with a lot of configurations before arriving on decisions like using UDP for read operations and choosing the value for parameters like holdoff time. This is how it should be — data-driven decisions. In the entire process of developing Memcache, Facebook focused on optimizations which affect a good number of users and usecases instead of optimizing for each possbile use case. Facebook separated caching layer from persistence layer which makes it easier to mould the 2 layers individually. By handling the cache insertion logic in the application itself, Facebook made it easier to plug Memcache with different data stores. By modeling the probability of reading the stale data as a parameter, they allowed the latency and hit rate on the persistence store to be configurable to some extent thus broadening the scope for Memcache. Facebook breaks down a single page load request into multiple requests, thereby allowing for different kind of stale data tolerance for each component. For example, the servers serving the profile picture can cache content longer than the servers serving messages. 
This helps to lower the response latency and reduce the load on the data layer. Facebook’s Memcache is one of the many solutions aimed to solve a rather common problem — a scalable, distributed key-value store. Amazon’s Dynamo solves is well for a write-intensive workload for small data sizes and Facebook solves it for a read intensive workload. Other contenders in the list are Cassandra, Voldemort, Redis, Twitter’s Memcache and more. The sheer number of possible, well known solutions suggests that there are many dimensions to the problem, some of which are yet unknown and that we still do not have a single solution that can be used for all use cases. |
[link]
Amazon’s platform is built upon different techniques working together to provide a single powerful, highly-available system. One of the core components powering this system is Dynamo. There are many services in the Amazon ecosystem which store and retrieve data by primary key. Take the example of the shopping cart service. Customers should be able to view and update their cart anytime. For these services, sophisticated solutions like RDBMS, GFS, etc are an overkill as these services do not need a complex query and data management system. Instead, they need a service which only supports read and write operations on a (key, value) store where the value is a small object (less than 1Mb in size) uniquely identified by the key. The service should be scalable and highly available with a well-defined consistency window. This is what Dynamo is: a scalable, distribute, highly available key-value store that provides an “always-on” experience. #### Design Considerations Dynamo achieves high availability at the cost of weaker consistency. Changes propagate to the replicas in the background and conflict resolution is done at read time to make sure none of the write operations can fail. Dynamo uses simple policies like “last write win” for conflict resolution though applications using Dynamo may override these techniques with their own methods. Eg the cart application may choose to add items across all versions to make sure none of the items is lost. A service could depend on multiple services to get its results. To guarantee that a service returns its results in a bounded time, each dependency in the service has to return its results with even tighter bounds. As a result clients enter into a contract with servers regarding service-related characteristics like expected request distribution rate, expected latency and so on. Such an agreement is called Service Level Agreement(SLA) and must be met to ensure efficiency. SLA apply in the context of Dynamo as well. Dynamo supports incremental scaling where the system is able to scale out one node at a time. Moreover, all the nodes are symmetrical in the sense they have the same set of responsibilities. Since Dynamo is used only by Amazon’s internal applications, there are no security related requirements like authentication and authorization . #### Architecture Dynamo exposes two operations: get() and put(). get(key) returns value or list of values along with context objects corresponding to the key. put(key, context, value) stores the value and the context corresponding to the key. context objects are used for conflict resolution. To support incremental scaling, Dynamo uses consistent hashing for its partitioning scheme. In consistent hashing, the output range of a hash function is treated as a fixed circular space. Each node and data object is assigned a random value or position within this space. A data object is mapped to the first node which is placed clockwise to the position of the data object. Every data item is replicated at N hosts. So every time a data item is assigned to a node, it is replicated to N-1 clockwise successor nodes as well. The list of nodes storing a data item is called its preference list. Generally preference list contains more than N nodes to account for system and network failures. An example case is shown with N = 3. Any key between A and B would be mapped to B (by consistent hashing logic) and to C and D (by replication logic). 
https://cdn-images-1.medium.com/max/800/1*66VMYcQfvG3Z2acQD7aeYQ.png Each time data is created/updated, a new version of data is created. So for a given key, several versions of data (or value) can exist. For versioning, Dynamo uses vector clocks. A vector clock is a list of (node, counter) pairs. When a put operation reaches node X, the node uses the context from the put request to know which version it is updating. If there is an entry corresponding to X in vector clock, the counter is incremented else a new entry is created for node X with counter = 1. When retrieving value corresponding to a key, the node will resolve conflicts among all versions based on Dynamo’s logic or client’s logic. A likely issue with this approach is that the vector clock list may grow very large. To mitigate this, Amazon keeps evicting pairs from the list in ascending order of the time when the entry was created till the size reaches below a threshold. Amazon has not faced any issues related to loss of accuracy with this approach. They also observed that the % of data with at least 2 versions is about 0.06% Dynamo uses a quorum system to maintain consistency. For a read (or write) operation to be successful R (or W) number of replicas out of N replicas must participate in the operation successfully with the condition that R+W > N. If some of the first N replicas are not available, say due to network failure, the read and write operations are performed on the first N healthy nodes. eg if node A is down then node B can be included in its place for the quorum. In this case, B would keep track of data it received on behalf of A and when A comes online, B would hand over this data to A. This way a sloppy quorum is achieved. It is possible that B itself becomes unavailable before it can return the data to A. In this case, anti-entropy protocols are used to keep replicas synchronized. In Dynamo, each node maintains a Merkle tree for each key range it hosts. Nodes A and B exchange the roots of Merkle trees corresponding to set of keys they both host. Merkle tree is a hash tree whose leaves are hash values of individual keys and parents are hash values of children. This allows branches to be checked for replication without having to traverse the entire tree. A branch is traversed only when the hash values at the top of the branch differ. This way the amount of data to be transferred for synchronization is minimized. The nodes in a cluster communicate as per a gossip-based protocol in which each node contacts a random peer and then the two nodes reconcile their persisted membership history. This ensures an eventually consistent membership view. Apart from this, some nodes are marked as seed nodes which are known to all nodes including the ones that join later. Seed nodes ensure that logical partitions are not created within the network even when new nodes are added. Since consistent hashing is used, the overhead of key reallocation when adding a new node is quite low. #### Routing There are 2 modes of routing requests in Dynamo. In the first mode, servers route the request. The node fulfilling the request is called coordinator. If it is a read request, any node can act as the coordinator. For a write request, the coordinator is one of the nodes from the key’s current preference list. So if the write request reaches a node which is not in the preference list, it routes the request to one of the nodes in the preference list. 
An alternate approach would be where the client downloads the current membership state from any Dynamo node and determine which node to send the write request to. This approach saves an extra hop within the server cluster but it assumes the membership state to be fresh. #### Optimizations Apart from the architecture described above, Dynamo uses optimizations like read-repair where, during quorum, if a node returns a stale response for a read query, it is updated with the latest version of data. Similarly, since writes follow reads, the coordinator for read operation is the node that replies fastest to the previous read operation. This increases the chances of having read you write consistency. To further reduce the latency, each node maintains an object buffer in its main memory where write operations are stored and written to disk by a separate thread. The read operations also first refer the in-memory buffer before checking the disks. There is an added risk of the node crashing before writing the objects from the buffer to the disk. To mitigate this, one of the N replicas performs a durable write — that is, the data is written to the disk. Since the quorum requires only W responses, latency due to one node does not affect the performance. Amazon also experimented with different partitioning schemes to ensure uniform load distribution and adopted the scheme where hash space is divided into Q equally sized partitions and placement of partition is decoupled from the partitioning scheme. #### Lessons Learnt Although Dynamo is primarily designed as a write intensive data store, N, R and W provides ample control to modify its behavior for other scenarios as well. For example, setting R = 1 and W = N makes it a high performance read engine. Services maintaining product catalog and promotional items can use this mode. Similarly setting W = 1 means a write request is never rejected as long as at least one server is up though this increases the risk of inconsistency. Given that Dynamo allows the clients to override the conflict resolution methods, it becomes a general solution for many more scenarios than it was originally intended for. One limitation is the small size of data for which it is designed. The choice makes sense in the context of Amazon but it would be interesting to see how storing larger values affects its performance. The response time would obviously increase as more data needs to be transferred and in-memory buffers would be able to store lesser data. But using caching and larger in memory buffers, the response time may be brought down to the limit that Dynamo can be used with somewhat larger data objects as well. Dynamo scales well for a few hundred of nodes but it will not scale equally well for tens of thousands of nodes because of the large overhead of maintaining and distributing the routing table whose size increases with the number of nodes. Another problem that Amazon did not have to face was a high conflict resolution rate. They observed that around 99.94% requests saw exactly one version. Had this number been higher, the latency would have been more. All in all, Dynamo is not a universal solution for a distributed key-value store. But it solves one problem and it solves it very well. |
[link]
#### What are BLOBs Binary Large OBjects (BLOBs) refer to immutable binary data. A BLOB is created once, read many times, never modified, and sometimes deleted. In Facebook’s context, this would include photos, videos, documents, traces, heap dumps, and source code. Facebook was originally using Haystack as its BLOB storage system. But it is designed for IO bound workloads and not storage efficiency. To take a more specific example, when we post a new picture, a BLOB is created. Its data is written to the BLOB storage system and a handle corresponding to this data is stored in a Graph Store for quick retrieval later on. Now when someone clicks on that picture, a read request is fired and the handle is fetched from the graph store. Using this handle, the data (in this case the picture) can be retrieved. A CDN is in place to cache frequently accessed BLOBs. #### BLOB Storage Design BLOBs are aggregated into logical volumes. These volumes are initially unlocked and support reads, writes, and deletes. Once a volume is full, it gets locked and no more creates are allowed. Each volume consists of three files: a data file to hold BLOBs and their metadata, an index file which is a snapshot of the in-memory index of the storage machine, and a journal file to track which BLOBs have been deleted. For fault tolerance and performance, each logical volume maps into multiple physical volumes with each volume living entirely on one Haystack host. Haystack has fault tolerance to disk, host, rack, and data center failure, at an effective replication-factor of 3.6. All seems good so far but the problem is with the large and ever increasing storage footprint of the BLOBs. Facebook made an interesting observation in this regard — there is a large and increasing number of BLOBs with low request rates. So for these BLOBs, triple replication is an overkill. If only there was a separate BLOB storage system for these BLOBs. This is where f4 comes into the picture. #### Warm vs Hot Facebook benchmarked their existing implementation with a 2 week trace and observed that a kind of temperature zone exists where some BLOBs have a very high request rate (these are called the hot BLOBs) and some have a low request rate (these are called the warm BLOBs). f4 is proposed as an implementation for warm BLOB storage while Haystack would be used for hot BLOBs. Another observation was that age and temperature of BLOB are correlated. New BLOBs are queried and deleted at a much higher rate. Lastly, they observed that warm content is large and growing which furthers the need for f4. #### Design goals f4 was designed to reduce the effective-replication-factor without compromising on reliability and performance. #### Overview f4 is comprised of a number of cells, each cell residing completely in one data center. A cell is responsible for reliably storing a set of locked volumes and uses Reed-Solomon coding to store these volumes with a lower storage overhead. The downside is an increased rebuild and recovery time under failure and lower maximum read throughput. Since f4 cells support only read and delete operations, only data and index files are needed and both are read-only. For tracking deletes, all BLOBs are encrypted with keys stored in an external database. Deleting the key for a BLOB in f4 logically deletes it. The index files use triple replication within a cell and the data file is encoded and stored via a Reed-Solomon (n,k) code. The file is logically partitioned into contiguous sequences of n blocks, each of size b. 
For each such sequence of n blocks, k parity blocks are generated, making the size of the stripe n + k blocks. For a given block in a stripe, the other blocks in the stripe are called its companion blocks. BLOBs can be read directly from their data block. If a block is unavailable it can be recovered by decoding any n of its companion and parity blocks. The block-size for encoding is kept quite large (around 1 Gb) as it decreases the number of BLOBs that span multiple blocks, thereby reducing I/O operations to read and it reduces the amount of per-block metadata that f4 needs to maintain. But a very large size would mean a larger overhead for rebuilding the blocks. #### Components 1. Name Node: This node maintains the mapping between data blocks and parity blocks and the storage nodes that hold the actual blocks. 2. Storage Nodes: These nodes expose two APIs - an Index API that provides existence and location information for volumes, and a File API that provides access to data. 3. Backoff Nodes: These nodes are storage-less, CPU-heavy nodes that handle the online reconstruction of request BLOBs (not the entire block). 4. Rebuilder Nodes: They are similar to Backoff nodes, but they handle the offline reconstruction of entire blocks. 5. Coordinator Nodes: These nodes are storage-less, CPU-heavy nodes that handle maintenance task, such as scheduling block rebuilding and ensuring an optimal data layout. A separate geo-replicated XOR coding scheme is also used to tolerate data center or geographic region failure. In this, each volume/stripe/block is paired with a buddy volume/stripe/block in a different geographic region. Then an XOR of the buddy blocks (called as XOR Block) is stored in a third region. The effective replication factor turns out to be 2.1 #### Are the results good? Results are amazing! With a corpus of 65PB, f4 will save Facebook over 39 PB and 87 PB of storage at effective-replication-factor of 2.8 and 2.1 respectively. All this comes with low latency and fault tolerance. #### Lessons learnt The existence of temperature zones is not unique to Facebook. Such zones would be present in all kind of data at a large-enough scale with the line seperating these zones depending on the request and delete rates. Since older data is likely to have a different query rate than newer data, efficient migration from hot to warm storage before putting to cold storage needs to be explored more. This also suggests that one single data management system can not handle data of all ages as the constraints on the data start to change. In Facebook’s context, any data written to Haystack was constrained by at-most-one-writer requirement while data on f4 was free of this constraint. So 2 different data models, each optimized for its own use case could be used. Till now we have seen data management systems based on nature of data (relational or NoSQL), or based on nature of queries (read vs write-oriented). But this case study opens the door for a new kind of system which migrates data from one data model to another based on temperature zones. This is what Facebook ended up creating for this particular scenario. This case study also reinforces the superiority of modular architecture. Facebook has a clear need of separate data storage mechanism but what made it possible was its modular architecture which allowed for easy migration of data from Haystack to f4. Apparently Facebook’s overall architecture is designed to enable warm storage. 
For example, the caching stack would cache the results related to the most popular content which is expected to be newer content. Haystack can handle most of the reads and deletes thereby reducing the load on f4 and so on. I would be looking out for similar case studies from other data giants as well. Probably they are tackling this problem from an entirely different perspective. Whatever their approaches may be, it would be interesting to compare them with f4. |
[link]
It presents an informal discussion about do’s and dont’s of research — more specifically experimental study of algorithms. The author presents 10 important principles backed by examples from his long experience. I am listing points from the reading. #### 1. Solve a problem worth solving. Take some time to find a good problem. Decide what questions you want to answer before you actually start finding an answer. Finding a new solution would take up time and resources. So think before you compute. Do exploratory research and experimentation. But do not get into the endless loop of exploration. #### 2. Tie your paper to the literature. Know the prior art for your problem. Find what are the existing solutions. Think if another solution is actually needed and if yes, what needs to be improved. Tell how your work relates to existing literature. This will help you and your readers. It will make your experiments newsworthy. #### 3. Use instance testbeds that can support general conclusions. When doing research, especially experimental research, you are highly likely to use data — for making some hypothesis, for validating results and so on. More general your data set is, more general or applicable your results would be. Applying conclusions drawn from a specific data set on a general problem would be disastrous. #### 4. Document everything. Document your code. Document your data sets. Document your experimental setup, your results, your mistakes, your conclusions. Document everything that some other person would need to replicate your experiments. This will help your future-self save some time and he would be grateful to you. Read #6. When using randomly generated data sets, use the same data set for benchmarking all approaches. It would reduce time and variance. Use sampling and bootstrapping. #### 5. Use reasonably efficient implementations. Seems obvious. But efficiency requires effort and we may be tempted especially when execution time in not a metric of interest. More efficient code means you can perform more experiments and on larger data sets. But do not over-optimise. Remember — Premature optimization is the root of all evil. #### 6. Ensure Reproducibility. Provide information on how to reproduce your experiments. Read #4. See gitxiv. Share your code and data set with others. If the data set is generated using some algorithm, share its details. Use reproducible standards of comparison. Do not say “results after running the algorithm for 1 hour”. Machines are evolving every day and hence doing more operations than before for the same amount of time. #### 7. Ensure Compatibility. Comparability goes beyond reproducibility and documentation. Benchmark the machine speed and the other experimental setup that you use so that other researchers can accurately compare their result with yours. Take a full backup of your code and data. #### 8. Report the full story. Report the complete story, including the parts you do not like. Do not hide anomalous results. Instead, try to explain them. It could lead to more insights to the problem and to the solution. Report the full running time, even if time is not a metric of interest from the problem’s point of view. Many times readers want to see how your solution performs on the scale of time. #### 9. Draw well-justified conclusions and look for explanations. Summarize trends, infer implications and provide conclusions. If there are no conclusions and inferences, then you fail the newsworthiness test. 
Probably you picked the wrong question or your solution is incomplete. Justify your conclusions with data. If needed, use techniques like profiling. #### 10. Present Your Data in Informative Ways. Your data representation should highlight what you want the reader to infer from the data. This can include things like appropriate ordering of columns in a table or choice of variables to be shown along the axis. If you want the reader to compare percentage rise in time, provide a column for that. Do not make the reader do the arithmetic. Take care of not overwhelming your readers with data alone. Put additional data in appendix. |
[link]
The structure of the dictionary is quite simple. Let us say that we have to maintain a set $S$, of size $n$, and keys are taken from a universe $U$. The dictionary uses two hash tables, $T$ and $T’$, each consisting of $r$ slots (where $r$ is slightly bigger than $n$). We have two hash functions $h$, $h’$: $U → [r]$. Every key $x \in S$ will either be stored in cell $h(x)$ of T or in cell $h’(x)$ of $T’$. This means lookup takes $O(1)$ time as only 2 slots needs to be checked. Deletion and updation also take $O(1)$ time as they first lookup the key and then delete or update the corresponding value. Insertion is more interesting and involved than other operations because it uses, what the paper calls, “the cuckoo approach” where a key kicks other keys to it finds a nest for itself. To insert $x$, we check cell $h(x)$ of $T$. If it is free then we are done else we update the value of this cell which would make the previous key “nestless”. Then we insert this newly displaced key in $T’$ in the same manner iteratively till all keys find a nest. The number of iterations is bound by a constant, MaxLoop, to avoid closed loops. After MaxLoop number of iterations, a rehash is performed. A rehash will also be performed once $r^2$ insertions have been performed since the last rehash. This is all one needs to know to implement their own cuckoo-hash. Simple and straightforward. ### Digging Deeper The approach may, at first, sound like shifting the overhead of lookup to insertion, but there is more to it. We first need to understand what is a universal family. A family of functions, where each function is defined from $U → R$, is (c, k)-universal if, for any $k$ distinct $x$’s $\in U$ , and any $k y$’s $\in R$, probability that any randomly chosen function from this family maps these $x$’s to $y$’s ≤ c/|R|^k . In effect, it puts a limit on the probability of collisions when randomly picking hashing functions from this family. The paper cites a paper by Alan Siegel to explain how to construct a hash family such that probability of collisions with a set of $r^2$ keys is $O(1/n^2)$. The paper also uses a clever trick to determine when a rehash would be needed before reaching MaxLoop iterations. If the insertion operation returns to a previously visited cell, it checks if a close loop is going to be formed. Suppose we are inserting key $x_1$ and let $x_1$, $x_2$, …, $x_p$ be the sequence of nestless keys. One of these keys, say xi, becomes nestless for the second time. So it will be put back to its original position. This will cause all the the previous keys in the sequence to be moved back to their original position and $x_1$ would be nestless again. So it will be put to the other table this time. This would cause some keys to becomes nestless in the second table. If one of those keys say $x_k$ moves to a previously visited position, we are guaranteed to have a closed loop. In that case, we do not wait for MaxLoop iterations to be reached and perform rehash. Now we can work out the probability of insertion loop running for at least t iterations. We already know that t can not exceed MaxLoop. Adding up the probabilities for all possible values of t and accounting for the cost of rehash and taking n to be very large, we get the amortized insertion time to be $O(1)$. For a detailed understanding of this part, refer the original paper. ### My thoughts The paper has very strong mathematical and experimental backing. It is full of references related to both theoretical work and experimental evaluation. 
### My thoughts

The paper has very strong mathematical and experimental backing. It is full of references related to both theoretical work and experimental evaluation. The analysis of insertion is clean and meticulous, and all the maths works out beautifully. I particularly liked the elaborate and well-documented experiments. The authors have experimented with a variety of hashing algorithms along with variants of cuckoo hashing. They have highlighted the instances where their implementation differs from the reference and have provided justifications for the same. They have also studied the role of the cache in all the hashing techniques in their experiments, and they tested against the dictionary tests of the DIMACS implementation challenge. In all experiments, Linear Probing performed better than all other approaches, with Cuckoo Hashing lagging just 20–30% behind. Even then, the paper presents a strong case for Cuckoo Hashing.

A few things could still be explored further, such as how to choose a good family of hash functions that meets the required constraints. Secondly, the cache miss rate tends to increase the fastest for cuckoo hashing as the load factor increases. Also, it is not possible to have a load factor of more than 50% with cuckoo hashing, so in some cases the dictionary will have to be resized; the experiments focus on situations where the size does not vary greatly. Finally, insertion is expected to be $O(1)$ only for very large values of $n$, and there is no estimate of how big $n$ needs to be. This could be pursued further. |
[link]
A never-ending learning problem has 2 components: a set of learning tasks and a set of coupling constraints. A learning task in this paradigm is the same as a learning task in any other paradigm: improving the system's performance, as measured by some metric P, over a task T given some experience E. Each coupling constraint can be thought of as a function, defined over 2 or more learning tasks, which specifies the degree to which the constraint is satisfied. Given such a learning problem, a never-ending learning agent A produces a sequence of solutions to the individual learning tasks such that, over time, both the quality of the individual learning functions and the degree to which each coupling constraint is satisfied increase.

To take a simplified example, consider the problem of Google classifying emails in our inbox. Say we have 2 learning tasks going on: one that learns whether to mark a mail as spam and another that learns whether to mark a mail as important. An obvious constraint here is that any mail marked as spam must not be marked important (though not the other way round). Think of coupling constraints as the set of rules that you do not want to see violated. What makes them important and powerful is that the learning agent can improve its learning of one function by successfully learning other functions.

The paper also presents a case study of a program called the Never-Ending Language Learner (NELL). NELL implements some of the features of this new paradigm and has been in action 24x7 since January 2010. Every day it extracts beliefs from the web and updates its knowledge base by removing incorrect beliefs and adding the ones which, it believes, are correct. NELL started with an initial ontology of categories, some labelled examples, and relations, and it now has over 80 million interconnected beliefs in its Knowledge Base (KB). It is performing over 2500 learning tasks, which include:

- Category classification: e.g. Apple is a food and a company.
- Relationship classification: e.g. (Apple, Company, "IsA").
- Entity resolution: e.g. "NYC" and "Big Apple" can refer to the same entity.
- Inferring rules among belief triples.

For all these tasks, it uses a variety of classification methods. The coupling constraints include:

- Multi-view co-training coupling: different classifiers should give the same output on the same input.
- Subset/superset coupling: if category A is a subset of category B, then each phrase belonging to A must also belong to B.
- Multi-label mutual-exclusion coupling: when 2 categories are declared to be mutually exclusive, no noun phrase should lie in both categories.
- Coupling relations to their argument types: in a relation like LivesIn(A, B), A should be a person and B should be a place.
- Horn clause coupling.

Each reading and inference module (based on the above classifiers and constraints) sends proposed KB updates to a Knowledge Integrator (KI), which makes the final decision on all of them. Once the updates are made, all the modules are re-trained on the updated KB. Due to the sheer size of the KB, it is not possible to consider every candidate belief, so the KI considers only high-confidence candidate beliefs and re-assesses confidence using a limited subgraph of consistency constraints and beliefs. This means many iterations are required for the effect of constraints to propagate throughout the graph; the paper mentions a more effective algorithm to achieve this propagation.
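The idea of a coupling constraint as a degree-of-satisfaction function over 2 or more tasks can be made concrete with a small sketch. This is my own illustration of the spam/important example above, not code or a metric from the paper; the function names, toy classifiers, and the particular satisfaction measure are all assumptions.

```python
from typing import Callable, Iterable

# Two independent learning tasks: each maps a mail to a probability.
SpamClassifier = Callable[[str], float]       # P(mail is spam)
ImportantClassifier = Callable[[str], float]  # P(mail is important)

def mutual_exclusion_satisfaction(is_spam: SpamClassifier,
                                  is_important: ImportantClassifier,
                                  mails: Iterable[str]) -> float:
    """Degree to which 'spam implies not important' holds, in [0, 1].

    A coupling constraint is just a function over the outputs of two or more
    learners; a never-ending learner tries to push this score up over time
    while also improving each individual task.
    """
    mails = list(mails)
    # Penalise mails that both classifiers fire on; 1.0 means no violations.
    violations = sum(is_spam(m) * is_important(m) for m in mails)
    return 1.0 - violations / max(len(mails), 1)

# Toy usage with stand-in classifiers.
spam = lambda m: 1.0 if "lottery" in m else 0.0
important = lambda m: 1.0 if "boss" in m else 0.1
print(mutual_exclusion_satisfaction(
    spam, important, ["you won a lottery", "meeting with boss", "lunch?"]))
```

A real system like NELL tracks many such functions at once (mutual exclusion, subset/superset, argument typing, and so on) and uses disagreement on them as a training signal for the individual learners.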
To add new learning tasks and extend the ontology, NELL extracts sentences mentioning different categories, builds a context-by-context co-occurrence matrix, and then clusters the related contexts together. Each cluster corresponds to a candidate relation. These candidates go through a trained classifier and manual filtering before they are added to the ontology.

An empirical analysis of the system's performance shows that:

- The system is improving its reading competence over time, as expected and desired.
- Its KB is growing, but at a decreasing rate, as it gets harder to extract new knowledge from less frequently mentioned beliefs.

The paper mentions some areas for improvement:

- Adding a self-reflection capability to NELL so that it can detect where it is doing well and where it is doing poorly, and allocate its resources more intelligently. This feature is a part of the paradigm itself.
- Broadening the scope of NELL by using other data sources as well; e.g. the Never-Ending Image Learner (NEIL) uses image data.
- Merging other ontologies, like DBPedia, to boost its own ontology.
- Using "micro-reading" methods to let NELL perform deep semantic analysis.

The never-ending learning paradigm raises 2 fundamental questions:

1. Under what conditions is an increasingly consistent learning agent also an increasingly correct agent? This is important because an autonomous agent can perceive consistency but not correctness.
2. Under what conditions is convergence guaranteed, in principle and in practice? The architecture may not have sufficient self-modification operations to ensure never-ending learning, or these operations may not be practical due to limits of computation and/or training experience.

### My thoughts

What makes never-ending learning different from, and in some cases more powerful than, conventional paradigms are the concepts of coupling constraints and never-ending learning. As a learning model, it seems closer to the human learning model: we try to learn and take actions by relating different scenarios, and our actions and decisions are constrained both by variables and by our other actions and decisions. Coupling constraints seem to capture this requirement. Then there are scenarios where conventional machine learning approaches would fail. Learning is not always about throwing in more data or introducing new variables; if the domain is evolving rapidly, there will always be newer data coming in and newer variables that are not accounted for. These are the kinds of scenarios where this paradigm can fill the gap.

Another aspect is that all this work builds on top of existing work. All the algorithms used for the various classification tasks are existing ones; the paradigm does not suggest a new algorithm for any of the individual learning problems. Instead, it provides a mechanism where success in learning one function helps in learning others.

The paper puts forward a strong case for this new paradigm: there is a case study to evaluate the model practically, the authors start out with a small labelled dataset, the reported metrics behave as desired, and the web is a natural fit for a never-ending learning algorithm because it is a never-ending, growing domain. One criticism is that the paper does not mention how resource-intensive NELL is, other than describing the dataset; even time-based metrics are missing. Not that I expect such a system to be frugal, but I would nonetheless be interested in knowing about their computing infrastructure and time-based metrics.
There is still a lot to be explored about the effectiveness of this model. Two prime questions are already listed above. Beyond that, the model needs a firmer mathematical footing, and it needs to be put to the test in other domains; NEIL is one such extension. I would be interested to see how this approach plays out in other domains and what kinds of ontologies are obtained, especially in the case of social networks, which are both data-rich and constantly evolving. |
[link]
## Introduction

* [Link to Paper](http://arxiv.org/pdf/1412.6071v4.pdf)
* Spatial pooling layers are building blocks for Convolutional Neural Networks (CNNs).
* Input to the pooling operation is an $N_{in}$ x $N_{in}$ matrix and the output is a smaller $N_{out}$ x $N_{out}$ matrix.
* The pooling operation divides the $N_{in}$ x $N_{in}$ square into $N^2_{out}$ pooling regions $P_{i, j}$.
* $P_{i, j} \subset \{1, 2, \ldots, N_{in}\}^2$ $\forall$ $(i, j) \in \{1, \ldots, N_{out}\}^2$

## MP2

* Refers to the 2x2 max-pooling layer.
* Popular choice for the max-pooling operation.

### Advantages of MP2

* Fast.
* Quickly reduces the size of the hidden layer.
* Encodes a degree of invariance with respect to translations and elastic distortions.

### Issues with MP2

* Disjoint nature of pooling regions.
* Since the spatial size decreases rapidly, stacks of back-to-back convolutional layers are needed to build deep networks.

## FMP

* Reduces the spatial size of the image by a factor of *α*, where *α ∈ (1, 2)*.
* Introduces randomness in terms of the choice of pooling regions.
* Pooling regions can be chosen in a *random* or *pseudorandom* manner.
* Pooling regions can be *disjoint* or *overlapping*.

## Generating Pooling Regions

* Let $a_i$ and $b_i$ be 2 increasing sequences of integers, starting at 1 and ending at $N_{in}$.
* Increments are either 1 or 2.
* For *disjoint* regions, $P_{i, j} = [a_{i-1}, a_i - 1] \times [b_{j-1}, b_j - 1]$.
* For *overlapping* regions, $P_{i, j} = [a_{i-1}, a_i] \times [b_{j-1}, b_j]$.
* Pooling regions can be generated *randomly* by choosing the increment randomly at each step.
* To generate pooling regions in a *pseudorandom* manner, choose $a_i = \lceil \alpha (i + u) \rceil$, where $\alpha \in (1, 2)$ and some $u \in (0, 1)$ (see the sketch at the end of this summary).
* Each FMP layer uses a different pair of sequences.
* An FMP network can be thought of as an ensemble of similar networks, with each different pooling-region configuration defining a different member of the ensemble.

## Observations

* *Random* FMP is good on its own but may underfit when combined with dropout or training data augmentation.
* The *pseudorandom* approach generates more stable pooling regions.
* *Overlapping* FMP performs better than *disjoint* FMP.

## Weakness

* No justification is provided for the observations mentioned above.
* It remains to be seen how performance is affected if the pooling layers in architectures like GoogLeNet are replaced with FMP.
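As a quick illustration of the region-generation step above, here is a minimal Python sketch (my own, not code from the paper) that builds the random and pseudorandom index sequences and the 1-D intervals whose Cartesian products give the disjoint and overlapping pooling regions. The clamping of the last index to $N_{in}$ is a simplification on my part.

```python
import math
import random

def random_increments(n_in, n_out):
    """Random FMP: n_out increments of 1 or 2 summing to n_in - 1,
    so the sequence runs 1 = a_0 < a_1 < ... < a_{n_out} = n_in.
    Requires n_out <= n_in - 1 <= 2 * n_out."""
    n_twos = n_in - 1 - n_out
    incs = [2] * n_twos + [1] * (n_out - n_twos)
    random.shuffle(incs)
    return incs

def pseudorandom_sequence(n_in, n_out, alpha, u):
    """Pseudorandom FMP: a_i = ceil(alpha * (i + u)), clamped to n_in."""
    return [1] + [min(n_in, math.ceil(alpha * (i + u))) for i in range(1, n_out + 1)]

def regions_1d(seq, overlapping=False):
    """1-D intervals; the 2-D region P_{i,j} is the product of the i-th
    row interval and the j-th column interval."""
    end = 0 if overlapping else -1
    return [(seq[i - 1], seq[i] + end) for i in range(1, len(seq))]

n_in, n_out = 25, 18                       # alpha = 25/18, which lies in (1, 2)
alpha, u = n_in / n_out, random.random()

seq = pseudorandom_sequence(n_in, n_out, alpha, u)
print(regions_1d(seq, overlapping=False))  # disjoint: [a_{i-1}, a_i - 1]
print(regions_1d(seq, overlapping=True))   # overlapping: [a_{i-1}, a_i]

rand_seq = [1]
for inc in random_increments(n_in, n_out):
    rand_seq.append(rand_seq[-1] + inc)
print(regions_1d(rand_seq))                # random variant
```

Running the script with different values of $u$ (or reshuffled increments) gives different pooling-region layouts over the same input size, which is exactly the ensemble-like behaviour noted above. |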