Welcome to ShortScience.org! |
[link]
Sinha et al. introduce a variant of adversarial training based on distributional robust optimization. I strongly recommend reading the paper for understanding the introduced theoretical framework. The authors also provide guarantees on the obtained adversarial loss – and show experimentally that this guarantee is a realistic indicator. The adversarial training variant itself follows the general strategy of training on adversarially perturbed training samples in a min-max framework. In each iteration, an attacker crafts an adversarial examples which the network is trained on. In a nutshell, their approach differs from previous ones (apart from the theoretical framework) in the used attacker. Specifically, their attacker optimizes $\arg\max_z l(\theta, z) - \gamma \|z – z^t\|_p^2$ where $z^t$ is a training sample chosen randomly during training. On a side note, I also recommend reading the reviews of this paper: https://openreview.net/forum?id=Hk6kPgZA- Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
## Very Short Summary The authors introduce a number of modifications to traditional hessian-free optimisation that makes the method work better for neural networks. The modifications are: * Use the Generalised Gauss Newton Matrix (GGN) rather than the Hessian. * Damp the GGN so that $G' = G + \lambda I$ and adjust $\lambda$ using levenberg-marquardt heuristic. * Use an efficient recursion to calculate the GGN. * Initialise each round of conjugated gradients with the final vector of the previous iteration. * A new simpler termination criterion for CG. Terminate CG when the relative decrease in the objective falls below some threshold. * Back-tracking of the CG solution. ie you store intermediate solutions to CG and only update if the new CG solution actually decreases the over all problem objective. ## Less Short Summary ### Hessian Free Optimisation in General Hessian free optimisation is used when one wishes to optimise some objective $f(\theta)$ using second order methods but inversion or even computation of the Hessian is intractable or infeasible. The method is an iterative method and at each iteration, we take a second order approximation to the objective. i.e at iterantion n, we take a second order taylor expansion of $f$ to get: $M^n(\theta) = f(\theta^n) + \nabla_{\theta}^Tf(\theta^n)(\theta - \theta^n) + (\theta - \theta^n)^TH(\theta - \theta^n) $ Where $H$ is the hessian matrix. If we minimise this second order approximation with respect to $\theta$ we would find that that $\theta^{n+1} = H^{-1}(-\nabla_{\theta}^Tf(\theta^n))$. However, inverting $H$ is usually not possible for even moderately sized neural networks. There does however exist an efficient algorithm for calculating hessian vector products $Hv$ for any $v$. The insight of hessian-free optimisation is that one can solve linear problems of the form $Hx = v$ using only hessian vector products via the linear conjugated gradients algorithm. You therefore avoid the need to ever actually compute either the Hessian or its inverse. To run vanilla hessian free all you need to do at each iteration is: 1) Calculate the gradient vector using standard backprop. 2) Calculate $H\theta$ product using an efficient recursion. 3) calculate the next update $\theta^{n+1} = ConjugatedGradients(H, -\nabla_{\theta}^Tf(\theta^n))$ The main contribution of this paper is to take the above algorithm and make the changes outlined in the very short summary. ## Take aways Hessian-Free optimisation was perhaps the best method at the time of publication. Recently it seems that first order methods using per-parameter learning rates like ADAM or even learning-to-learn can outperform Hessian-Free. This is primarily because of the increased cost per iteration of Hessian Free. However it still seems that using curvature information if its available is beneficial though expensive. More resent second order curvature appoximations like Kroeniker Factored Approximate Curvature (KFAC) and Kroeniker Factored Recursive Approximation (KFRA) are cheaper ways to achieve the same benefit. |
[link]
Huang et al. propose a variant of adversarial training called “learning with a strong adversary”. In spirit the idea is also similar to related work [1]. In particular, the authors consider the min-max objective $\min_g \sum_i \max_{\|r^{(i)}\|\leq c} l(g(x_i + r^{(i)}), y_i)$ where $g$ ranges over expressible functions and $(x_i, y_i)$ is a training sample. In the remainder of the paper, Huang et al. Address the problem of efficiently computing $r^{(i)}$ – i.e. a strong adversarial example based on the current state of the network – and subsequently updating the weights of the network by computing the gradient of the augmented loss. Details can be found in the paper. [1] T. Miyato, S. Maeda, M. Koyama, K. Nakae, S. Ishii. Distributional Smoothing by Virtual Adversarial Training. ArXiv:1507.00677, 2015. Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
This paper presents a model that can dynamically split computation across coarse, low-capacity sub-networks and fine, high-capacity sub-networks. The coarse model processes the entire input data and is typically shallow while the fine model focuses on a few important regions of the input and is deeper. For images as input, this is a hard attention mechanism that can be trained with stochastic gradient descent and doesn't require a task-specific attention policy trained by reinforcement learning. Key ideas: - A deep network h can be decomposed into bottom layers f and top layers g such that $h(x) = g(f(x))$. Further, f consists of two alternate sub-networks $f\_c$ and $f\_f$. $f\_c$ is a low-capacity sub-network while $f\_f$ is a high-capacity sub-network. - g should be able to use representations from $f\_c$ and $f\_f$ dynamically. $f\_c$ processes the entire input while $f\_f$ only a few important regions of the input. - The coarse model processes the entire input and the norm of the gradient of the entropy with respect to the coarse vector at each spatial region is computed which is a measure of saliency. The use of the entropy gradient as a saliency measure encourages selecting input regions that could affect the uncertainty in the model’s predictions the most. - The top-k input regions with highest saliency values are processed by the fine model. The refined representation for input to the top layers consists of both coarse and fine vectors. During backpropagation, gradients are computed for the refined model, i.e. propagating gradients at each position into either the coarse or fine features, depending on which was used. - To make sure $f\_c$ and $f\_f$ representations are interchangeable and input to the top layers has smooth transitions, an additional objective term minimizes the squared distance between coarse and fine representations and this additional term is used only to optimize the coarse layers, not the fine layers. - Experiments on cluttered MNIST, SVHN and comparison with RAM, DRAW and study with various values of number of patches for fine processing. ## Strengths - Neat, general way to split computation based on importance of input; a hard-attention mechanism that can be trained with SGD, unlike RAM. - Entropy gradient as a measure of saliency is an interesting idea, and it doesn't need labels i.e. can be used at test time. |
[link]
The [paper](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) presents Twitter's logging infrastructure, how it evolved from application specific logging to a unified logging infrastructure and how session-sequences are used as a common case optimization for a large class of queries. ## Messaging Infrastructure Twitter uses **Scribe** as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Scribe itself uses **Zookeeper** to discover the hostname of the aggregator. Each aggregator registers itself with Zookeeper. The Scribe daemon consults Zookeeper to find a live aggregator to which it can send the data. Colocated with the aggregators is the staging Hadoop cluster which merges the per-category stream from all the server daemons and writes the compressed results to HDFS. These logs are then moved into main Hadoop data warehouse and are deposited in per-category, per-hour directory (eg /logs/category/YYYY/MM/DD/HH). Within each directory, the messages are bundled in a small number of large files and are partially ordered by time. Twitter uses **Thrift** as its data serialization framework, as it supports nested structures, and was already being used elsewhere within Twitter. A system called **Elephant Bird** is used to generate Hadoop record readers and writers for arbitrary thrift messages. Production jobs are written in **Pig(Latin)** and scheduled using **Oink**. ## Application Specific Logging Initially, all applications defined their own custom formats for logging messages. While it made it easy to develop application logging, it had many downsides as well. * Inconsistent naming conventions: eg uid vs userId vs user_Id * Inconsistent semantics associated with each category name causing resource discovery problem. * Inconsistent format of log messages. All these issues make it difficult to reconstruct user session activity. ## Client Events This is an effort within Twitter to develop a unified logging framework to get rid of all the issues discussed previously. A hierarchical, 6-level schema is imposed on all the events (as described in the table below). | Component | Description | Example | |-----------|------------------------------------|----------------------------------------------| | client | client application | web, iPhone, android | | page | page or functional grouping | home, profile, who_to_follow | | section | tab or stream on a page | home, mentions, retweets, searches, suggestions | | component | component object or objects | search_box, tweet | | element | UI element within the component | button, avatar | | action | actual user or application action | impression, click, hover | **Table 1: Hierarchical decomposition of client event names.** For example, the following event, `web:home:mentions:stream:avatar:profile_click` is logged whenever there is an image profile click on the avatar of a tweet in the mentions timeline for a user on twitter.com (read from right to left). The alternate design was a tree based model for logging client events. That model allowed for arbitrarily deep event namespace with as fine-grained logging as required. But the client events model was chosen to make the top level aggregate queries easier. A client event is a Thrift structure that contains the components given in the table below. | Field | Description | |-----------------|---------------------------------| | event initiator | {client, server} × {user, app} | | event_name | event name | | user_id | user id | | session_id | session id | | ip | user’s IP address | | timestamp | timestamp | | event_details | event details | **Table 2: Definition of a client event.** The logging infrastructure is unified in two senses: * All log messages share a common format with clear semantics. * All log messages are stored in a single place. ## Session Sequences A session sequence is a sequence of symbols *S = {s<sub>0</sub>, s<sub>1</sub>, s<sub>2</sub>...s<sub>n</sub>}* such that each symbol is drawn from a finite alphabet *Σ*. A bijective mapping is defined between Σ and universe of event names. Each symbol in Σ is represented by a valid Unicode point (frequent events are assigned shorter code prints) and each session sequence becomes a valid Unicode string. Once all logs have been imported to the main database, a histogram of event counts is created and is used to map event names to Unicode code points. The counts and samples of each event type are stored in a known location in HDFS. Session sequences are reconstructed from the raw client event logs via a *group-by* on *user_id* and *session_id*. Session sequences are materialized as it is difficult to work with raw client event logs for following reasons: * A lot of brute force scans. * Large group-by operations needed to reconstruct user session. #### Alternate Designs Considered * Reorganize complete Thrift messages by reconstructing user sessions - This solves the second problem but not the first. * Use a columnar storage format - This addresses the first issue but it just reduces the time taken by mappers and not the number of mappers itself. The materialized session sequences are much smaller than raw client event logs (around 50 times smaller) and address both the issues. ## Client Event Catalog To enhance the accessibility of the client event logs, an automatically generated event data log is used along with a browsing interface to allow users to browse, search and access sample entries for the various client events. (These sample entries are the same entries that were mentioned in the previous section. The catalog is rebuilt every day and is always up to date. ## Applications Client Event Logs and Session Sequences are used in following applications: * Summary Statistics - Session sequences are used to compute various statistics about sessions. * Event Counting - Used to understand what feature of users take advantage of a particular feature. * Funnel Analytics - Used to focus on user attention in a multi-step process like signup process. * User Modeling - Used to identify "interesting" user behavior. N-gram models (from NLP domain) can be extended to measure how important temporal signals are by modeling user behavior on the basis of last n actions. The paper also mentions the possibility of extracting "activity collocations" based on the notion of collocations. ## Possible Extensions Session sequences are limited in the sense that they capture only event name and exclude other details. The solution adopted by Twitter is to use a generic indexing infrastructure that integrates with Hadoop at the level of InputFormats. The indexes reside with the data making it easier to reindex the data. An alternative would have been to use **Trojan layouts** which members indexing in HDFS block header but this means that indexing would require the data to be rewritten. Another possible extension would be to leverage more analogies from the field of Natural Language Processing. This would include the use of automatic grammar induction techniques to learn hierarchical decomposition of user activity. Another area of exploration is around leveraging advanced visualization techniques for exploring sessions and mapping interesting behavioral patterns into distinct visual patterns that can be easily recognized. |