Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1567 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Mask R-CNN

He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

[link]
Mask RCNN takes off from where Faster RCNN left, with some augmentations aimed at bettering instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask* , *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool , the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool. #### Methodology Mask RCNN retains most of the architecture of Faster RCNN. It adds the a third branch for segmentation. The third branch takes the output from RoIAlign layer and predicts binary class masks for each class. ##### Major Changes and intutions **Mask prediction** Mask prediction segmentation predicts a binary mask for each RoI using fully convolution - and the stark difference being usage of *sigmoid* activation for predicting final mask instead of *softmax*, implies masks don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is used for class prediction and for calculating loss, the mask of predicted loss is used calculating Lmask. Also, they show that a single class agnostic mask prediction works almost as effective as separate mask for each class, thereby supporting their method of decoupling classification from segmentation **RoIAlign** RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Instead of quantization of the RoI boundaries or bin bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average). **Backbone architecture** Faster RCNN uses a VGG like structure for extracting features from image, weights of which were shared among RPN and region detection layers. Herein, authors experiment with 2 backbone architectures - ResNet based VGG like in FRCNN and ResNet based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16) based. FPN uses convolution feature maps from previous layers and recombining them to produce pyramid of feature maps to be used for prediction instead of single-scale feature layer (final output of conv layer before connecting to fc layers was used in Faster RCNN) **Training Objective** The training objective looks like this ![](https://i.imgur.com/snUq73Q.png) Lmask is the addition from Faster RCNN. The method to calculate was mentioned above #### Observation Mask RCNN performs significantly better than COCO instance segmentation winners *without any bells and whiskers*. Detailed results are available in the paper |

Training Deep and Recurrent Networks with Hessian-Free Optimization

Martens, James and Sutskever, Ilya

Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy

Keywords: dblp

Martens, James and Sutskever, Ilya

Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy

Keywords: dblp

[link]
## Very Short Summary The authors introduce a number of modifications to traditional hessian-free optimisation that makes the method work better for neural networks. The modifications are: * Use the Generalised Gauss Newton Matrix (GGN) rather than the Hessian. * Damp the GGN so that $G' = G + \lambda I$ and adjust $\lambda$ using levenberg-marquardt heuristic. * Use an efficient recursion to calculate the GGN. * Initialise each round of conjugated gradients with the final vector of the previous iteration. * A new simpler termination criterion for CG. Terminate CG when the relative decrease in the objective falls below some threshold. * Back-tracking of the CG solution. ie you store intermediate solutions to CG and only update if the new CG solution actually decreases the over all problem objective. ## Less Short Summary ### Hessian Free Optimisation in General Hessian free optimisation is used when one wishes to optimise some objective $f(\theta)$ using second order methods but inversion or even computation of the Hessian is intractable or infeasible. The method is an iterative method and at each iteration, we take a second order approximation to the objective. i.e at iterantion n, we take a second order taylor expansion of $f$ to get: $M^n(\theta) = f(\theta^n) + \nabla_{\theta}^Tf(\theta^n)(\theta - \theta^n) + (\theta - \theta^n)^TH(\theta - \theta^n) $ Where $H$ is the hessian matrix. If we minimise this second order approximation with respect to $\theta$ we would find that that $\theta^{n+1} = H^{-1}(-\nabla_{\theta}^Tf(\theta^n))$. However, inverting $H$ is usually not possible for even moderately sized neural networks. There does however exist an efficient algorithm for calculating hessian vector products $Hv$ for any $v$. The insight of hessian-free optimisation is that one can solve linear problems of the form $Hx = v$ using only hessian vector products via the linear conjugated gradients algorithm. You therefore avoid the need to ever actually compute either the Hessian or its inverse. To run vanilla hessian free all you need to do at each iteration is: 1) Calculate the gradient vector using standard backprop. 2) Calculate $H\theta$ product using an efficient recursion. 3) calculate the next update $\theta^{n+1} = ConjugatedGradients(H, -\nabla_{\theta}^Tf(\theta^n))$ The main contribution of this paper is to take the above algorithm and make the changes outlined in the very short summary. ## Take aways Hessian-Free optimisation was perhaps the best method at the time of publication. Recently it seems that first order methods using per-parameter learning rates like ADAM or even learning-to-learn can outperform Hessian-Free. This is primarily because of the increased cost per iteration of Hessian Free. However it still seems that using curvature information if its available is beneficial though expensive. More resent second order curvature appoximations like Kroeniker Factored Approximate Curvature (KFAC) and Kroeniker Factored Recursive Approximation (KFRA) are cheaper ways to achieve the same benefit. |

Learning Representations for Counterfactual Inference

Johansson, Fredrik D. and Shalit, Uri and Sontag, David

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Johansson, Fredrik D. and Shalit, Uri and Sontag, David

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

[link]
This paper presents a method to train a neural network to make predictions for *counterfactual* questions. In short, such questions are questions about what the result of an intervention would have been, had a different choice for the intervention been made (e.g. *Would this patient have lower blood sugar had she received a different medication?*). One approach to tackle this problem is to collect data of the form $(x_i, t_i, y_i^F)$ where $x_i$ describes a situation (e.g. a patient), $t_i$ describes the intervention made (in this paper $t_i$ is binary, e.g. $t_i = 1$ if a new treatment is used while $t_i = 0$ would correspond to using the current treatment) and $y_i^F$ is the factual outcome of the intervention $t_i$ for $x_i$. From this training data, a predictor $h(x,t)$ taking the pair $(x_i, t_i)$ as input and outputting a prediction for $y_i^F$ could be trained. From this predictor, one could imagine answering counterfactual questions by feeding $(x_i, 1-t_i)$ (i.e. a description of the same situation $x_i$ but with the opposite intervention $1-t_i$) to our predictor and comparing the prediction $h(x_i, 1-t_i)$ with $y_i^F$. This would give us an estimate of the change in the outcome, had a different intervention been made, thus providing an answer to our counterfactual question. The authors point out that this scenario is related to that of domain adaptation (more specifically to the special case of covariate shift) in which the input training distribution (here represented by inputs $(x_i,t_i)$) is different from the distribution of inputs that will be fed at test time to our predictor (corresponding to the inputs $(x_i, 1-t_i)$). If the choice of intervention $t_i$ is evenly spread and chosen independently from $x_i$, the distributions become the same. However, in observational studies, the choice of $t_i$ for some given $x_i$ is often not independent of $x_i$ and made according to some unknown policy. This is the situation of interest in this paper. Thus, the authors propose an approach inspired by the domain adaptation literature. Specifically, they propose to have the predictor $h(x,t)$ learn a representation of $x$ that is indiscriminate of the intervention $t$ (see Figure 2 for the proposed neural network architecture). Indeed, this is a notion that is [well established][1] in the domain adaptation literature and has been exploited previously using regularization terms based on [adversarial learning][2] and [maximum mean discrepancy][3]. In this paper, the authors used instead a regularization (noted in the paper as $disc(\Phi_{t=0},\Phi_ {t=1})$) based on the so-called discrepancy distance of [Mansour et al.][4], adapting its use to the case of a neural network. As an example, imagine that in our dataset, a new treatment ($t=1$) was much more frequently used than not ($t=0$) for men. Thus, for men, relatively insufficient evidence for counterfactual inference is expected to be found in our training dataset. Intuitively, we would thus want our predictor to not rely as much on that "feature" of patients when inferring the impact of the treatment. In addition to this term, the authors also propose incorporating an additional regularizer where the prediction $h(x_i,1-t_i)$ on counterfactual inputs is pushed to be as close as possible to the target $y_{j}^F$ of the observation $x_j$ that is closest to $x_i$ **and** actually had the counterfactual intervention $t_j = 1-t_i$. The paper first shows a bound relating the counterfactual generalization error to the discrepancy distance. Moreover, experiments simulating counterfactual inference tasks are presented, in which performance is measured by comparing the predicted treatment effects (as estimated by the difference between the observed effect $y_i^F$ for the observed treatment and the predicted effect $h(x_i, 1-t_i)$ for the opposite treatment) with the real effect (known here because the data is simulated). The paper shows that the proposed approach using neural networks outperforms several baselines on this task. **My two cents** The connection with domain adaptation presented here is really clever and enlightening. This sounds like a very compelling approach to counterfactual inference, which can exploit a lot of previous work on domain adaptation. The paper mentions that selecting the hyper-parameters (such as the regularization terms weights) in this scenario is not a trivial task. Indeed, measuring performance here requires knowing the true difference in intervention outcomes, which in practice usually cannot be known (e.g. two treatments usually cannot be given to the same patient once). In the paper, they somewhat "cheat" by using the ground truth difference in outcomes to measure out-of-sample performance, which the authors admit is unrealistic. Thus, an interesting avenue for future work would be to design practical hyper-parameter selection procedures for this scenario. I wonder whether the *reverse cross-validation* approach we used in our work on our adversarial approach to domain adaptation (see [Section 5.1.2][5]) could successfully be used here. Finally, I command the authors for presenting such a nicely written description of counterfactual inference problem setup in general, I really enjoyed it! [1]: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.pdf [2]: http://arxiv.org/abs/1505.07818 [3]: http://ijcai.org/Proceedings/09/Papers/200.pdf [4]: http://www.cs.nyu.edu/~mohri/pub/nadap.pdf [5]: http://arxiv.org/pdf/1505.07818v4.pdf#page=16 |

Collaborative Filtering for Implicit Feedback Datasets

Hu, Yifan and Koren, Yehuda and Volinsky, Chris

International Conference on Data Mining - 2008 via Local Bibsonomy

Keywords: collaborativfiltering, alternaterootsquare

Hu, Yifan and Koren, Yehuda and Volinsky, Chris

International Conference on Data Mining - 2008 via Local Bibsonomy

Keywords: collaborativfiltering, alternaterootsquare

[link]
This paper is about a recommendation system approach using collaborative filtering (CF) on implicit feedback datasets. The core of it is the minimization problem $$\min_{x_*, y_*} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$ with * $\lambda \in [0, \infty[$ is a hyper parameter which defines how strong the model is regularized * $u$ denoting a user, $u_*$ are all user factors $x_u$ combined * $i$ denoting an item, $y_*$ are all item factors $y_i$ combined * $x_u \in \mathbb{R}^n$ is the latent user factor (embedding); $n$ is another hyper parameter. $n=50$ seems to be a reasonable choice. * $y_i \in \mathbb{R}^n$ is the latent item factor (embedding) * $r_{ui}$ defines the "intensity"; higher values mean user $u$ interacted more with item $i$ * $p_{ui} = \begin{cases}1 & \text{if } r_{ui} >0\\0 &\text{otherwise}\end{cases}$ * $c_{ui} := 1 + \alpha r_{ui}$ where $\alpha \in [0, \infty[$ is a hyper parameter; $\alpha =40$ seems to be reasonable In contrast, the standard matrix factoriation optimization function looks like this ([example](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf)): $$\min_{x_*, y_*} \sum_{(u, i, r_{ui}) \in \mathcal{R}} {(r_{ui} - x_u^T y_i)}^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$ where * $\mathcal{R}$ is the set of all ratings $(u, i, r_{ui})$ - user $u$ has rated item $i$ with value $r_{ui} \in \mathbb{R}$ They use alternating least squares (ALS) to train this model. The prediction then is the dot product between the user factor and all item factors ([source](https://github.com/benfred/implicit/blob/master/implicit/recommender_base.pyx#L157-L176)) |

Learning semantic representations using convolutional neural networks for web search

Shen, Yelong and He, Xiaodong and Gao, Jianfeng and Deng, Li and Mesnil, Grégoire

ACM WWW (Companion Volume) - 2014 via Local Bibsonomy

Keywords: dblp

Shen, Yelong and He, Xiaodong and Gao, Jianfeng and Deng, Li and Mesnil, Grégoire

ACM WWW (Companion Volume) - 2014 via Local Bibsonomy

Keywords: dblp

[link]
Here the authors present a model which projects queries and documents into a low dimensional space, where you can fetch relevant documents by computing distance, *here cosine is used*, between the query vector and the document vectors. ### Model Description #### Word Hashing Layer They have used bag of tri-grams for representing words(office -> #office# -> {#of, off, ffi, fic, ice, ce#}). This is able to generalize unseen words and maps morphological variation of same words to points which are close in n-gram space. #### Context Window Vector Then for representing a sentence they are taking a `Window Size` around a word and appending them to form a context window vector. If we take `Window Size` = 3: (He is going to Office -> { [vec of 'he', vec of 'is', vec of 'going'], [vec of 'is', vec of 'going', vec of 'to'], [vec of 'going', vec of 'to', vec of 'Office'] } #### Convolutional Layer and Max-Pool layer Run a convolutional layer over each of the context window vector (for an intuition these are local features). Max pool over the resulting features to get global features. The output dimension is taken here to be 300. #### Semantic Layer Use a fully connected layer and project the 300-D vector to a 128-D vector. They have used two different networks, one for queries and other for documents. Now for each query and document (we are given labeled documents, one of them is positive and rest are negative) they compute the cosine similarity of the 128-D output vector. And then they learn the weights of convolutional filters and the fully connected layer by maximizing conditional likelihood of positive documents. My thinking is that they have used two different networks as their is significant difference between Query length and Document Length. |

About