Joseph Paul Cohen's profile - ShortScience.org

Fast Model Editing at Scale
Eric Mitchell and Charles Lin and Antoine Bosselut and Chelsea Finn and Christopher D. Manning
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.LG, cs.AI, cs.CL

[link] Summary by Joseph Paul Cohen 2 years ago

The goal of this work is to edit the model’s weights given new edit pairs ($x_e, y_e$) at test time. They achieve this by learning a "model editor network" that takes a fine tuning gradient computed from ($x_e, y_e$) and transforms this into a weight update. 

$$ f(\nabla W_l) \rightarrow \tilde\nabla W_l$$ 

The editor network is conditioned on the layer whose update it predicts, using a FiLM-style scale and shift.

The editor network is trained on a small set of examples ($D^{tr}_{edit}$). The paper states that this dataset contains edits that are similar to "the types of edits that will be made", which is interesting because it limits how well the editor can generalize to other kinds of edits.

An extra loss term is used to prevent unintended changes for other inputs to the model (called $x_{loc}$). This is achieved with the following loss, which keeps the predictions on those inputs the same before and after the edit.
$$L_{loc} = KL(p_{\theta_W}(\cdot | x_{loc}) \| p_{\theta_{\tilde{W}}}(\cdot | x_{loc}))$$
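
As a rough illustration of the locality loss, the sketch below computes $KL(p \| q)$ between the predicted distributions for one unrelated input before and after an edit. The logits are made-up numbers and the helper names are hypothetical; this is not the paper's implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a single x_loc, before and after applying an edit.
logits_before = [2.0, 0.5, -1.0]
logits_small_drift = [2.1, 0.4, -1.0]   # the edit barely touched this input
logits_large_drift = [-1.0, 0.5, 2.0]   # the edit changed this input a lot

p_before = softmax(logits_before)
loss_small = kl_divergence(p_before, softmax(logits_small_drift))
loss_large = kl_divergence(p_before, softmax(logits_large_drift))
```

The loss is zero when the edited model predicts exactly what the original did, and it grows as the predictions on $x_{loc}$ drift.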
 

Some intuition for why this works: the editor network $f$ approximates the full-dataset gradient from just a single example, so it is more efficient. It can shrink the changes to elements of the weight matrix that were disruptive to the loss during its training, information that would otherwise require many training examples to uncover.

Gifsplanation via Latent Shift: A Simple Autoencoder Approach to Counterfactual Generation for Chest X-rays
Joseph Paul Cohen and Rupert Brooks and Sovann En and Evan Zucker and Anuj Pareek and Matthew P. Lungren and Akshay Chaudhari
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.CV, cs.AI, eess.IV

[link] Summary by Joseph Paul Cohen 5 years ago

**Background:** The goal of this work is to indicate image features which are relevant to the prediction of a neural network and convey that information to the user by displaying a counterfactual image animation.

**The Latent Shift Method:** This method works with any pretrained encoder/decoder and classifier that are differentiable. No special considerations are needed during model training. With this approach they want the exact opposite of an adversarial attack, although it uses the same idea. They want to perturb the input image so that the classifier reduces its prediction. If they just compute $\frac{\partial f}{\partial x}$ and move the pixels directly, they will get an imperceptible difference like an adversarial attack. Using a decoder they can regularize the transformation so that it only yields valid images.

The encoder takes the input image and encodes it into a latent representation $z$. Then the decoder reconstructs the image and feeds this image into the classifier. The gradient is computed from the output of the classifier with respect to $z$. Subtracting the gradient from $z$ and reconstructing the image generates a counterfactual.

https://i.imgur.com/iuZGUTH.gif

They found that changing the prediction by -30% produces good-looking images. So they perform an iterative search along the vector defined by the gradient in the latent space until the prediction is reduced by 30%.

From this sequence a 2D image can be reconstructed which is similar to a traditional attribution map by taking the maximum pixel wise difference between every image and the unperturbed reconstruction.

https://i.imgur.com/V3PCgXZ.png
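
The procedure above can be sketched on a toy problem. Everything here is an assumption made for illustration: the linear `decode`, the logistic `classify`, the finite-difference gradient standing in for backpropagation, and reading "reduced by 30%" as an absolute drop of 0.3 in the predicted probability.

```python
import math

def decode(z):
    # Hypothetical linear decoder: 2 latent values -> 4 "pixels".
    return [z[0], 0.5 * z[0] + 0.5 * z[1], z[1], 0.2 * z[1]]

def classify(x):
    # Hypothetical classifier: logistic regression over the pixels.
    weights = [1.5, -0.5, 2.0, 0.3]
    score = sum(w * p for w, p in zip(weights, x)) - 1.0
    return 1.0 / (1.0 + math.exp(-score))

def grad_wrt_latent(z, h=1e-5):
    # Central finite differences of classify(decode(z)) with respect to z.
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((classify(decode(up)) - classify(decode(down))) / (2 * h))
    return grad

def latent_shift(z, target_drop=0.3, step=0.5, max_steps=200):
    # Walk along -gradient in latent space until the prediction has dropped enough.
    base = classify(decode(z))
    grad = grad_wrt_latent(z)
    frames = []
    for k in range(max_steps + 1):
        lam = k * step
        frame = decode([zi - lam * gi for zi, gi in zip(z, grad)])
        frames.append(frame)  # frames[0] is the unperturbed reconstruction
        if base - classify(frame) >= target_drop:
            break
    return frames

def attribution_map(frames):
    # Maximum pixel-wise difference between every frame and the reconstruction.
    reference = frames[0]
    return [max(abs(f[i] - reference[i]) for f in frames) for i in range(len(reference))]

frames = latent_shift([1.0, 0.8])
heatmap = attribution_map(frames)
```

The list `frames` plays the role of the animation, and `heatmap` is the 2D summary of which pixels changed most along the way.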

The results look great!

https://i.imgur.com/DBki84c.gif

https://i.imgur.com/kFfQNKD.gif

In order to validate whether this approach can help spot false positive predictions, two radiologists were asked to evaluate how confident they were in a model's predictions. For each image, the radiologists viewed the prediction in two ways, using traditional methods or the Latent Shift images. The traditional methods include the image gradient, guided backprop, and integrated gradients. The Latent Shift Counterfactual includes the animation as well as the 2D version.

https://i.imgur.com/TlUBhzL.png

What they would like to see is that for true positives the results are all 5 and for false positives they are all 1.
What they observe, however, is that many false positives still cause high confidence in the model predictions, but not as much as the true positives. Comparing the two methods, they find that for true positives the Latent Shift counterfactuals show a significant increase in confidence, which is good.

> 0.15±0.95 confidence increase using the Latent Shift method (p=0.01).

For false positives they find an increase in confidence but it is not significant.

> 0.04±1.06 increase which is not significant (p=0.57)

**Conclusions:**
- Latent Shift's ability to generate counterfactuals is pretty good!
- Vanilla autoencoders are sufficient for some pathologies.
- StyleGAN and higher quality models should improve performance.
- IoU analysis may not be the best fit.
- Explainable AI methods can have an impact on the user confidence in the model.

(Disclaimer: I am the author of this work)

Project Website: https://mlmed.org/gifsplanation/

Unsupervised Learning via Meta-Learning
Kyle Hsu and Sergey Levine and Chelsea Finn
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, cs.AI, cs.CV, stat.ML

[link] Summary by Joseph Paul Cohen 7 years ago

What is stopping us from applying meta-learning to new tasks? Where do the tasks come from? Designing task distribution is laborious. We should automatically learn tasks!

Unsupervised Learning via Meta-Learning: The idea is to use a distance metric in an out-of-the-box unsupervised embedding space created by BiGAN/ALI or DeepCluster to construct tasks in an unsupervised way. If you cluster points to randomly define classes (e.g. random k-means) you can then sample tasks of 2 or 3 classes and use them to train a model.

Where does the extra information come from? The metric space used for k-means asserts specific distances. The intuition for why this works is that it provides a useful model initialization for downstream tasks.
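
A minimal sketch of the task construction, assuming the embeddings are already computed. The random per-dimension scaling before k-means, the function names, and the tiny two-blob dataset are all assumptions for illustration rather than the paper's code.

```python
import random

def kmeans_labels(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

def sample_task(embeddings, rng, n_clusters=4, n_way=2, k_shot=2, max_tries=10):
    """Build one n_way task whose classes are clusters from a randomized k-means."""
    for _ in range(max_tries):
        # Random scaling changes the metric, so each call can give a new partition.
        scale = [rng.uniform(0.5, 2.0) for _ in embeddings[0]]
        scaled = [[s * v for s, v in zip(scale, e)] for e in embeddings]
        labels = kmeans_labels(scaled, n_clusters, rng)
        clusters = {}
        for index, lab in enumerate(labels):
            clusters.setdefault(lab, []).append(index)
        usable = [members for members in clusters.values() if len(members) >= k_shot]
        if len(usable) >= n_way:
            chosen = rng.sample(usable, n_way)
            return [(index, pseudo_label)
                    for pseudo_label, members in enumerate(chosen)
                    for index in rng.sample(members, k_shot)]
    raise RuntimeError("could not build a task from these embeddings")

rng = random.Random(0)
# Hypothetical 2-D embeddings drawn around two centers.
embeddings = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
              for cx, cy in [(0.0, 0.0), (5.0, 5.0)] for _ in range(10)]
task = sample_task(embeddings, rng)  # list of (example index, pseudo-label)
```

Each sampled task can then be fed to a meta-learner exactly as a labelled few-shot task would be, with the cluster index acting as the class label.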

This summary was written with the help of Chelsea Finn.

Model-Based Reinforcement Learning via Meta-Policy Optimization
Clavera, Ignasi and Rothfuss, Jonas and Schulman, John and Fujita, Yasuhiro and Asfour, Tamim and Abbeel, Pieter
Conference on Robot Learning - 2018 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 7 years ago

In terms of model based RL, learning dynamics models is imperfect, which often leads to the learned policy overfitting to the learned dynamics model, doing well in the learned simulator but not in the real world.

Key solution idea: No need to try to learn one accurate simulator. We can learn an ensemble of models that together will sufficiently represent the space. If we learn an ensemble of models (to be used as many learned simulators) we can denoise estimates of performance. In a meta-learning sense these simulations become the tasks. The real world is then just yet another task, to which the policy could adapt quickly.  One experimental observation is that at the start of training there is a lot of variation between learned simulators, and then the simulations come together over training, which might also point to this approach providing improved exploration.

This summary was written with the help of Pieter Abbeel.

A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms
Yoshua Bengio and Tristan Deleu and Nasim Rahaman and Rosemary Ke and Sébastien Lachapelle and Olexa Bilaniuk and Anirudh Goyal and Christopher Pal
arXiv e-Print archive - 2019 via Local arXiv
Keywords: cs.LG, stat.ML

[link] Summary by Joseph Paul Cohen 7 years ago

How can we learn causal relationships that explain data? We can learn from non-stationary distributions. If we experiment with different factorizations of relationships between variables we can observe which ones provide better sample complexity when adapting to distributional shift and therefore are likely to be causal.

If we consider the variables A and B we can factor them in two ways:

$P(A,B) = P(A)P(B|A)$ representing a causal graph like $A\rightarrow B$

$P(A,B) = P(A|B)P(B)$ representing a causal graph like $A \leftarrow B$

The idea is if we train a model with one of these structures; when adapting to a new shifted distribution of data it will take longer to adapt if the model does not have the correct inductive bias. For example let's say that the true relationship is $A$=Raining causes $B$=Open Umbrella (and not vice-versa). Changing the marginal probability of Raining (say because the weather changed) does not change the mechanism that relates $A$ and $B$ (captured by $P(B|A)$), but will have an impact on the marginal $P(B)$. 

So after this distributional shift the function that modeled $P(B|A)$ will not need to change because the relationship is the same.  Only the function that modeled $P(A)$ will need to change. Under the incorrect factorization $P(B)P(A|B)$, adaptation to the change will be slow because both $P(B)$ and $P(A|B)$ need to be modified to account for the change in $P(A)$ (due to Bayes rule).

Here a difference in sample complexity can be observed when modeling the joint of the shifted distribution.  $B\rightarrow A$ takes longer to adapt:
https://i.imgur.com/B9FEmA7.png

Here the idea is that sample complexity when adapting to a new distribution of data is a heuristic to inform us which causal graph inductive bias is correct.
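
A small numeric check of this argument, using made-up probabilities for $A$=Raining and $B$=Open Umbrella. Only $P(A)$ is shifted; the sketch then reads off the factors of both factorizations from the joint table.

```python
def joint(p_a, p_b_given_a):
    """Joint P(A, B) over binary A, B built from the causal factorization P(A) P(B|A)."""
    return {
        (a, b): (p_a if a else 1 - p_a) * (p_b_given_a[a] if b else 1 - p_b_given_a[a])
        for a in (0, 1) for b in (0, 1)
    }

def causal_factors(table):
    """Recover P(A=1) and P(B=1 | A=a) from a joint table."""
    p_a = table[(1, 0)] + table[(1, 1)]
    p_b_given_a = {a: table[(a, 1)] / (table[(a, 0)] + table[(a, 1)]) for a in (0, 1)}
    return p_a, p_b_given_a

def anticausal_factors(table):
    """Recover P(B=1) and P(A=1 | B=b) from a joint table."""
    p_b = table[(0, 1)] + table[(1, 1)]
    p_a_given_b = {b: table[(1, b)] / (table[(0, b)] + table[(1, b)]) for b in (0, 1)}
    return p_b, p_a_given_b

# Hypothetical mechanism: P(umbrella | no rain) = 0.05, P(umbrella | rain) = 0.9.
mechanism = {0: 0.05, 1: 0.9}
before = joint(0.2, mechanism)  # it rains 20% of the time
after = joint(0.7, mechanism)   # the weather changed: it rains 70% of the time

_, mech_before = causal_factors(before)
_, mech_after = causal_factors(after)
p_b_before, p_a_given_b_before = anticausal_factors(before)
p_b_after, p_a_given_b_after = anticausal_factors(after)
```

Under $A \rightarrow B$ only the single number $P(A)$ moves, while under $B \rightarrow A$ the marginal $P(B)$ and both entries of $P(A|B)$ all have to be relearned.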

Experimentally this works and they also observe that when models have more capacity it seems that the difference between the models grows.

This summary was written with the help of Yoshua Bengio.

Systematic Generalization: What Is Required and Can It Be Learned?
Dzmitry Bahdanau and Shikhar Murty and Michael Noukhovitch and Thien Huu Nguyen and Harm de Vries and Aaron Courville
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CL, cs.AI

[link] Summary by Joseph Paul Cohen 7 years ago

The paper discusses neural module network trees (NMN-trees). Here modules are composed in a tree structure to answer a question/task and modules are trained in different configurations to ensure they learn more core concepts and can generalize.

Longer summary:

How to perform systematic generalization? First we need to ask how
good current models are at understanding language. Adversarial
examples show how fragile these models can be. This leads us to
conclude that systematic generalization is an issue that requires
specific attention.

Maybe we should rethink the modeling assumptions being made. We can
think that samples can come from different data domains but are
generated by some set of shared rules. If we correctly learned these
rules then domain shift in the test data would not hurt model
performance. Currently we can construct an experiment to introduce
systematic bias in the data which causes the performance to suffer.
From this experiment we can start to determine what the issue is.

A recent idea for forcing a model to have more independent units
is neural module network trees (NMN-trees). Here modules are composed
in a tree structure to answer a question/task and modules are trained
in different configurations to ensure they learn more core concepts
and can generalize.

Effective Ways to Build and Evaluate Individual Survival Distributions
Humza Haider and Bret Hoehn and Sarah Davis and Russell Greiner
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, stat.ML

[link] Summary by Joseph Paul Cohen 7 years ago

The paper looks at approaches to predicting individual survival time distributions (isd). The motivation is shown in the figure below. Between two patients the survival time varies greatly so we should be able to predict a distribution like the red curve.

https://i.imgur.com/2r9JvUp.png

The paper studies the following methods: 
 - class-based survival curves Kaplan-Meier [31]
 - Kalbfleisch-Prentice extension of the Cox (cox-kp) [29]
 - Accelerated Failure Time (aft) model [29]
 - Random Survival Forest model with Kaplan-Meier extensions (rsf-km)
 - elastic net Cox (coxen-kp) [55] 
 - Multi-task Logistic Regression (mtlr) [57]

Looking at the predictions of these methods side by side we can observe some systematic differences between the methods:
https://i.imgur.com/vJoCL4a.png

The paper presents a "D-Calibration" metric (distributional calibration) which measures how well a method answers this question:

    Should the patient believe the predictions implied by the survival curve?


https://i.imgur.com/MX8CbZ7.png

Bidirectional RNN for Medical Event Detection in Electronic Health Records
Jagannatha, Abhyuday N. and Yu, Hong
The Association for Computational Linguistics HLT-NAACL - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 7 years ago

The basic approach is an RNN applied to text to predict a medical event such as an ICD code. It is unclear if the complicated Bi-RNN model is required. 

This has some useful applications such as 
- Adapt old databases
- Correct errors
- Upgrade ICD versions

A simple diagram of an RNN applied to medical text is shown below:


https://i.imgur.com/NPExLqH.png

Multi-layer Representation Learning for Medical Concepts
Edward Choi and Mohammad Taha Bahadori and Elizabeth Searles and Catherine Coffey and Jimeng Sun
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG

[link] Summary by Joseph Paul Cohen 7 years ago

This model called Med2Vec is inspired by Word2Vec. It is Word2Vec for time series patient visits with ICD codes. The model learns embeddings for medical codes as well as the demographics of patients.

https://i.imgur.com/Zjj6Xxz.png

The context is temporal. For each $x_t$ as input the model predicts $x_{t+1}$ and $x_{t-1}$ or more depending on the temporal window size.
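
To make the temporal context concrete, here is a small sketch that enumerates the (input visit, context visit) training pairs for one patient. The function name and the example visit sequence are hypothetical.

```python
def visit_context_pairs(visits, window=1):
    """Pair each visit x_t with the visits at most `window` steps before or after it."""
    pairs = []
    for t, visit in enumerate(visits):
        for offset in range(-window, window + 1):
            neighbor = t + offset
            if offset != 0 and 0 <= neighbor < len(visits):
                pairs.append((visit, visits[neighbor]))
    return pairs

# One made-up patient: each visit is the tuple of ICD codes recorded that day.
visits = [("I10",), ("I10", "E11.9"), ("N18.3",)]
pairs = visit_context_pairs(visits, window=1)
```

With `window=1` the middle visit is paired with both of its neighbors, while the first and last visits each have only one neighbor to predict.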

A Comparison of Word Embeddings for the Biomedical Natural Language Processing
Yanshan Wang and Sijia Liu and Naveed Afzal and Majid Rastegar-Mojarad and Liwei Wang and Feichen Shen and Paul Kingsbury and Hongfang Liu
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.IR

[link] Summary by Joseph Paul Cohen 7 years ago

This paper demonstrates that Word2Vec \cite{1301.3781} can extract relationships between words and produce latent representations useful for medical data. They explore this model on different datasets which yield different relationships between words.

https://i.imgur.com/hSA61Zw.png

The Word2Vec model works like an autoencoder that predicts the context of a word. The context of a word is composed of the surrounding words as shown below. Given the word in the center, the neighboring words are predicted through a bottleneck in the autoencoder. A word has many contexts in a corpus so the model can never have 0 error. The model must minimize the reconstruction error, which is how it learns the latent representation.

https://i.imgur.com/EMtjTHn.png

Subjectively we can observe the relationship between word vectors:

https://i.imgur.com/8C9EVq1.png

A Regularized Framework for Sparse and Structured Neural Attention.
Vlad Niculae and Mathieu Blondel
Neural Information Processing Systems Conference - 2017 via Local dblp
Keywords:

[link] Summary by Joseph Paul Cohen 8 years ago

The idea in this paper is to develop a version of attention that will incorporate similarity in neighboring bins. This aligns with the work \cite{conf/icml/BeckhamP17} which presented a different approach to deal with consistency between classes of predictions.

In this work the closed form softmax function is replaced by a small optimization problem with this regularizer:

$$ +\lambda \sum_{i=1}^{d-1} |y_{i+1}-y_i|$$

Because of this, many of the neighboring probabilities are exactly the same resulting in attention that can be seen as blocks.

https://i.imgur.com/oue0x4V.png
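
The regularizer itself is easy to evaluate. This sketch only computes the penalty for two hand-picked attention vectors (it does not solve the optimization problem) to show why block-shaped attention is cheaper.

```python
def fused_penalty(y, lam=1.0):
    """lambda * sum_i |y_{i+1} - y_i|, the penalty on neighboring differences."""
    return lam * sum(abs(y[i + 1] - y[i]) for i in range(len(y) - 1))

# Two hypothetical attention vectors that both sum to 1.
blocky = [0.1, 0.1, 0.4, 0.4]  # neighbors share values, one jump
spiky = [0.1, 0.4, 0.1, 0.4]   # same values, three jumps

penalty_blocky = fused_penalty(blocky)
penalty_spiky = fused_penalty(spiky)
```

Both vectors contain the same values, but the blocky arrangement pays for a single jump while the spiky one pays for three.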

Poster:
https://i.imgur.com/gclMjzR.png

Rotation equivariant vector field networks
Diego Marcos and Michele Volpi and Nikos Komodakis and Devis Tuia
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV

[link] Summary by Joseph Paul Cohen 8 years ago

This work deals with rotation equivariant convolutional filters. The idea is that when you rotate an image you should not need to relearn new filters to deal with this rotation. First we can look at how convolutions typically handle rotation and how we would expect a rotation invariant solution to perform below:

| | |
| - | - |
| https://i.imgur.com/cirTi4S.png | https://i.imgur.com/iGpUZDC.png |

The method computes all possible rotations of the filter which results in a list of activations where each element represents a different rotation. From this list the maximum is taken which results in a two dimensional output for every pixel (rotation, magnitude). This happens at the pixel level so the result is a vector field over the image.
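
A minimal sketch of the max-over-rotations step for a single pixel. As a simplifying assumption it only uses the four rotations by multiples of 90 degrees, which need no interpolation, and applies a hand-written edge filter to one 3x3 patch; the function names are hypothetical.

```python
def rotate90(grid):
    """Rotate a square grid by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]

def response(kernel, patch):
    """Filter response at one location: element-wise product, then sum."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))

def orientation_response(kernel, patch):
    """Return (magnitude, angle) of the strongest response over the rotated filters."""
    best_magnitude, best_angle = None, None
    rotated = kernel
    for step in range(4):
        value = response(rotated, patch)
        if best_magnitude is None or value > best_magnitude:
            best_magnitude, best_angle = value, 90 * step
        rotated = rotate90(rotated)
    return best_magnitude, best_angle

edge_filter = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
patch = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]

upright = orientation_response(edge_filter, patch)
turned = orientation_response(edge_filter, rotate90(patch))
```

Rotating the patch leaves the magnitude unchanged and shifts the reported angle by the same amount, which is the equivariance property the summary describes.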


https://i.imgur.com/BcnuI1d.png

We can visualize their degree selection method with a figure from https://arxiv.org/abs/1603.04392 which determined the rotation of a building:

https://i.imgur.com/hPI8J6y.png

We can also think of this approach as attention \cite{1409.0473} where they attend over the possible rotations to obtain a score for each possible rotation value to pass on. The network can learn to adjust the rotation value to be whatever value the later layers will need. 

------------------------

Results on [Rotated MNIST](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations) show an impressive improvement in training speed and generalization error:




https://i.imgur.com/YO3poOO.png

Introspective Generative Modeling: Decide Discriminatively
Justin Lazarow and Long Jin and Zhuowen Tu
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV, cs.LG, cs.NE

[link] Summary by Joseph Paul Cohen 8 years ago

In this work they take a different approach to the GAN model \cite{1406.2661}. In the traditional GAN model a neural network is trained to up-sample from random noise in a feed forward fashion to generate samples from the data distribution.

This work instead iteratively perturbs an image of random noise, similar to Artistic Style Transfer \cite{1508.06576}. The image is perturbed in order to fool a set of discriminators. The set of discriminators is obtained by training one at each step, starting from random noise, up to some maximum step $t$.


1. At first a discriminator is trained to discriminate between the true data and random noise.
2. Images are then perturbed using gradients which aim to fool the discriminator, and are included in the data distribution as negative examples.
3. The discriminator is trained on the true data + random noise + fake data from the previous steps.

The images generated at each step are shown below:

https://i.imgur.com/kp575s8.png

After being trained the model is able to generate a sample by starting from random noise and applying gradient updates from each trained discriminator in turn. For this, only the weights of the discriminators need to be stored.

Poster from ICCV2017:
https://i.imgur.com/vYSSdZx.png

Gene expression inference with deep learning
Chen, Yifei and Li, Yi and Narayan, Rajiv and Subramanian, Aravind and Xie, Xiaohui
Bioinformatics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

"This deals with a specific prediction task, namely to predict the expression of specified target genes from a panel of about 1,000 pre-selected “landmark genes”. As the authors explain, gene expression levels are often highly correlated and it may be a cost-effective strategy in some cases to use such panels and then computationally infer the expression of other genes. Based on Pylearn2/Theano."

https://github.com/uci-cbcl/D-GEX

https://followthedata.wordpress.com/2015/12/21/list-of-deep-learning-implementations-in-biology/

Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery
Wenwen Min and Juan Liu and Shihua Zhang
arXiv e-Print archive - 2016 via Local arXiv
Keywords: q-bio.GN, cs.LG, stat.ML, J.3; H.2.8; G.1.6; I.5

[link] Summary by Joseph Paul Cohen 8 years ago

In this paper they place a prior on a logistic regression model using known protein-protein interactions. They do so by regularizing the weights of the model using the Laplacian encoding of a graph.

Here is a regularization term of this form:

$$\lambda ||w||_1 + \eta w^T L w,$$

#### A small example:

Given a small graph of three nodes A, B, and C with one edge: {A-B} we have the following Laplacian:

$$
L = D - A = 
\left[\array{
1 & 0 & 0 \\
0 & 1 & 0\\
0 & 0 & 0}\right]
-
\left[\array{
0 & 1 & 0 \\
1 & 0 & 0\\
0 & 0 & 0}\right]$$

$$L = 
\left[\array{
1 & -1 & 0 \\
-1 & 1 & 0\\
0 & 0 & 0}\right]
$$

If we have a small linear regression of the form:

$$y = x_Aw_A + x_Bw_B + x_Cw_C$$

Then we can look at how $w^TLw$ will impact the weights to gain insight:

$$w^TLw $$

$$=
\left[\array{
w_A &
w_B &
w_C}\right]
\left[\array{
1 & -1 & 0 \\
-1 & 1 & 0\\
0 & 0 & 0}\right]
\left[\array{
w_A \\
w_B \\
w_C}\right] 
$$

$$= 
\left[\array{
w_A &
w_B &
w_C}\right]
\left[\array{
w_A -w_B \\
-w_A + w_B \\
0}\right] 
$$



$$
= 
(w_A^2 -w_Aw_B ) + 
(-w_Aw_B + w_B^2)
$$

The squared terms act like an ordinary $L2$ penalty, so we can set them aside and look only at the cross terms to see the graph-specific impact of the regularization.

$$
= 
(-w_Aw_B ) + 
(-w_Aw_B)
$$

$$ = -2w_Aw_B$$

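Taken alone, the cross term rewards connected nodes whose weights are large and share the same sign. Combined with the squared terms, the full penalty is $(w_A - w_B)^2$, which pulls the weights of connected nodes toward each other. With the $L1$ penalty that is also used, the weights cannot grow without bound.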

#### A few more experiments:

If we perform the same computation for a graph with two edges: {A-B, B-C} we have the following term which increases the weights of both pairwise interactions:

$$ = -2w_Aw_B -2w_Bw_C$$

If we perform the same computation for a graph with two edges: {A-B, A-C} we have no surprises: 

$$ = -2w_Aw_B -2w_Aw_C$$
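
These hand computations can be checked numerically. The sketch below builds $L = D - A$ from an edge list, evaluates $w^TLw$, and compares it with the sum of $(w_i - w_j)^2$ over edges, a standard identity for unweighted graph Laplacians. The weight values are arbitrary.

```python
def laplacian(n_nodes, edges):
    """Graph Laplacian L = D - A for an undirected, unweighted edge list."""
    lap = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        lap[i][i] += 1
        lap[j][j] += 1
        lap[i][j] -= 1
        lap[j][i] -= 1
    return lap

def quadratic_form(w, lap):
    """Compute w^T L w."""
    return sum(w[i] * lap[i][j] * w[j] for i in range(len(w)) for j in range(len(w)))

def edge_penalty(w, edges):
    """Sum of squared weight differences across edges."""
    return sum((w[i] - w[j]) ** 2 for i, j in edges)

A, B, C = 0, 1, 2
w = [2.0, -1.0, 0.5]  # arbitrary weights for nodes A, B, C

one_edge = [(A, B)]
chain = [(A, B), (B, C)]
star = [(A, B), (A, C)]
penalties = {name: quadratic_form(w, laplacian(3, edges))
             for name, edges in [("one_edge", one_edge), ("chain", chain), ("star", star)]}
```

Expanding each squared difference gives back the squared terms plus the $-2w_iw_j$ cross terms derived above.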

Another thing to think about is if there are no edges. If by default there are self-loops then the degree matrix will have 1 on the diagonal and it will be the identity which will be an $L2$ term. If no self loops are defined then the result is a 0 matrix yielding no regularization at all.

#### Contribution:

A contribution of this paper is to use the absolute value of the weights to make training easier. 

$$|w|^T L |w|$$

TODO: Add more about how this impacts learning.



#### Overview

Here a high level figure shows the data and targets together with a graph prior. It looks nice so I wanted to include it.

https://i.imgur.com/rnGtHqe.png

Efficient Deformable Motion Correction for 3-D Abdominal MRI Using Manifold Regression
Chen, Xin and Balfour, Daniel R. and Marsden, Paul K. and Reader, Andrew J. and Prieto, Claudia and King, Andrew P.
Medical Image Computing and Computer Assisted Interventions Conference - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

This work aims to produce more spatially consistent MRI images when the patient is breathing during MRI acquisition.

https://i.imgur.com/wWMQa1D.png

Deep Adversarial Networks for Biomedical Image Segmentation Utilizing Unannotated Images
Zhang, Yizhe and Yang, Lin and Chen, Jianxu and Fredericksen, Maridel and Hughes, David P. and Chen, Danny Z.
Medical Image Computing and Computer Assisted Interventions Conference - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

This work improves the performance of a segmentation network by utilizing unlabelled data. They use a discriminator (they call EN) to distinguish between annotated and unannotated examples. They then train the segmentation generator (they call SN) based on what will fool the discriminator. 

https://i.imgur.com/7CfKnh5.png

Three training phases are shown above

This work is really great. They are using the segmentation to condition the discriminator which will learn to point out flaws when applying the segmentation to the unlabelled examples. Then these flaws in the segmentation are corrected by using the gradients from the discriminator to adjust the segmentation.

In contrast with other semi-supervised approaches, which learn a latent space for all samples, labelled and unlabelled, and then use this space to learn a classifier or segmentation, this approach looks for the boundaries of the space only. The unlabelled examples are used to bias the representation learned by the segmentation network to conform to the distribution represented by all observed examples.

Read this paper for more: https://arxiv.org/abs/1611.08408

Poster:
https://i.imgur.com/eR5jgwn.png

Squeeze-and-Excitation Networks
Jie Hu and Li Shen and Gang Sun
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV

[link] Summary by Joseph Paul Cohen 8 years ago

"The SE module can learn some nonlinear global interactions already known to be useful, such as spatial normalization. The channel wise weights make it somewhat more powerful than divisive normalization as it can learn feature-specific inhibitions (ie: if we see a lot of flower parts, the probability of boat features should be diminished). It also has some similarity to bio inhibitory circuits." By jcannell on reddit

Slides: http://image-net.org/challenges/talks_2017/SENet.pdf

Summary by the author Jie Hu:

Our motivation is to explicitly model the interdependence between feature channels. In addition, we do not intend to introduce a new spatial dimension for the integration of feature channels, but rather a new "feature re-calibration" strategy. Specifically, the network learns to automatically obtain the importance of each feature channel, and then uses this importance to enhance the useful features and suppress the features that are not useful for the current task.

https://i.imgur.com/vXyBg4j.png

The above figure is a schematic diagram of our proposed SE module. Given an input $x$ with $c_1$ feature channels, a series of convolutions and other general transformations produces features with $c_2$ channels. Unlike traditional CNNs, we then re-calibrate these features through the next three operations.

The first is the Squeeze operation: we compress the features along the spatial dimension, turning each two-dimensional feature channel into a real number. This real number has a global receptive field, and the output dimension matches the number of input channels. It characterizes the global distribution of responses on the feature channels and lets layers near the input also obtain a global receptive field, which is very useful in many tasks.

Next is the Excitation operation, a mechanism similar to the gates in a recurrent neural network. A weight is generated for each feature channel by the parameter $w$, where $w$ is learned to explicitly model the correlation between the feature channels.
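
A minimal sketch of the module in plain Python, assuming the form used in the SE paper: global average pooling for Squeeze, two small fully connected layers (ReLU, then sigmoid) for Excitation, and a final channel-wise rescaling, which is the third operation and is not described in the text above. The weights and feature maps are made-up numbers.

```python
import math

def squeeze(feature_maps):
    """Global average pooling: one real number per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

def dense(vector, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]

def excite(descriptor, w1, b1, w2, b2):
    """Bottleneck FC -> ReLU -> FC -> sigmoid, giving one gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    return [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]

def se_block(feature_maps, w1, b1, w2, b2):
    """Rescale every channel by its learned gate."""
    gates = excite(squeeze(feature_maps), w1, b1, w2, b2)
    return [[[gate * v for v in row] for row in fmap]
            for fmap, gate in zip(feature_maps, gates)]

# Two hypothetical 2x2 feature channels and hand-picked weights (2 -> 1 -> 2).
feature_maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 2.0], [2.0, 0.0]]]
w1, b1 = [[0.5, -1.0]], [0.0]
w2, b2 = [[2.0], [-2.0]], [0.0, 0.0]

descriptor = squeeze(feature_maps)
gates = excite(descriptor, w1, b1, w2, b2)
recalibrated = se_block(feature_maps, w1, b1, w2, b2)
```

With these weights the first channel is kept almost intact while the second is strongly suppressed, which is the "feature re-calibration" behaviour described above.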

Reddit thread: https://www.reddit.com/r/MachineLearning/comments/6pt99z/r_squeezeandexcitation_networks_ilsvrc_2017/

Supervised Intra-embedding of Fisher Vectors for Histopathology Image Classification
Song, Yang and Chang, Hang and Huang, Heng and Cai, Weidong
Medical Image Computing and Computer Assisted Interventions Conference - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

The goal of this work is to classify histopathology images into benign and malignant.  They use the BreaKHis and IICBU 2008 lymphoma datasets.

They use a VGG network for feature extraction from each image. Then on these VGG feature vectors they learn [Fisher Vectors](https://prateekvjoshi.com/2014/08/23/image-classification-using-fisher-vectors/) which they use to make a prediction.

It is unclear why Fisher Vectors are more useful than the fully connected layers of the VGG net that they replace. It is not clear how much analysis was performed for the VGG baseline. Also, as a baseline a VGG network should have been trained from scratch to extract domain specific features. 

Poster:
https://i.imgur.com/fgzmeYv.png

Training CNNs for Image Registration from Few Samples with Model-based Data Augmentation
Uzunova, Hristina and Wilms, Matthias and Handels, Heinz and Ehrhardt, Jan
Medical Image Computing and Computer Assisted Interventions Conference - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

The authors state that the usual approach to cope with few training samples is data augmentation. They extend a method of modelling the data from \cite{10.1016/j.media.2017.02.003} and use it to train a neural network. The figure below shows the overview:

https://i.imgur.com/joLNyfc.png

At the core of the deformation model they determine a set of $m$ landmarks $s_i$, which they deform, and then perform an affine transformation to warp the image to align to these points. The points are moved in a constrained way. They state the constraint is a "multi-level B-spline scattered data approximation".

Here is the poster: https://i.imgur.com/enQQqxC.png

Image-to-Image Translation with Conditional Adversarial Networks
Phillip Isola and Jun-Yan Zhu and Tinghui Zhou and Alexei A. Efros
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV

[link] Summary by Joseph Paul Cohen 8 years ago

Summary by [brannondorsey](https://gist.github.com/brannondorsey/fb075aac4d5423a75f57fbf7ccc12124):

- Euclidean distance between predicted and ground truth pixels is not a good method of judging similarity because it yields blurry images.
- GANs learn a loss function rather than using an existing one.
- GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss.
- Conditional GANs (cGANs) learn a mapping from observed image `x` and random noise vector `z` to `y`: `y = f(x, z)`
- The generator `G` is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator, `D`, which is trained to do as well as possible at detecting the generator's "fakes".
- The discriminator `D` learns to classify between real and synthesized pairs. The generator learns to fool the discriminator.
- Unlike an unconditional GAN, both the generator and discriminator observe an input image `x`.
- Asks `G` to not only fool the discriminator but also to be near the ground truth output in an `L2` sense.
- `L1` distance between the output of `G` and the ground truth is used over `L2` because it encourages less blurring.
- Without `z`, the net could still learn a mapping from `x` to `y` but would produce deterministic outputs (and therefore fail to match any distribution other than a delta function). Past conditional GANs have acknowledged this and provided Gaussian noise `z` as an input to the generator, in addition to `x`.
- Either vanilla encoder-decoder or Unet can be selected as the model for `G` in this implementation.
- Both generator and discriminator use modules of the form convolution-BatchNorm-ReLU.
- A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid.
- Input and output images differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output.
- `L1` loss does very well at low frequencies (I think this means general tonal-distribution/contrast, color-blotches, etc) but fails at high frequencies (crispness/edge/detail) (thus you get blurry images). This motivates restricting the GAN discriminator to only model high frequency structure, relying on an `L1` term to force low frequency correctness. In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each `NxN` patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of `D`.
- Because PatchGAN assumes independence between pixels separated by more than a patch diameter (`N`) it can be thought of as a form of texture/style loss.
- To optimize our networks we alternate between one gradient descent step on `D`, then one step on `G` (using minibatch SGD applying the Adam solver)
- In our experiments, we use batch size `1` for certain experiments and `4` for others, noting little difference between these two conditions.
- __To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation.__
- Evaluating the quality of synthesized images is an open and difficult problem. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.
- FCN-Score: while quantitative evaluation of generative models is known to be challenging, recent works have tried using pre-trained semantic classifiers to measure the discriminability of the generated images as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized image correctly as well.
- cGANs seem to work much better than GANs for this type of image-to-image transformation: with an unconditional GAN, the generator appears to collapse into producing nearly the exact same output regardless of the input photograph.
- `16x16` PatchGAN produces sharp outputs but causes tiling artifacts, `70x70` PatchGAN alleviates these artifacts. `256x256` ImageGAN doesn't appear to improve the tiling artifacts and yields a lower FCN-score.
- An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. This allows us to train on, say, `256x256` images and test/sample/generate on `512x512`.
- cGANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks.
- When semantic segmentation is required (i.e. going from image to label) `L1` performs better than `cGAN`. We argue that for vision problems, the goal (i.e. predicting output close to ground truth) may be less ambiguous than graphics tasks, and reconstruction losses like `L1` are mostly sufficient.
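
For reference, the objectives discussed in the bullets above can be written out as in the paper. Here $\lambda$ weights the `L1` term; dropping the adversarial term gives the `L1`-only variant, and setting $\lambda = 0$ gives the `cGAN`-only variant.

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$$

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_1]$$

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)$$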

### Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.

### Misc

- Least absolute deviations (`L1`) and least squares error (`L2`) are two standard loss functions; the choice between them decides which quantity is minimized while learning from a dataset. ([source](http://rishy.github.io/ml/2015/04/28/l1-vs-l2-loss/))
- How, using pix2pix, do you specify a loss of `L1`, `L1+GAN`, and `L1+cGAN`?

### Resources
- [GAN paper](https://arxiv.org/pdf/1406.2661.pdf)

arxiv.org
scholar.google.com

Learning Hierarchical Features from Generative Models
Zhao, Shengjia and Song, Jiaming and Ermon, Stefano
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

A Critical Paper Review by Alex Lamb: https://www.youtube.com/watch?v=_seX4kZSr_8

arxiv.org
arxiv-vanity.com
scholar.google.com

On orthogonality and learning recurrent networks with long term dependencies
Eugene Vorontsov and Chiheb Trabelsi and Samuel Kadoury and Chris Pal
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.NE

[link] Summary by Joseph Paul Cohen 8 years ago

Here is a video overview: https://www.youtube.com/watch?v=t-fow6GJepQ

Here is an image of the poster: https://i.imgur.com/Ti9btj9.png

proceedings.mlr.press
scholar.google.com

Unimodal Probability Distributions for Deep Ordinal Classification
Beckham, Christopher and Pal, Christopher J.
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

Short overview from ICML: https://youtube.com/watch?v=GMG5bFciuIA

Long overview from ICML: https://youtu.be/o6dtDuldsEo

arxiv.org
arxiv-vanity.com
scholar.google.com

Self-Normalizing Neural Networks
Günter Klambauer and Thomas Unterthiner and Andreas Mayr and Sepp Hochreiter
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, stat.ML

[link] Summary by Joseph Paul Cohen 9 years ago

"Using the "SELU" activation function, you get better results than any other activation function, and you don't have to do batch normalization. The "SELU" activation function is:

if x<0, 1.051\*(1.673\*e^x-1.673) if x>0, 1.051\*x" 

Source: narfon2, reddit


```
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale*np.where(x>=0.0, x, alpha*np.exp(x)-alpha)
```
Source: CaseOfTuesday, reddit

Discussion here: https://www.reddit.com/r/MachineLearning/comments/6g5tg1/r_selfnormalizing_neural_networks_improved_elu/

arxiv.org
scholar.google.com

Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

One core aspect of this attention approach is that it provides the ability to debug the learned representation by visualizing the softmax output (later called $\alpha_{ij}$) over the input words for each output word as shown below.

https://i.imgur.com/Kb7bk3e.png

In this approach each decoder unit attends over the hidden states of the input sequence. A score is computed for each state separately, so the input length can vary; a softmax is then applied and the resulting probabilities are used to weight and sum the states. This forms the memory used by each unit to make a prediction, and bypasses the need for the network to encode everything in the state passed between units.

Each hidden unit is computed as:

$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$

Where $s_{i-1}$ is the previous state and $y_{i-1}$ is the previous target word. Their contribution is $c_i$. This is the context vector which contains the memory of the input phrase.

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

Here $T_x$ is the length of the input sequence, $\alpha_{ij}$ is the output of a softmax for the $j$th element of the input sequence, and $h_j$ is the hidden state at the point the RNN was processing that element.
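
As a rough illustration of the context vector computation, here is a minimal sketch in plain Python. The alignment scores `e` are assumed to be given (in the paper they come from a learned alignment model), and the function names are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def context_vector(scores, hidden_states):
    """Return (alpha, c) where c = sum_j alpha_j * h_j."""
    alpha = softmax(scores)
    dim = len(hidden_states[0])
    c = [sum(a * h[k] for a, h in zip(alpha, hidden_states)) for k in range(dim)]
    return alpha, c

# Toy example: three input positions with two-dimensional hidden states.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e = [0.0, 0.0, 0.0]  # hypothetical alignment scores; equal scores give uniform attention
alpha, c = context_vector(e, h)
```

Because `alpha` always sums to one regardless of how many states there are, the same code works for any input length, and `alpha` is exactly what gets visualized in the figure above.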

papers.nips.cc
scholar.google.com

Learning To Count Objects in Images
Lempitsky, Victor S. and Zisserman, Andrew
Neural Information Processing Systems Conference - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

They introduce the concept of counting in images by predicting a density map. Their training only requires dot annotations on the center of objects. Each dot is expanded to a Gaussian to form a density. A model is trained to predict this density and then the total count is recovered by integrating over the resulting density map.
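
To make the "integrate the density to get the count" idea concrete, here is a small sketch that builds a ground truth density map from dot annotations. The kernel width `sigma`, the grid size, and the function name are assumptions for illustration, not values from the paper.

```python
import math

def density_map(dots, height, width, sigma=1.5):
    """Sum one normalized 2D Gaussian per dot annotation given as (row, col)."""
    density = [[0.0] * width for _ in range(height)]
    for (r0, c0) in dots:
        kernel = [[math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2 * sigma ** 2))
                   for c in range(width)] for r in range(height)]
        # Normalize on the grid so that each object integrates to exactly 1.
        norm = sum(map(sum, kernel))
        for r in range(height):
            for c in range(width):
                density[r][c] += kernel[r][c] / norm
    return density

dots = [(5, 5), (5, 7), (12, 14)]  # hypothetical object centers
D = density_map(dots, 20, 20)
count = sum(map(sum, D))  # integrating the density recovers the number of dots
```

Summing over a sub-window instead of the whole map gives an estimate of the count inside that window, which is what makes the density representation convenient.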

They create a function to produce the density based on quantized dense SIFT features \cite{lowe03distinctive} from every pixel in the image. A simple version of the definition of $F$ is shown below. Each pixel becomes an $x_p$ vector which is used to train a model to implement the function $F$.

$$\forall p \in I, \hspace{10pt} F(p|w) = wx_p $$

They obtained the quantized dense SIFT features using the [VLFEAT](http://www.vlfeat.org/overview/dsift.html) library. The significant part of the code is shown below:

```
im = imread(['data/' num2str(j, '%03d') 'cell.png']);
im = im(:,:,3); %using the blue channel to compute data

disp('Computing dense SIFT...');
[f d] = vl_dsift(single(im)); %computing the dense sift descriptors centered at each pixel
%estimating the crop parameters where SIFTs were not computed:
minf = floor(min(f,[],2));
maxf = floor(max(f,[],2));
minx = minf(1);
miny = minf(2);
maxx = maxf(1);
maxy = maxf(2);   

%simple quantized dense SIFT, each image is encoded as MxNx1 numbers of
%dictionary entries numbers with weight 1 (see the NIPS paper):
disp('Quantizing SIFTs...');
features{j} = vl_ikmeanspush(uint8(d),Dict);
features{j} = reshape(features{j}, maxy-miny+1, maxx-minx+1);
weights{j} = ones(size(features{j}));   
```

They benchmark their algorithm using "Bacterial cells in fluorescence-light microscopy images". The heatmap to the right shows the predicted density.

https://i.imgur.com/Vz463nu.png

The evaluation is performed by training on $N$ images (with $N$ in a validation set) and then testing on 100 randomly picked images in a hold out set. They show that using more images results in less variance and higher accuracy.

https://i.imgur.com/hihfC8V.png

Paper website: http://www.robots.ox.ac.uk/~vgg/research/counting/index_org.html

arxiv.org
arxiv-vanity.com
scholar.google.com

Unsupervised Domain Adaptation by Backpropagation
Yaroslav Ganin and Victor Lempitsky
arXiv e-Print archive - 2014 via Local arXiv
Keywords: stat.ML, cs.LG, cs.NE

[link] Summary by Joseph Paul Cohen 9 years ago

The goal of this method is to create a feature representation $f$ of an input $x$ that is domain invariant over some domain $d$. The feature vector $f$ is obtained from $x$ using an encoder network (e.g. $f = G_f(x)$). 

The reason this is an issue is that the input $x$ is correlated with $d$ and this can confuse the model to extract features that capture differences in domains instead of differences in classes. Here I will recast the problem differently from in the paper:

**Problem:** Given a conditional probability $p(x|d=0)$ that may be different from $p(x|d=1)$:

$$p(x|d=0) \stackrel{?}{\ne} p(x|d=1)$$

we would like it to be the case that these distributions are equal.

$$p(G_f(x) |d=0) = p(G_f(x)|d=1)$$

aka:

$$p(f|d=0) = p(f|d=1)$$

Of course this is an issue if some class label $y$ is correlated with $d$ meaning that we may hurt the performance of a classifier that now may not be able to predict $y$ as well as before.

https://i.imgur.com/WR2ujRl.png

The paper proposes adding a domain classifier network to the feature vector using a reverse gradient layer. This layer simply flips the sign on the gradient. Here is an example in [Theano](https://github.com/Theano/Theano):

```
class ReverseGradient(theano.gof.Op):
    ...
    def grad(self, input, output_gradients):
        return [-output_gradients[0]]
```

You then train this domain network as if you want it to correctly predict the domain (appending its error to your loss function). As the domain network learns new ways to correctly predict an output, these gradients will be flipped and the information in feature vector $f$ will be removed.

There are two major hyperparameters of the method. One is the number of dimensions at the bottleneck, though it is tied to your network architecture. The second is a scalar on the gradient so you can increase or decrease the effect of the gradient on the embedding.
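
The gradient scalar can be sketched without any framework. The following toy layer is only an illustration of the idea (identity on the forward pass, negated and scaled gradient on the backward pass); the class and the name `lam` are made up here and are not the paper's code.

```python
class GradientReversal:
    """Identity on the forward pass; multiplies the gradient by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical name for the scalar on the gradient

    def forward(self, features):
        # The domain classifier sees the features unchanged.
        return list(features)

    def backward(self, grad_output):
        # The encoder receives the flipped (and scaled) gradient.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=0.5)
f = [0.2, -1.0, 3.0]
out = layer.forward(f)
grad_f = layer.backward([1.0, -2.0, 0.0])
```

A larger `lam` pushes the encoder harder toward domain-invariant features, at the possible cost of class accuracy when the label is correlated with the domain.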

arxiv.org
arxiv-vanity.com
scholar.google.com

End-to-End Instance Segmentation and Counting with Recurrent Attention
Mengye Ren and Richard S. Zemel
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.CV

[link] Summary by Joseph Paul Cohen 9 years ago

This combines the ideas of recurrent attention to perform object detection in an image \cite{1406.6247} for multiple objects \cite{1412.7755} with semantic segmentation \cite{1505.04366}. 

Segmenting subregions is to avoid a global resolution bias (the object would take up the majority of pixels) and to allow multiple scales of objects to be segmented. 

Here is a video that demos the method described in the paper:

https://youtu.be/BMVDhTjEfBU

arxiv.org
arxiv-vanity.com
scholar.google.com

Hypercolumns for Object Segmentation and Fine-grained Localization
Bharath Hariharan and Pablo Arbeláez and Ross Girshick and Jitendra Malik
arXiv e-Print archive - 2014 via Local arXiv
Keywords: cs.CV

[link] Summary by Joseph Paul Cohen 9 years ago

So the hypercolumn is just a big vector created from a network:

`"We concatenate features from some or all of the feature maps in the network into one long vector for every location which we call the hypercolumn at that location. As an example, using pool2 (256 channels), conv4 (384 channels) and fc7 (4096 channels) from the architecture of [28] would lead to a 4736 dimensional vector."`

So how exactly do we construct the vector? 

![](https://i.imgur.com/hDvHRwT.png)

Each activation map results in a single element of the resulting hypercolumn. The corresponding pixel location in each activation map is used, as if the activation maps were all scaled to the size of the original image.

The paper shows the below formula for the calculation. Here $\mathbf{f}_i$ is the value at pixel $i$ in the scaled space and the $\mathbf{F}_{k}$ are points in the activation map. $\alpha_{ik}$ weights the known values to produce the in-between points.

$$\mathbf{f}_i  = \sum_k \alpha_{ik} \mathbf{F}_{k}$$

Then the fully connected layers are simply appended to complete the vector. 

So this gives us a representation for each pixel but is it a good one? The later layers will have the input pixel in their receptive field. After the first few layers it is expected that the spatial constraint is not strong.

doi.ieeecomputersociety.org
sci-hub
scholar.google.com

Learning to count with deep object features
Seguí, Santi and Pujol, Oriol and Vitrià, Jordi
Conference and Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

This paper discusses some amazing results. The goal is to learn how to count by end-to-end training. The network input is an image and the output is a count of the objects inside it. They do not perform any direct training using the locations of the objects in the image. 

The reason for avoiding direct training is that labeled data is expensive. Employing a surrogate objective, such as the count of items in the image, is much cheaper and makes more sense because it is the goal of the system we want to learn. This paper states that it is possible! They discuss experiments on two datasets: one of MNIST digits placed in an image and one with the UCSD Pedestrian Database.

The network description seems to be general and they don't report any special constraints on the design  `"We consider networks of two or more convolutional layers followed by one or more fully connected layers. Each convolutional layer consist of several elements: a set of convolutional filters, ReLU non-linearities, max pooling layers and normalization layers."` and `"We use a five layers architecture CNN with two convolutional layers followed by three fully connected layers"`. They provide these two tables for their designs:

$$\begin{array}{c|c|c|c}
 Conv1 &  Conv2 & FC1 & FC2  \\ \hline
10\text{x}15\text{x}15 & 10\text{x}3\text{x}3 & 32 & 6 \\
\text{x2 pool} & \text{x2 pool} & & \\ \hline
\end{array}\\
\text{CNN arch for numbers}$$

$$
\begin{array}{c|c|c|c|c}
 Conv1 &  Conv2 & FC1 & FC2 & FC3 \\ \hline
8\text{x}9\text{x}9 & 8\text{x}5\text{x}5 & 128 & 128 & 25 \\
\text{x2 pool} & \text{x2 pool} & & & \\ \hline
\end{array}\\
\text{CNN arch for people}$$

They state that they use a method based on hypercolumns \cite{1411.5752} but the description is not clear at all: `"Starting with the hypercolumn representation on the last layer we cluster the resulting hypercolumns into a set of prototypes using an online k-means algorithm. Then, a MIL approach with positive and negative instances with the concept of interest is used."`

![](https://i.imgur.com/x2q3E9Y.png)

Interesting work but I wish it was a longer paper with more details. This paper doesn't really give me enough information to reproduce it.

doi.acm.org
sci-hub
scholar.google.com

Efficient Activity Retrieval through Semantic Graph Queries
Castañón, Gregory D. and Chen, Yuting and Zhang, Ziming and Saligrama, Venkatesh
Special Interest Group in Multimedia (ACM SIGMM) - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

This paper poses the problem of querying a large corpus of aerial video as a subgraph matching problem. Here the video data has been transformed into a large graph where each frame contains labeled objects such as person, object, or car, which become nodes, and the edges are relationships such as time (between sequential video frames) and distance (in the current frame and in future frames).

The reason the graph is built is so that we can query it with graphs that represent what we are looking for. The first example in the paper (below) shows an example query (called $Q$). This query asks to "Find a person near an object then after some time or distance they are still near and there is a car".

![](http://i.imgur.com/6AKVCYX.png)

The goal now is to find the most similar subgraphs in the larger graph. The game here is to reduce the complexity of the search into something that is not as bad as the subgraph isomorphism problem. If anything, this problem is harder, because we want matches that are similar to the query and not necessarily exact.

They filter the larger graph (that represents the video) into a smaller graph that only includes nodes and edges that can match those in the query graph (this graph is called the coarse graph $C$). Another step is to filter the query $Q$ into a smaller graph $T$ which retains the nodes and edges that have the most discriminative power.

### WORK IN PROGRESS

dx.doi.org
sci-hub
scholar.google.com

United States Health Care Reform
Barack Obama
JAMA - 2016 via Local CrossRef
Keywords:

[link] Summary by Joseph Paul Cohen 10 years ago

This paper discusses the impact of the Affordable Care Act (ACA) on the United States. First there is discussion regarding what led up to health care reform followed by the impacts of the bill and the current trends.

One motivation was that the "US system left more than 1 in 7 Americans without health insurance coverage in 2008." Also, there was an upward trend in how much the economy was spending on healthcare.

The first results are shown in Figure 1. The number of uninsured dropped after the ACA. Maybe this is the most significant takeaway, but it doesn't capture the quality and cost of care, which are also addressed in the paper.

![](https://i.imgur.com/qrQ7nU9.png)

The paper states "Before the ACA, the health care system was dominated by 'fee-for-service' payment systems, which often penalized health care organizations and health care professionals who find ways to deliver care more efficiently, while failing to reward those who improve the quality of care. The ACA has changed the health care payment system in several important ways."

The ACA modified payments for Medicare services and introduced a "value-based payment" system with the goal of reducing overall cost. This is shown in Figure 4, which plots the change in cost over each time period. Between 2000-2005 the costs per payer were increasing in all programs. Between 2005-2010 the Medicaid cost per person was decreasing, while the other programs were still growing, but more slowly. Between 2010-2014 the costs of both Medicare and Medicaid were in decline. Private insurance costs were still increasing, but more slowly than before.

![](https://i.imgur.com/knBTQxK.png)

Another interesting takeaway is that "[t]he rate of hospital-acquired conditions (such as adverse drug events, infections, and pressure ulcers) has declined by 17%." This is reflected in the drop in 30-day readmission rates shown in Figure 6.

![](https://i.imgur.com/SmcxfoB.png)

www.cv-foundation.org
scholar.google.com

Reinforcement Learning for Visual Object Detection
Mathe, Stefan and Pirinen, Aleksis and Sminchisescu, Cristian
Conference and Computer Vision and Pattern Recognition - 2016 via Local
Keywords:

[link] Summary by Joseph Paul Cohen 10 years ago

![](http://i.imgur.com/tIX6HQB.jpg)

The goal of this paper is to find a specific object in an image. Initially a region proposal algorithm is used to identify candidate regions containing objects. The goal is to avoid processing all of these candidates. The idea here is to use RL to identify the neighboring candidates that should be used as a base to transform to get the next coordinates. 

Starting from the center, all candidate windows that are overlapped by a radius around the center are evaluated with the RL policy $\pi$. The state input to the $\pi$ function is a combination of the features extracted from a CNN as well as values to track the state of the search, such as how many candidates have been evaluated. The candidate that is selected has its features extracted and these features are then transformed into coordinates of where to look next. Then the processing is repeated for that next point until a proper classification is made or the algorithm decides to stop.

varcity.eu
sci-hub
www.cv-foundation.org
scholar.google.com

Scale-Aware Alignment of Hierarchical Image Segmentation
Chen, Yuhua and Dai, Dengxin and Pont-Tuset, Jordi and Gool, Luc Van
Conference and Computer Vision and Pattern Recognition - 2016 via Local
Keywords: computer, machine, learning, vision

[link] Summary by Joseph Paul Cohen 10 years ago

They represent an image as a tree where leaves are pixels and nodes represent clusters of those pixels. They train by regressing, for some possible segmented region $r$, on the following function for every segmentation example and ground truth:
$$S(r)=\frac{\\#(g) - \\#(r)}{\max(\\#(r), \\#(g))}$$

Here $\\#(g)$ is the number of pixels in the ground truth and $\\#(r)$ is the number of pixels in the example segmentation. What is not explained here is what other information is used, because it cannot simply be pixel counts. This function is used to rank the nodes in every path from the root to the leaves in Figure (a).
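
A tiny worked example of the score, as a sketch that only uses the pixel counts from the formula above (the function and argument names are made up here):

```python
def scale_score(num_pixels_gt, num_pixels_region):
    """S(r) = (#(g) - #(r)) / max(#(r), #(g))."""
    return (num_pixels_gt - num_pixels_region) / max(num_pixels_region, num_pixels_gt)

too_small = scale_score(100, 50)    # region covers half of the ground truth
exact = scale_score(100, 100)       # region has the same size as the ground truth
too_large = scale_score(100, 200)   # region is twice the size of the ground truth
```

The score is positive when the region is smaller than the ground truth, zero when the sizes match, and negative when the region is larger, and it always stays strictly between -1 and 1 for non-empty regions.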

The idea for the segmentation is that there is some set of nodes such that you can draw a line shown in Figure (b) which is equivalent to selecting a segmentation. The paper goes on to compute this using a dynamic programming solution based on the fact that the same pixel segmentations will be considered multiple times.

![](http://i.imgur.com/FEky9dK.png)

I think the idea is great but the initial idea for the regression is unclear.

papers.nips.cc
scholar.google.com

Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (as rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1  \\\\
4 & 5 & 1 \\\\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First let's do this without any non-negative constraint, using an SVD and reshaping the matrices after dropping all but the largest singular value:


$$
W = \left[\begin{array}{c c c}
-0.656 \\\
 -0.652 \\\
 -0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.48 & -6.26 & -3.20\\\\
\end{array}\right]
$$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c c c}
0.388 \\\\
0.386 \\\\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41  \\\\
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\\
4.33 & 4.08 & 2.09 \\\
2.52 & 2.37 & 1.21 \\\
\end{array}\right]
$$


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is discussed in contrast to an additive gradient descent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$, i.e. $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$.
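
Here is a minimal sketch of the multiplicative updates for the squared error objective, run on the example matrix $V$ from above. It uses plain Python lists; the helper names, the random initialization range, the iteration count, and the small `eps` guard against division by zero are all assumptions of this sketch rather than details from the paper.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, r, iterations=200, seed=0, eps=1e-12):
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    # Strictly positive random starting point.
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    errors = [squared_error(V, W, H)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
        errors.append(squared_error(V, W, H))
    return W, H, errors

V = [[5, 4, 1], [4, 5, 1], [2, 1, 5]]
W, H, errors = nmf(V, r=1)
```

Since every update multiplies a non-negative entry by a non-negative ratio, $W$ and $H$ stay non-negative without any explicit projection step, and `errors` lets you check that the reconstruction error does not increase.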



### Still a draft

arxiv.org
scholar.google.com

Diversity Networks
Mariet, Zelda and Sra, Suvrit
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

The goal is to compress a neural network by figuring out the most significant neurons. They sample from a Determinantal Point Process (DPP) in order to find a set of neurons that have the most dissimilar activations, and then project the remaining neurons onto them in order to reduce the number of neurons overall.

DPPs define the probability of a subset of neurons as the volume of dissimilarity spanned by that subset, normalized over all possible subsets:

$$P(\text{subset } Y) = \frac{\det(L_Y)}{\det(L+I)}$$

More dissimilarity means higher probability. A simple sample of the neurons' outputs is taken over the training set.
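
To see how the formula rewards dissimilar neurons, here is a small sketch that evaluates it on a toy kernel. The kernel values are invented for illustration (neurons 0 and 1 are nearly identical, neuron 2 is unrelated), and the determinant helper is a plain Gaussian elimination written for this example.

```python
from itertools import combinations

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [list(map(float, row)) for row in M]
    n = len(A)
    result = 1.0  # the determinant of an empty matrix is 1
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(A[r][i]))
        if abs(A[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            A[i], A[pivot] = A[pivot], A[i]
            result = -result
        result *= A[i][i]
        for r in range(i + 1, n):
            factor = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= factor * A[i][c]
    return result

def dpp_probability(L, subset):
    """P(Y) = det(L_Y) / det(L + I)."""
    n = len(L)
    L_Y = [[L[i][j] for j in subset] for i in subset]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return det(L_Y) / det(L_plus_I)

# Hypothetical similarity kernel over three neurons.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
p_similar = dpp_probability(L, (0, 1))
p_diverse = dpp_probability(L, (0, 2))
total = sum(dpp_probability(L, s) for k in range(4) for s in combinations(range(3), k))
```

The pair of near-duplicate neurons gets a much lower probability than the diverse pair, and the probabilities over all subsets add up to one.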

www.jmlr.org
scholar.google.com

Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier and Bengio, Yoshua
Journal of Machine Learning Research - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

The weights at each layer $W$ are initialized based on the number of connections they have. Each $w \in W$ is drawn from a Gaussian distribution with mean $\mu = 0$ and the following variance.

$$\text{Var}(W) = \frac{2}{n_\text{in}+ n_\text{out}}$$

Where $n_\text{in}$ is the number of neurons feeding into the layer (the previous layer in the feedforward direction) and $n_\text{out}$ is the number of neurons the layer feeds into (the previous layer in the backprop direction).
Reference: [Andy Jones's Blog](http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization)
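
A minimal sketch of this initialization, following the Gaussian form described above. The function name, the layer sizes, and the fixed seed are assumptions for this example.

```python
import math
import random

def glorot_normal(n_in, n_out, seed=0):
    """Draw an n_in x n_out weight matrix with mean 0 and variance 2 / (n_in + n_out)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

W = glorot_normal(300, 200)

# Empirical check: the sample variance should be close to 2 / 500 = 0.004.
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
```

Wider layers get smaller weights, which is the point: the scale shrinks as the number of connections grows so that activations and gradients keep a similar magnitude from layer to layer.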

doi.acm.org
sci-hub
scholar.google.com

How to Share a Secret
Shamir, Adi
Communications of the ACM - 1979 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This paper defines a scheme to share a secret among a group of people such that at least $k$ of them, no fewer, must combine their secret keys in order to obtain the shared secret. The secret is stored as the constant term $a_0$ of a polynomial whose remaining coefficients $a_1, \ldots, a_{k-1}$ are chosen at random:

$$f(x)=a_0+a_1x+a_2x^2+\cdots+a_{k-1}x^{k-1}$$

Polynomials have the property that 2 points are sufficient to define a line, 3 points are sufficient to define a parabola, 4 points to define a cubic curve, etc. It takes $k$ points to define a polynomial of degree $k-1$.

You can then give out pairs of input $x$ and output $f(x)$, one pair per person. Given $k$ unique pairs you can determine what the coefficients were, and therefore the secret. But only having $k-1$ pairs leaves a free variable, and without added information it is impossible to know the coefficients. This means at least $k$ people must provide their pairs in order to determine the secret!
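
Here is a small sketch of the scheme. As in the paper, arithmetic is done modulo a prime and the secret is the constant term of the polynomial; the particular prime, the function names, and the choice of $x = 1, \ldots, n$ as share inputs are assumptions of this example. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime; any prime larger than the secret would do

def make_shares(secret, k, n, rng=None):
    """Split `secret` into n shares (x, f(x)) so that any k of them recover it."""
    rng = rng or random.SystemRandom()
    # f(x) = secret + a_1 x + ... + a_{k-1} x^{k-1} with random a_1..a_{k-1}
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]

    def f(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule
            acc = (acc * x + a) % PRIME
        return acc

    return [(x, f(x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation evaluated at x = 0, which yields the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = make_shares(secret=1234, k=3, n=5)
recovered = recover_secret(shares[:3])
```

Any three of the five shares give back `1234`, while interpolating through only two of them produces an unrelated value.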

dx.doi.org
sci-hub
scholar.google.com

A Nonlinear Mapping for Data Structure Analysis
Sammon, J. W.
IEEE Computer Society IEEE Trans. Comput. - 1969 via Local Bibsonomy
Keywords: visualization, dimensionality_reduction

[link] Summary by Joseph Paul Cohen 10 years ago

This paper presents what is known as `Sammon's mapping`. This method produces points in any $\mathbb{R}^n$ space using only a distance function between points. You can define any distance function $d^*$ that represents relationships between points. This function can even be non-symmetric. The power is that any relationship encoded into a distance function or distance matrix can be visualized.

To map $n$ points from one space into another, the algorithm starts by generating $n$ random points in the space (called d-space) that you would like to map the points to. You can just pick these at random because they will be moved later.

The algorithm then performs gradient descent to minimize the *Sammon's stress* which can also be called the objective function.

$$
\text{Sammon's stress} = \frac{1}{
\sum\limits_{i<j} d^{*}_{ij}} 
\sum_{i<j}
\frac{ ( d^{*}_{ij}-d_{ij})^2}
{d^{*}_{ij}}
$$

To minimize this objective function a partial derivative is taken with respect to each coordinate of each point in d-space. Each coordinate $y_{pq}$ (coordinate $q$ of point $p$) is updated using a scaled partial derivative $\Delta_{pq}$ and a step size. The paper calls the step size a "magic factor" MF, but it is referred to today as a learning rate $\lambda$.

$$y_{pq}' = y_{pq}-\lambda \Delta_{pq}$$

The partial is scaled by the second derivative:

$$\Delta_{pq}=\left.\frac{\partial E}{\partial y_{pq}} \middle/  \frac{\partial^2 E}{\partial y_{pq}^2}\right.$$

Using the second derivative might be overkill for this. The objective function could also be minimized using only the first derivative. Newer update rules for stochastic optimization like \cite{conf/colt/DuchiHS10} or \cite{journals/corr/KingmaB14} may be more efficient.
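
As a sketch of the first-derivative-only variant suggested here (not the paper's second-order update), the following code minimizes Sammon's stress with plain gradient descent and halves the step size whenever a step would increase the stress. It assumes a symmetric distance matrix with positive off-diagonal entries; the function names, the starting step size, and the toy points are made up for this example.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sammon_stress(D, Y):
    n = len(Y)
    c = sum(D[i][j] for i in range(n) for j in range(i + 1, n))
    return sum((D[i][j] - dist(Y[i], Y[j])) ** 2 / D[i][j]
               for i in range(n) for j in range(i + 1, n)) / c

def sammon_gradient(D, Y):
    n, dim = len(Y), len(Y[0])
    c = sum(D[i][j] for i in range(n) for j in range(i + 1, n))
    grad = [[0.0] * dim for _ in range(n)]
    for p in range(n):
        for j in range(n):
            if j == p:
                continue
            d = max(dist(Y[p], Y[j]), 1e-12)  # avoid division by zero
            coef = -2.0 / c * (D[p][j] - d) / (D[p][j] * d)
            for k in range(dim):
                grad[p][k] += coef * (Y[p][k] - Y[j][k])
    return grad

def sammon_map(D, dim=2, iterations=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    Y = [[rng.random() for _ in range(dim)] for _ in range(len(D))]
    stress = sammon_stress(D, Y)
    history = [stress]
    for _ in range(iterations):
        grad = sammon_gradient(D, Y)
        candidate = [[y - lr * g for y, g in zip(row, grow)] for row, grow in zip(Y, grad)]
        new_stress = sammon_stress(D, candidate)
        if new_stress < stress:
            Y, stress = candidate, new_stress  # accept the step
        else:
            lr *= 0.5  # step was too large, shrink it and try again
        history.append(stress)
    return Y, history

# Five points in 3D mapped down to 2D.
X = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
D = [[dist(a, b) for b in X] for a in X]
Y, history = sammon_map(D)
```

`history` records the stress after every iteration, so it can be plotted to check convergence or compared against the second-order update from the paper.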

dx.doi.org
sci-hub
scholar.google.com

Multilayer feedforward networks are universal approximators
Hornik, Kurt and Stinchcombe, Maxwell B. and White, Halbert
Neural Networks - 1989 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This paper discusses the universal approximation theorem which states: There is a single hidden layer feedforward network that approximates any measurable function to any desired degree of accuracy.

For any unknown function $f(x)$ there exists a single hidden layer feedforward network $F(x)$ such that $  | F( x ) - f ( x ) | < \epsilon$ for some number of hidden units. 

$F(x)$ takes the following form where $h$ is some nonlinear activation function (relu, tanh, sigmoid). $w_i$ is a vector and $b_i$ and $v_i$ are scalars.

$$  F( x ) =
  \sum_{i=1}^{N} v_i h( w_i x + b_i)$$
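To make the form of $F(x)$ concrete, here is a small NumPy sketch. The helper names are made up for this example, and the weights are chosen by hand rather than learned: two ReLU hidden units are enough to represent $|x|$ exactly, since $|x| = relu(x) + relu(-x)$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def single_hidden_layer(x, w, b, v, h=relu):
    """F(x) = sum_i v_i * h(w_i . x + b_i), evaluated for a batch of inputs x."""
    return h(x @ w.T + b) @ v

# Hand-picked parameters (an illustration, not a trained network).
w = np.array([[1.0], [-1.0]])   # one weight vector w_i per hidden unit
b = np.zeros(2)                 # one bias b_i per hidden unit
v = np.ones(2)                  # one output weight v_i per hidden unit

x = np.array([[-2.0], [0.5], [3.0]])
print(single_hidden_layer(x, w, b, v))   # the absolute value of each input
```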


Resources: 

http://deeplearning.cs.cmu.edu/notes/Sonia_Hornik.pdf

http://neuralnetworksanddeeplearning.com/chap4.html

dx.doi.org
sci-hub
scholar.google.com

Wrappers for Feature Subset Selection
Kohavi, Ron and John, George H.
Artificial Intelligence Journal - 1997 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

Feature subset selection can be categorized into embedded approaches, filter approaches, and wrapper approaches. This paper presents the wrapper subset selection problem and some algorithms to obtain good subsets. Wrapper subset selection methods are black-box optimization techniques. 

First let's look at what the wrapper search space looks like in the figure below. We want to find a subset of features which maximizes the performance of our classification model, so each node in the graph is a subset of all the features. For a set of $n$ features there are $2^n$ unique subsets. A wrapper method approaches the problem by only looking at the graph structure and evaluating nodes during a search to determine how well they perform. The edges in the graph represent adding or removing one feature from the subset.

![](http://i.imgur.com/is9WLJ9.png)

Kohavi presents two search algorithms: hill-climbing and best-first search. Hill-climbing (aka greedy) evaluates all neighbor nodes and picks the best one to start searching from next. Best-first evaluates $k$ neighbors and then picks the best one.
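The hill-climbing search can be sketched in a few lines of Python. This is only an illustration: `score` is a hypothetical stand-in for training and evaluating a classifier on a feature subset, and starting from the empty subset is an assumption of this example.

```python
def neighbors(subset, n_features):
    """All subsets reachable by adding or removing exactly one feature."""
    for feature in range(n_features):
        yield subset ^ frozenset([feature])   # toggle one feature

def hill_climb(score, n_features):
    """Greedy search: move to the best neighbor until nothing improves."""
    current = frozenset()
    current_score = score(current)
    while True:
        best = max(neighbors(current, n_features), key=score)
        best_score = score(best)
        if best_score <= current_score:
            return current, current_score
        current, current_score = best, best_score

# Toy black-box score: features 0 and 2 help, every other feature costs a little.
def toy_score(subset):
    useful = frozenset({0, 2})
    return len(subset & useful) - 0.1 * len(subset - useful)

print(hill_climb(toy_score, n_features=4))
```

In a real wrapper the score would typically be a cross-validated accuracy, which makes each node evaluation expensive and is why the search strategy matters.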

www.jmlr.org
scholar.google.com

An Introduction to Variable and Feature Selection
Guyon, Isabelle and Elisseeff, André
Journal of Machine Learning Research - 2003 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

"The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."

www.jmlr.org
scholar.google.com

Overfitting in Making Comparisons Between Variable Selection Methods
Reunanen, Juha
Journal of Machine Learning Research - 2003 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This paper discusses an important bias in the evaluation of methods using cross-validation. A method that makes decisions based on cross-validation can appear to increase overall performance by simply exploiting the bias of cross-validation rather than solving the real problem.

cseweb.ucsd.edu
scholar.google.com

Large Margin Classification Using the Perceptron Algorithm
Freund, Yoav and Schapire, Robert E.
Kluwer Academic Publishers Machine Learning - 1999 via Local Bibsonomy
Keywords: original, perceptron, kernel, shapire, averaged, voted

[link] Summary by Joseph Paul Cohen 10 years ago

This extends perceptrons \cite{books/daglib/0066902} but uses what is known as the Hinge loss (aka SVM loss):

$$J_i(w) = \max(0,\gamma - y_i f(x_i))$$

Where $\gamma$ is the margin. $J_i(w)$ is the error given some weight parameters $w$. $x_i$ and $y_i$ are a training example and its correct label. $f(x_i)$ is the perceptron function we are trying to learn the best weights for.
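A minimal NumPy sketch of this loss, assuming `scores` holds the raw outputs $f(x_i)$ and `labels` holds $y_i \in \{-1, 1\}$ (both names are invented for the example):

```python
import numpy as np

def hinge_loss(scores, labels, margin=1.0):
    """Per-example hinge loss max(0, margin - y_i * f(x_i))."""
    return np.maximum(0.0, margin - labels * scores)

scores = np.array([2.0, 0.5, 0.5])
labels = np.array([1, 1, -1])
print(hinge_loss(scores, labels))
```

The first example is correct and outside the margin so it costs nothing, the second is correct but inside the margin, and the third is on the wrong side.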

scholar.google.com

Perceptrons - an introduction to computational geometry
Minsky, Marvin and Papert, Seymour
MIT Press - 1987 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

### Perceptron Classification

The function of the perceptron takes this form for some weight vector $\vec{w}$ and bias scalar $b$. Given some input $x$ it will produce a binary prediction.

$$ f(x) = \left\{ \begin{matrix} 
1 & \text{if } (\vec{w} \cdot \vec{x} + b > 0)  \\
-1 & \text{otherwise}  \\
\end{matrix}\right. $$

### Perceptron Learning

The values $w$ and $b$ for this function are learned from the sample data by minimizing the misclassification error of predictions. Our sample data is in the form $(x_i,y_i)$ where $y_i$ is the correct label (1 or -1). If the output of $f(x_i)$ is equal to $y_i$ then the product $-y_i f(x_i)$ will be -1. If it is incorrect it will be 1. So we can take the $max$ of 0 and this product and then sum them all to get how bad $w$ and $b$ are! $J_i(w,b)$ is the error for that one example. We can average these to get the error over all samples.

$$J_i(w,b) = \max(0,-y_i f(x_i))$$

$$J(w,b) = \frac{1}{N} \sum_{i=1}^N \max(0,-y_i f(x_i))$$

To apply gradient descent to this problem we calculate the gradient of $J_i(w,b)$ with respect to each $w_j \in w$ so we know how to adjust it to minimize $J_i(w,b)$. Because we have a $max$ this gradient is annoying and has a split. Taking the derivative through the linear activation $\vec{w} \cdot \vec{x}_i + b$ inside $f$:

$$ \frac{\partial J_i}{\partial w_j}= 
\left\{ \begin{matrix} 
0 & \text{if } y_i (\vec{w} \cdot \vec{x}_i + b) > 0  \\
-y_i x_{ij} & \text{otherwise}  \\
\end{matrix}\right. $$

This gradient $\frac{\partial J_i}{\partial w_j}$ is then used to adjust $w_j$. Subtracting $\frac{\partial J_i}{\partial w_j}$ from $w_j$ adjusts the output of $f(x_i)$ such that the error $J_i(w,b)$ is reduced. Generally, subtracting the full gradient will not result in the minimal error, so only a fraction $\lambda$ of the gradient is subtracted. A rate of $0.05$ is a typical choice, but this value is still a point of debate and is generally set by experience.
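The learning procedure can be sketched as follows. This is an illustration under a few assumptions: the gradient is taken through the linear activation, misclassified examples are updated one at a time, the bias is updated the same way as the weights, and the function names and stopping rule are choices made for this example.

```python
import numpy as np

def predict(X, w, b):
    """Perceptron output: 1 if w . x + b > 0, otherwise -1."""
    return np.where(X @ w + b > 0, 1, -1)

def train_perceptron(X, y, learning_rate=0.05, max_epochs=1000):
    """Subtract a fraction of the gradient for every misclassified example."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w + b) <= 0:          # wrong side of (or on) the boundary
                w += learning_rate * y_i * x_i    # w_j -= lambda * (-y_i * x_ij)
                b += learning_rate * y_i
                mistakes += 1
        if mistakes == 0:                         # every example is classified correctly
            break
    return w, b

# Toy linearly separable data: logical AND with labels in {-1, 1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(predict(X, w, b))
```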

dx.doi.org
sci-hub
scholar.google.com

Multi-scale Convolutional Neural Networks for Lung Nodule Classification
Shen, Wei and Zhou, Mu and Yang, Feng and Yang, Caiyun and Tian, Jie
Information Processing in Medical Imaging - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

They apply a CNN to detect nodules in a 2D section of a CT scan. The network has three input images at different scales. The networks are joined at a feature layer before a final output layer.

dx.doi.org
sci-hub
scholar.google.com

Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer that scales the input so the images don't have to be scaled before being sent as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each incoming gradient to the input value that was selected as the argmax of its sub-window.
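The quoted description of RoI max pooling can be sketched for a single 2D feature map as follows. The function name, the `(top, left, h, w)` RoI layout, and the floor/ceil choice for sub-window boundaries are assumptions of this example rather than details taken from the paper's implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map into an out_h x out_w grid.

    roi is (top, left, h, w); this sketch assumes h >= out_h and w >= out_w.
    """
    top, left, h, w = roi
    window = feature_map[top:top + h, left:left + w]
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        r0 = (i * h) // out_h                 # floor of the sub-window start
        r1 = -((-(i + 1) * h) // out_h)       # ceil of the sub-window end
        for j in range(out_w):
            c0 = (j * w) // out_w
            c1 = -((-(j + 1) * w) // out_w)
            pooled[i, j] = window[r0:r1, c0:c1].max()
    return pooled

feature_map = np.arange(16).reshape(4, 4)
print(roi_max_pool(feature_map, roi=(0, 0, 4, 4), out_h=2, out_w=2))
```

Whatever the size of the RoI, the output always has the same $H \times W$ shape, which is what lets differently sized regions feed the same fully connected layers.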

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}

doi.acm.org
sci-hub
scholar.google.com

Academic Torrents: A Community-Maintained Distributed Repository
Cohen, Joseph Paul and Lo, Henry Z.
The Extreme Science and Engineering Discovery Environment Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

Academic Torrents is a BitTorrent service that aims to make it easy for academics to share data via BitTorrent. Specific use cases are during competitions where everyone needs access to data quickly. Also, when a dataset is not available anymore the data can be shared from simple desktop computers and become available globally.

dx.doi.org
sci-hub
scholar.google.com

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Girshick, Ross B. and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra
Conference on Computer Vision and Pattern Recognition - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

The R-CNN method is a way to localize objects in an image. It is restricted to finding one of each object in an image. 

1. Regions are generated based on any method including brute force sliding window.
2. Each region is classified using AlexNet.
3. The classifications for each label are searched to find the location which expresses that label the most.

dx.doi.org
sci-hub
scholar.google.com

Identifying Syntactic differences Between Two Programs
Yang, Wuu
Software: Practice and Experience Journal - 1991 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This is a dynamic programming algorithm known as *Simple Tree Matching*:

```
int simple_tree_match(a,b){

    // trees whose roots carry different symbols cannot be matched
    if (root of a != root of b) return 0

    m = the number of first-level sub-trees of a
    n = the number of first-level sub-trees of b
    M[i,0] := 0 for i = 0,...,m
    M[0,j] := 0 for j = 0,...,n
    for(i := 1 to m){
        for(j := 1 to n){
            x := M[i,j-1]
            y := M[i-1,j]
            z := M[i-1,j-1] + simple_tree_match(a_i,b_j)
            M[i,j] := max(x,y,z)
        }
    }
    // +1 counts the matched pair of roots
    return M[m,n] + 1
}
```
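For a worked example, here is a runnable Python sketch of the same recursion. Representing a tree as a `(label, children)` tuple is an assumption of this example, and the HTML-like labels are made up.

```python
def simple_tree_match(a, b):
    """Number of node pairs in a maximum top-down matching of trees a and b."""
    label_a, children_a = a
    label_b, children_b = b
    if label_a != label_b:            # different roots: nothing can match
        return 0
    m, n = len(children_a), len(children_b)
    # M[i][j] is the best matching between the first i and first j sub-trees.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i][j - 1],
                M[i - 1][j],
                M[i - 1][j - 1] + simple_tree_match(children_a[i - 1], children_b[j - 1]),
            )
    return M[m][n] + 1                # +1 for the matched roots

page_a = ("html", [("head", []), ("body", [("div", []), ("p", [])])])
page_b = ("html", [("body", [("p", [])])])
print(simple_tree_match(page_a, page_b))
```

Here `html`, `body`, and `p` can be paired while `head` and `div` have no partner, so three node pairs match. Matching a tree against itself returns its number of nodes.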

dx.doi.org
sci-hub
scholar.google.com

Comparing semantically-blind and semantically-aware landscape similarity measures with application to query-by-content and regionalization
Stepinski, Tomasz F. and Cohen, Joseph Paul
Ecological Informatics Journal - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This paper talks about how to compare National Land Cover Database data which can be represented as histograms. The challenge is scaling the computations. The question this paper asks is whether a semantically aware histogram comparison is worth the extra computation. It turns out that it does not appear to be worth it, but interesting findings are discussed. 

![](http://i.imgur.com/5mKnkQw.png)

arxiv.org
scholar.google.com

Going Deeper with Convolutions
Szegedy, Christian and Liu, Wei and Jia, Yangqing and Sermanet, Pierre and Reed, Scott and Anguelov, Dragomir and Erhan, Dumitru and Vanhoucke, Vincent and Rabinovich, Andrew
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This paper introduces the GoogLeNet Inception architecture. The major part of this paper is the *Inception Module*, which applies convolutions at multiple scales in parallel and provides a good receptive field while reducing the overall number of parameters.

![Inception Module](http://i.imgur.com/CfmUmUB.png)

arxiv.org
scholar.google.com

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This summary is as ridiculous as this network is long. A good implementation of the network is here: https://github.com/dmlc/mxnet/blob/master/example/image-classification/symbol_resnet-28-small.py


Here is a visualization of this crazy network:

![](http://josephpcohen.com/w/wp-content/uploads/resnet-28-small.png)

jmlr.org
scholar.google.com

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Ioffe, Sergey and Szegedy, Christian
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

*Batch Normalization* is applied immediately after fully connected layers and adjusts the values of the feedforward output so that they are centered to zero mean and have unit variance.
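A minimal sketch of the training-time forward pass in NumPy, assuming the input is a `(batch, features)` array. The learned scale `gamma` and shift `beta` follow the paper; the function name is made up, and the running statistics used at test time are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
layer_output = rng.normal(loc=3.0, scale=5.0, size=(64, 4))
normalized = batch_norm(layer_output, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0), normalized.var(axis=0))
```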

It has been used by famous Convolutional Neural Networks such as GoogLeNet \cite{journals/corr/SzegedyLJSRAEVR14} and ResNet \cite{journals/corr/HeZRS15}

arxiv.org
scholar.google.com

Adam: A Method for Stochastic Optimization
Kingma, Diederik P. and Ba, Jimmy
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

Adam is like RMSProp with momentum. The (simplified) update [[Stanford CS231n]](https://cs231n.github.io/neural-networks-3/#ada) looks as follows:

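```python
# Assume the gradient dx, parameter vector x, and running averages m and v (initially zero)
m = beta1*m + (1-beta1)*dx        # momentum-style average of the gradient
v = beta2*v + (1-beta2)*(dx**2)   # RMSProp-style average of the squared gradient
x += - learning_rate * m / (np.sqrt(v) + eps)
```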
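The simplified update leaves out the bias correction that the paper applies to `m` and `v` while they warm up from zero. A self-contained sketch with that correction, run on a toy quadratic, might look like this. The function name, the test function, and the hyperparameter values are assumptions of the example.

```python
import numpy as np

def adam_minimize(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Run Adam with bias-corrected moment estimates and return the final x."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        dx = grad(x)
        m = beta1 * m + (1 - beta1) * dx
        v = beta2 * v + (1 - beta2) * dx ** 2
        m_hat = m / (1 - beta1 ** t)          # correct the bias toward zero
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=[0.0])
print(x_min)
```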

arxiv.org
scholar.google.com

Wireless Message Dissemination via Selective Relay over Bluetooth (MDSRoB)
Cohen, Joseph Paul
arXiv e-Print archive - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This paper proposes a method to send messages between cell phones over Bluetooth by using the device name field. This allows devices to communicate directly with each other without pairing.

colt2010.haifa.il.ibm.com
sci-hub
scholar.google.com

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Duchi, John C. and Hazan, Elad and Singer, Yoram
Conference on Learning Theory - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This is Adagrad. Adagrad is an adaptive learning rate method. Some sample code from  [[Stanford CS231n]](https://cs231n.github.io/neural-networks-3/#ada) is:

```python
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```
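To see the adaptive step size in action, the same update can be wrapped in a loop on a toy problem. The function name, the quadratic objective, and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def adagrad_minimize(grad, x0, learning_rate=1.0, eps=1e-8, steps=200):
    """Run Adagrad and return the final parameter vector."""
    x = np.asarray(x0, dtype=float).copy()
    cache = np.zeros_like(x)
    for _ in range(steps):
        dx = grad(x)
        cache += dx ** 2                      # accumulated squared gradients
        x -= learning_rate * dx / (np.sqrt(cache) + eps)
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = adagrad_minimize(lambda x: 2.0 * (x - 3.0), x0=[0.0])
print(x_min)
```

Because `cache` only grows, the effective learning rate of every parameter shrinks over time, which is the behavior RMSProp and Adam relax by using a decaying average instead.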

arxiv.org
scholar.google.com

Semi-Supervised Web Wrapper Repair via Recursive Tree Matching
Cohen, Joseph Paul and Ding, Wei and Bagherjeiran, Abraham
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

This idea is so badass! It uses Simple Tree Matching \cite{journals/spe/Yang91} and extends it to work with HTML and then recursively searches an unseen document to align it with previously seen examples. An overview of the problem of *shift* can be seen on the left of the figure below and  the alignment is shown on the right.

![](http://i.imgur.com/b8EzP42.png)

dx.doi.org
sci-hub
scholar.google.com

Prediction gradients for feature extraction and analysis from convolutional neural networks
Lo, Henry Z. and Cohen, Joseph Paul and Ding, Wei
Conference on Automatic Face and Gesture Recognition - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

The prediction gradient is just $\frac{\partial \mathbf{y}}{\partial w}$ where $\mathbf{y}$ is the output before the loss function.

arxiv.org
scholar.google.com

RandomOut: Using a convolutional gradient norm to win The Filter Lottery
Cohen, Joseph Paul and Lo, Henry Z. and Ding, Wei
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 10 years ago

Basically they observe a pattern they call The Filter Lottery (TFL) where the random seed causes a high variance  in the training accuracy:

![](http://i.imgur.com/5rWig0H.png)

They use the convolutional gradient norm ($CGN$) \cite{conf/fgr/LoC015} to determine how much impact a filter has on the overall classification loss function by taking the derivative of the loss function with respect to each weight in the filter.

$$CGN(k) = \sum_{i} \left|\frac{\partial L}{\partial w^k_i}\right|$$

They use the CGN to evaluate the impact of a filter on error, and re-initialize filters when the gradient norm of its weights falls below a specific threshold.
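The CGN and the re-initialization rule can be sketched with NumPy as follows. The gradients are assumed to be given as an array with one leading entry per filter; the function names, the threshold value, and the Gaussian re-initialization are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def convolutional_gradient_norm(filter_grads):
    """CGN(k): sum of |dL/dw| over all weights of filter k."""
    return np.abs(filter_grads).reshape(len(filter_grads), -1).sum(axis=1)

def reinit_weak_filters(weights, filter_grads, threshold, rng):
    """Re-draw every filter whose CGN falls below the threshold."""
    weak = convolutional_gradient_norm(filter_grads) < threshold
    new_weights = weights.copy()
    new_weights[weak] = rng.normal(scale=0.01, size=new_weights[weak].shape)
    return new_weights, weak

# Three 2x2 filters; the second one receives no gradient at all.
grads = np.array([
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
])
weights = np.ones_like(grads)
new_weights, weak = reinit_weak_filters(weights, grads, threshold=0.1,
                                        rng=np.random.default_rng(0))
print(convolutional_gradient_norm(grads), weak)
```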

Joseph Paul Cohen

sciscore: 1.582