ShortScience.org - Making Science Accessible!


Towards Robust Evaluations of Continual Learning
Sebastian Farquhar and Yarin Gal
arXiv e-Print archive - 2018 via Local arXiv
Keywords: stat.ML, cs.LG

[link] Summary by Natalia Diaz Rodriguez, PhD 7 years ago

Through a likelihood-focused derivation of a variational inference (VI) loss, Variational Generative Experience Replay (VGER) presents the closest appropriate likelihood-focused alternative to Variational Continual Learning (VCL), the state-of-the-art prior-focused approach to continual learning.


In non-continual learning, the aim is to learn parameters $\omega$ using labelled training data $\mathcal{D}$ to infer $p(y|\omega, x)$. In the continual learning context, by contrast, the data is not independently and identically distributed (i.i.d.), but may be split into separate tasks $\mathcal{D}_t = (X_t, Y_t)$ whose examples $x_t^{n_t}$ and $y_t^{n_t}$ are assumed to be i.i.d.


In \cite{Farquhar18}, the loss at time $t$ cannot be estimated for previously discarded datasets. To approximate the distribution of past datasets $p_t(x,y)$, VGER trains a GAN $q_t(x, y)$ to produce ($\hat{x}, \hat{y}$) pairs for each class in each dataset as it arrives (the generator is kept, while the data is discarded after each dataset is used). The variational free energy $\mathcal{F}_T$ is then used to train on dataset $\mathcal{D}_T$ augmented with samples generated by the GAN. In this way the prior is set as the posterior approximation from the previous task.
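
The replay step described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the GAN is swapped for a hypothetical per-class Gaussian generator (`GaussianClassGenerator`), the tasks are toy data, and the actual model training on the variational free energy is left as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianClassGenerator:
    """Hypothetical stand-in for the per-task GAN q_t(x, y): one Gaussian per class."""

    def __init__(self, X, y):
        self.classes = np.unique(y)
        self.means = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.stds = {c: X[y == c].std(axis=0) + 1e-6 for c in self.classes}

    def sample(self, n_per_class):
        xs, ys = [], []
        for c in self.classes:
            dim = len(self.means[c])
            xs.append(rng.normal(self.means[c], self.stds[c], size=(n_per_class, dim)))
            ys.append(np.full(n_per_class, c))
        return np.concatenate(xs), np.concatenate(ys)

def augment_with_replay(X_t, y_t, generators, n_per_class):
    """Current task data plus generated (x_hat, y_hat) pairs from all earlier tasks."""
    Xs, ys = [X_t], [y_t]
    for g in generators:
        x_hat, y_hat = g.sample(n_per_class)
        Xs.append(x_hat)
        ys.append(y_hat)
    return np.concatenate(Xs), np.concatenate(ys)

# Two toy tasks with disjoint classes: {0, 1} first, then {2, 3}.
generators = []
for classes in [(0, 1), (2, 3)]:
    y_t = rng.choice(classes, size=100)
    X_t = rng.normal(loc=y_t[:, None], scale=0.1, size=(100, 2))
    X_aug, y_aug = augment_with_replay(X_t, y_t, generators, n_per_class=20)
    # ... train the model on (X_aug, y_aug) here ...
    generators.append(GaussianClassGenerator(X_t, y_t))  # keep the generator, discard the data

print(X_aug.shape, np.unique(y_aug))
```

The point of the sketch is the bookkeeping: only generators survive between tasks, and each new task is trained on its own data plus samples standing in for everything seen before.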


SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models
Yucen Luo, Alex Beatson, Mohammad Norouzi, Jun Zhu, David Duvenaud, Ryan P. Adams, Ricky T. Q. Chen
International Conference on Learning Representations - 2020 via Local Manual
Keywords:

[link] Summary by Chin-Wei 6 years ago

In this note, I'll implement the [Stochastically Unbiased Marginalization Objective (SUMO)](https://openreview.net/forum?id=SylkYeHtwr) to estimate the log-partition function of an energy function.

Estimation of the log-partition function has many important applications in machine learning. Take latent variable models or Bayesian inference, for example. The log-partition function of the posterior distribution $$p(z|x)=\frac{1}{Z}p(x|z)p(z)$$ is the log-marginal likelihood of the data $$\log Z = \log \int p(x|z)p(z)dz = \log p(x)$$.

More generally, let $U(x)$ be some energy function which induces some density function $p(x)=\frac{e^{-U(x)}}{\int e^{-U(x)} dx}$. The common practice is to look at a variational form of the log-partition function, 
$$
\log Z = \log \int e^{-U(x)}dx = \max_{q(x)}\mathbb{E}_{q}[-U(x)-\log q(x)] \nonumber
$$
Plugging in an arbitrary $q$ would normally yield a strict lower bound, which means 
$$
\frac{1}{n}\sum_{i=1}^n \left(-U(x_i) - \log q(x_i)\right) \nonumber
$$
for $x_i$ sampled *i.i.d.* from $q$, would be a biased estimate for $\log Z$. In particular, it would be an underestimation. 
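
To spell out why the bound holds, assuming $q(x) > 0$ wherever $e^{-U(x)} > 0$: write $\log Z$ as the log of an expectation under $q$ and apply Jensen's inequality. The gap is exactly the KL divergence from $q$ to $p$, so the estimate is unbiased only when $q = p$.
$$
\log Z = \log \mathbb{E}_{q}\left[\frac{e^{-U(x)}}{q(x)}\right] \geq \mathbb{E}_{q}\left[-U(x) - \log q(x)\right], \qquad \log Z - \mathbb{E}_{q}\left[-U(x) - \log q(x)\right] = \mathrm{KL}(q \,\|\, p) \geq 0 \nonumber
$$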



To see this, let's define the energy function $U$ as follows:
$$
U(x_1,x_2)= - \log \left(\frac{1}{2}\cdot e^{-\frac{(x_1+2)^2 + x_2^2}{2}} + \frac{1}{2}\cdot\frac{1}{4}e^{-\frac{(x_1-2)^2 + x_2^2}{8}}\right) \nonumber
$$
It is not hard to see that $U$ is the energy function of a mixture of Gaussians $\frac{1}{2}\mathcal{N}([-2,0], I) + \frac{1}{2}\mathcal{N}([2,0], 4I)$ with a normalizing constant $Z=2\pi\approx6.28$ and $\log Z\approx1.8379$.

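```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal as mvn  # assumed source of `mvn`

def U(x):
  # Energy of the unnormalized two-component mixture; x has shape (n, 2).
  x1 = x[:,0]
  x2 = x[:,1]
  d2 = x2 ** 2
  return - np.log(np.exp(-((x1+2) ** 2 + d2)/2)/2 + np.exp(-((x1-2) ** 2 + d2)/8)/4/2)
```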



To visualize the density corresponding to the energy, $p(x)\propto e^{-U(x)}$, we evaluate it on a grid:

```python
xx = np.linspace(-5,5,200)
yy = np.linspace(-5,5,200)
X = np.meshgrid(xx,yy)
X = np.concatenate([X[0][:,:,None], X[1][:,:,None]], 2).reshape(-1,2)
unnormalized_density = np.exp(-U(X)).reshape(200,200)
plt.imshow(unnormalized_density)
plt.axis('off')
```

https://i.imgur.com/CZSyIQp.png



As a sanity check, let's also visualize the density of the mixture of Gaussians.

```python
N1, N2 = mvn([-2,0], 1), mvn([2,0], 4)
density = (np.exp(N1.logpdf(X))/2 + np.exp(N2.logpdf(X))/2).reshape(200,200)
plt.imshow(density)
plt.axis('off')
print(np.allclose(unnormalized_density / density - 2*np.pi, 0))
```

`True`

https://i.imgur.com/g4inQxB.png



Now if we estimate the log-partition function by estimating the variational lower bound, we get

```python
q = mvn([0,0],5)

xs = q.rvs(10000*5)
elbo = - U(xs) - q.logpdf(xs)
plt.hist(elbo, range(-5,10))
print("Estimate:  %.4f  / Ground true:  %.4f" % (elbo.mean(), np.log(2*np.pi)))
print("Empirical variance: %.4f" % elbo.var())
```

`Estimate:  1.4595  / Ground true:  1.8379`

`Empirical variance: 0.9921`

https://i.imgur.com/vFzutuY.png



The lower bound can be tightened via importance sampling:
$$
\log \int e^{-U(x)} dx \geq \mathbb{E}_{q^K}\left[\log\left(\frac{1}{K}\sum_{j=1}^K \frac{e^{-U(x_j)}}{q(x_j)}\right)\right] \nonumber
$$

> This bound is tighter for larger $K$ partly due to the [concentration of the average](https://arxiv.org/pdf/1906.03708.pdf) inside of the $\log$ function: when the random variable is more deterministic, using a local linear approximation near its mean is more accurate as there's less "mass" outside of some neighborhood of the mean.  



Now if we use this new estimator with $K=5$:

```python
k = 5
xs = q.rvs(10000*k)
elbo = - U(xs) - q.logpdf(xs)
iwlb = elbo.reshape(10000,k)
iwlb = np.log(np.exp(iwlb).mean(1))
plt.hist(iwlb, range(-5,10))
print("Estimate:  %.4f  / Ground true:  %.4f" % (iwlb.mean(), np.log(2*np.pi)))
print("Empirical variance: %.4f" % iwlb.var())
```

`Estimate:  1.7616  / Ground true:  1.8379`

`Empirical variance: 0.1544`

https://i.imgur.com/sCcsQd4.png

We see that both the bias and variance decrease. 
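
To see how the trend continues beyond $K=5$, here is a small self-contained sweep over $K$. It is a sketch under a few assumptions: the proposal is written out by hand as $\mathcal{N}(0, 5I)$ (matching `q` above), the helper names `log_q` and `iwae_estimates` are mine, and the log-mean-exp is computed in a numerically stable way rather than with a plain `np.exp`.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):
    # Same energy as above, written for inputs of shape (..., 2).
    x1, x2 = x[..., 0], x[..., 1]
    return -np.log(np.exp(-((x1 + 2) ** 2 + x2 ** 2) / 2) / 2
                   + np.exp(-((x1 - 2) ** 2 + x2 ** 2) / 8) / 8)

def log_q(x, var=5.0):
    # Log-density of the isotropic 2-D Gaussian proposal N(0, var * I).
    return -np.log(2 * np.pi * var) - (x ** 2).sum(-1) / (2 * var)

def iwae_estimates(K, n=20000, var=5.0):
    # n independent IWAE_K estimates, each built from K proposal samples.
    x = rng.normal(scale=np.sqrt(var), size=(n, K, 2))
    log_w = -U(x) - log_q(x, var)            # (n, K) log importance weights
    m = log_w.max(axis=1, keepdims=True)     # stable log-mean-exp
    return m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))

results = {K: iwae_estimates(K) for K in (1, 5, 50)}
for K, est in results.items():
    print("K=%2d  mean=%.4f  var=%.4f" % (K, est.mean(), est.var()))
print("log Z = %.4f" % np.log(2 * np.pi))
```

The means should climb toward $\log Z$ from below as $K$ grows, while never becoming unbiased for any finite $K$, which is the motivation for the next step.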



Finally, we use the [Stochastically Unbiased Marginalization Objective](https://openreview.net/pdf?id=SylkYeHtwr) (SUMO), which uses the *Russian Roulette* estimator to randomly truncate a telescoping series that converges in expectation to the log partition function. Let $\text{IWAE}_k = \log\left(\frac{1}{k}\sum_{j=1}^k \frac{e^{-U(x_j)}}{q(x_j)}\right)$ be the importance-weighted estimator, and $\Delta_k = \text{IWAE}_{k+1} - \text{IWAE}_k$ be the difference (which can be thought of as some form of correction). The SUMO estimator is defined as 
$$
\text{SUMO} = \text{IWAE}_1 + \sum_{k=1}^K \frac{\Delta_k}{\mathbb{P}(\mathcal{K}\geq k)} \nonumber
$$
where $K\sim P(K)=\mathbb{P}(\mathcal{K}=K)$. To see why this is an unbiased estimator,
$$
\begin{align*}
\mathbb{E}[\text{SUMO}] &= \mathbb{E}\left[\text{IWAE}_1 + \sum_{k=1}^K \frac{\Delta_k}{\mathbb{P}(\mathcal{K}\geq k)} \right] \nonumber\\
&= \mathbb{E}_{x's}\left[\text{IWAE}_1 + \mathbb{E}_{K}\left[\sum_{k=1}^K \frac{\Delta_k}{\mathbb{P}(\mathcal{K}\geq k)} \right]\right] \nonumber
\end{align*}
$$
The inner expectation can be further expanded by swapping the order of the two sums:
$$
\begin{align*}
\mathbb{E}_{K}\left[\sum_{k=1}^K \frac{\Delta_k}{\mathbb{P}(\mathcal{K}\geq k)} \right]
&= \sum_{K=1}^\infty P(K)\sum_{k=1}^K \frac{\Delta_k}{\mathbb{P}(\mathcal{K}\geq k)} \\
&= \sum_{k=1}^\infty \frac{\Delta_k}{\mathbb{P}(\mathcal{K}\geq k)} \sum_{K=k}^\infty P(K) \\
&= \sum_{k=1}^\infty \frac{\Delta_k}{\mathbb{P}(\mathcal{K}\geq k)} \mathbb{P}(\mathcal{K}\geq k) \\
&= \sum_{k=1}^\infty\Delta_k \\
&= \text{IWAE}_{2} - \text{IWAE}_1 + \text{IWAE}_{3} - \text{IWAE}_2 + ... = \lim_{k\rightarrow\infty}\text{IWAE}_{k}-\text{IWAE}_1
\end{align*}
$$
which shows $\mathbb{E}[\text{SUMO}] = \mathbb{E}[\text{IWAE}_\infty] = \log Z$. 

>  (N.B.) Some care needs to be taken when taking the limit. See the paper for a more formal derivation.
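
To make the Russian Roulette argument concrete, here is a toy check on a deterministic series rather than on IWAE itself. The series ($a_1 = 0$, increments $2^{-k}$, limit $1$) and the cap `k_max` are my own choices for illustration; the truncation distribution satisfies $\mathbb{P}(\mathcal{K}\geq k)=\frac{1}{k}$, so each kept increment is weighted by $k$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy series: a_1 = 0 and increments a_{k+1} - a_k = 2^{-k}, so a_k -> 1.
a_1 = 0.0
k_max = 64                                   # increments beyond this are below 1e-19
k = np.arange(1, k_max + 1)
delta = 0.5 ** k
# partial[K] = sum_{k=1}^{K} delta_k / P(K >= k), with P(K >= k) = 1/k
partial = np.concatenate([[0.0], np.cumsum(k * delta)])

n = 200000
u = rng.random(n)
K = np.ceil(u / (1 - u)).astype(np.int64)    # random truncation level, P(K >= k) = 1/k
estimates = a_1 + partial[np.minimum(K, k_max)]

print("mean of randomly truncated estimates: %.4f (limit is 1)" % estimates.mean())
print("deterministic truncation after 3 terms: %.4f" % (a_1 + delta[:3].sum()))
```

Each individual estimate only uses finitely many terms, yet the average recovers the limit of the infinite series, whereas any fixed truncation stays biased.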



A choice of $P(K)$ proposed in the paper satisfies $\mathbb{P}(\mathcal{K}\geq K)=\frac{1}{K}$. We can sample such a $K$ easily using the [inverse CDF](https://en.wikipedia.org/wiki/Inverse_transform_sampling), $K=\lceil\frac{u}{1-u}\rceil$, where $u$ is sampled uniformly from the interval $[0,1]$. 
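
As a quick empirical check of the tail condition, the snippet below draws $u$ and compares rounding $\frac{u}{1-u}$ up versus rounding it down. Only rounding up reproduces $\mathbb{P}(\mathcal{K}\geq k)=\frac{1}{k}$ for $k\geq 1$; rounding down gives $\frac{1}{k+1}$ instead. The sample size and the values of $k$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(200000)

K_ceil = np.ceil(u / (1 - u))    # rounding up
K_floor = np.floor(u / (1 - u))  # rounding down, for comparison

for k in (1, 2, 4, 8):
    print("k=%d  1/k=%.3f  ceil: %.3f  floor: %.3f"
          % (k, 1 / k, (K_ceil >= k).mean(), (K_floor >= k).mean()))
```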



Now putting things all together, we can estimate the log-partition using SUMO. 

```python
count = 0
bs = 10
iwlb = list()
while count <= 1000000:
  # Sample the truncation level k with P(K >= k) = 1/k via the inverse CDF.
  u = np.random.rand(1)
  k = np.ceil(u/(1-u)).astype(int)[0]
  xs = q.rvs(bs*(k+1))
  elbo = - U(xs) - q.logpdf(xs)
  iwlb_ = elbo.reshape(bs, k+1)
  # Column j holds IWAE_{j+1}: the log of the running mean of the importance weights.
  iwlb_ = np.log(np.cumsum(np.exp(iwlb_), 1) / np.arange(1,k+2))
  # IWAE_1 plus the differences Delta_j, each weighted by 1/P(K >= j) = j.
  iwlb_ = iwlb_[:,0] + ((iwlb_[:,1:k+1] - iwlb_[:,0:k]) * np.arange(1,k+1)).sum(1)
  count += bs * (k+1)
  iwlb.append(iwlb_)

iwlb = np.concatenate(iwlb)
plt.hist(iwlb, range(-5,10))
print("Estimate:  %.4f  / Ground true:  %.4f" % (iwlb.mean(), np.log(2*np.pi)))
print("Empirical variance: %.4f" % iwlb.var())
```

`Estimate:  1.8359  / Ground true:  1.8379`

`Empirical variance: 4.1794`

https://i.imgur.com/04kPKo5.png

Indeed, the empirical average is quite close to the true log-partition of the energy function. However, we can also see that the distribution of the estimator is much more spread out. In fact, it is very heavy-tailed. Note that I did not tune the proposal distribution $q$ based on the ELBO, IWAE, or SUMO. In the paper, the authors propose to tune $q$ to minimize the variance of the $\text{SUMO}$ estimator, which might be an interesting trick to look at next. 

(Reposted, see more details and code from https://www.chinweihuang.com/pages/sumo)


Discovering Reinforcement Learning Algorithms
Junhyuk Oh and Matteo Hessel and Wojciech M. Czarnecki and Zhongwen Xu and Hado van Hasselt and Satinder Singh and David Silver
arXiv e-Print archive - 2020 via Local arXiv
Keywords: cs.LG, cs.AI

[link] Summary by CodyWild 5 years ago

This work attempts to use meta-learning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model and outputs two values - a scalar and a vector - that are used to update the agent's model. I'm not going to go too deep into meta-learning here, but, at a high level, meta learning methods optimize parameters governing an agent's learning, and, over the course of many training processes over many environments, optimize those parameters such that the reward over the full lifetime of training is higher. 

To be more concrete, the agent in a given environment learns two things: 

- A policy, that is, a distribution over predicted action given a state.
- A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agent-model generates and updates. The notion here is that maybe our human-designed construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we meta-learn, we might find something more optimal. I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1-of-m prediction.

At each step, after acting in the environment, the agent passes to an LSTM: 

- The reward at the step
- A binary of whether the trajectory is done
- The discount factor
- The probability of the action that was taken from state t
- The prediction vector evaluated at state t
- The prediction vector evaluated at state t+1

Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things: 

- A scalar, pi-hat
- A prediction vector, y-hat

These two quantities are used to update the existing policy and prediction model according to the rule below.

https://i.imgur.com/xx1W9SU.png

Conceptually, the scalar governs whether to increase or decrease probability assigned to the taken action under the policy, and y-hat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input are dependent on the action or observation space of the environment, so, once it is learned it can (hopefully) generalize to new environments. 
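
To make that conceptual reading concrete, here is a rough sketch of what one update could look like for a single state. It rests on my own simplifying assumptions rather than the paper's exact rule: the policy and the prediction vector are both softmaxes over free logits, y-hat pulls the prediction via a cross-entropy gradient, and `apply_update`, `lr`, and `alpha_y` are hypothetical names.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def apply_update(policy_logits, pred_logits, action, pi_hat, y_hat, lr=0.1, alpha_y=1.0):
    """One hypothetical update step for a single state, given the LSTM outputs."""
    pi = softmax(policy_logits)
    y = softmax(pred_logits)
    # Gradient of log pi(action) w.r.t. the logits is onehot(action) - pi.
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0
    new_policy_logits = policy_logits + lr * pi_hat * grad_log_pi
    # Gradient of the cross-entropy between y_hat and softmax(pred_logits) is y - y_hat.
    new_pred_logits = pred_logits - lr * alpha_y * (y - y_hat)
    return new_policy_logits, new_pred_logits

policy_logits = np.zeros(3)
pred_logits = np.zeros(4)
y_hat = np.array([0.7, 0.1, 0.1, 0.1])
new_pol, new_pred = apply_update(policy_logits, pred_logits, action=1, pi_hat=2.0, y_hat=y_hat)
print(softmax(new_pol))   # probability of action 1 goes up because pi_hat > 0
print(softmax(new_pred))  # prediction moves toward y_hat
```

A negative `pi_hat` would push the probability of the taken action down instead, which is the sense in which the scalar "governs" the direction of the policy update.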

Given this, the basic meta learning objective falls out fairly easily - optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs.  However, things don't turn out to be quite that easy. The simplest version of this meta-learning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector). These include: 

- An entropy bonus, pushing the meta learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic)
- An L2 penalty on both pi-hat and y-hat
- A removal of the softmax that had originally been taken over the k-dimensional prediction vector categorical, and switching that target from a KL divergence to a straight mean squared error loss. As far as I can tell, this makes the prediction vector no longer actually a 1-of-k categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it make more sense to think of k separate binaries? This I was definitely confused about in the paper overall.

https://i.imgur.com/EL8R1yd.png

With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to be able to perform comparably to or better than A2C - the human-designed baseline - across the simple grid-worlds it was being trained in. However, the two most interesting aspects of the evaluation were: 

1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captured most of the information about what states were high value. However, beyond that, they found that the meta-learned vector could be used to predict the value calculated with discount rates different from the one used in meta-training, which the hand-engineered alternative, TD-lambda, wasn't able to do (it could only predict values well at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate. 

2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well - beating A2C in a few cases, though certainly not all. This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to Reinforcement Learning that it's capturing.


Fast Instance and Semantic Segmentation Exploiting Local Connectivity, Metric Learning, and One-Shot Detection for Robotics
Milioto, Andres and Mandtler, Leonard P. and Stachniss, Cyrill
International Conference on Robotics and Automation - 2019 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hadrien Bertrand 6 years ago

The paper proposes a method to perform joint instance and semantic segmentation. The method is fast as it is meant to run in an embedded environment (such as a robot). While the semantic map may seem redundant given the instance one, it is not as semantic segmentation is a key part of obtaining the instance map.

# Architecture

![image](https://user-images.githubusercontent.com/8659132/63187959-24cdb380-c02e-11e9-9121-77e0923e91c6.png)

The image is first put through a typical CNN encoder (specifically a ResNet derivative), followed by 3 separate decoders. The output of the decoders is at a low resolution for faster processing.

Decoders:
- Semantic segmentation: coupled with the encoder, it's U-Net-like. The output is a segmentation map.
- Instance center: for each pixel, outputs the confidence that it is the center of an object.
- Embedding: for each pixel, computes a 32-dimensional embedding. This embedding must have a low distance to the embeddings of other pixels of the same instance, and a high distance to the embeddings of pixels from other instances.

To obtain the instance map, the segmentation map is used to mask the other 2 decoder outputs to separate the embeddings and centers of each class. Centers are thresholded at 0.7, and centers with embedding distances lower than a set amount are discarded, as they are considered duplicates.

Then for each class, a similarity matrix is computed between all pixels from that class and centers from that class. Pixels are assigned to their closest centers, which represent different instances of the class.
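
The center filtering and pixel assignment for a single class could look roughly like the sketch below. It assumes Euclidean distance between embeddings and at least one surviving center; the function name `assign_instances` and the `merge_dist` value are hypothetical, and only the 0.7 threshold comes from the description above.

```python
import numpy as np

def assign_instances(embeddings, center_scores, score_thresh=0.7, merge_dist=0.5):
    """Assign each pixel of one class to its nearest kept center (hypothetical helper)."""
    candidates = np.flatnonzero(center_scores > score_thresh)
    # Visit candidates from most to least confident and drop near-duplicates.
    candidates = candidates[np.argsort(-center_scores[candidates])]
    kept = []
    for idx in candidates:
        if all(np.linalg.norm(embeddings[idx] - embeddings[j]) >= merge_dist for j in kept):
            kept.append(idx)
    centers = embeddings[kept]                                  # (C, D)
    # Distance from every pixel to every kept center, then nearest-center lookup.
    dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1), np.array(kept)

# Six pixels of one class with 2-D embeddings forming two tight groups.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
scores = np.array([0.9, 0.8, 0.1, 0.95, 0.2, 0.75])
labels, kept = assign_instances(emb, scores)
print(labels, kept)
```

In this toy case four pixels pass the confidence threshold, two of them are dropped as duplicates, and the remaining two centers split the six pixels into two instances.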

Finally, the segmentation and instance maps are upsampled using the SLIC algorithm.

# Loss

There is one loss for each decoder head.
- Semantic segmentation: weighted cross-entropy
- Instance center: cross-entropy term modulated by a $\gamma$ parameter to counter the over-representation of the background over the target classes.
![image](https://user-images.githubusercontent.com/8659132/63286485-22659680-c286-11e9-9134-f1b823a34217.png)

- Embedding: composed of 3 parts, an attracting force between embeddings of the same instance, a repelling force between embeddings of different instances, and an L2 regularization on the embedding.
![image](https://user-images.githubusercontent.com/8659132/63286399-f1856180-c285-11e9-9136-feb6c4a555e5.png)
![image](https://user-images.githubusercontent.com/8659132/63286411-fcd88d00-c285-11e9-939f-0771579d8263.png)
$\hat{e}$ are the embeddings, $\delta_a$ is a hyperparameter defining "close enough", and $\delta_b$ defines "far enough".

The whole model is trained jointly using a weighted sum of the 3 losses.
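
For intuition about the embedding term, here is a sketch of a pull/push loss in the spirit of the description above. The exact formulas are in the figures, so everything here is an assumption: squared hinges, distances measured to per-instance mean embeddings, regularization applied to those means, and the default values of `delta_a`, `delta_b`, and `reg_weight`.

```python
import numpy as np

def embedding_loss(emb, inst, delta_a=0.5, delta_b=1.5, reg_weight=0.001):
    """Hypothetical pull/push/regularization loss over per-pixel embeddings."""
    ids = np.unique(inst)
    means = np.stack([emb[inst == i].mean(axis=0) for i in ids])      # (C, D)
    # Attract: penalize pixels farther than delta_a from their instance mean.
    attract = np.mean([
        np.mean(np.maximum(0.0, np.linalg.norm(emb[inst == i] - mu, axis=1) - delta_a) ** 2)
        for i, mu in zip(ids, means)
    ])
    # Repel: penalize pairs of instance means closer than delta_b.
    repel_terms = [
        np.maximum(0.0, delta_b - np.linalg.norm(means[a] - means[b])) ** 2
        for a in range(len(ids)) for b in range(a + 1, len(ids))
    ]
    repel = np.mean(repel_terms) if repel_terms else 0.0
    # Regularize: keep the instance means from drifting far from the origin.
    reg = np.mean(np.linalg.norm(means, axis=1))
    return attract + repel + reg_weight * reg

emb = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 0.0], [4.2, 0.0]])
good = embedding_loss(emb, np.array([0, 0, 1, 1]))   # instances match the two groups
bad = embedding_loss(emb, np.array([0, 1, 0, 1]))    # instances cut across the groups
print(good, bad)
```

When the instance labels agree with the clusters, only the small regularization term remains; when they cut across the clusters, both the attract and repel terms become large.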

# Experiments and results

The authors test their method on the Cityscapes dataset, which is composed of 5000 annotated images and 8 instance classes. They evaluate it on both semantic segmentation and instance segmentation.

![image](https://user-images.githubusercontent.com/8659132/63287573-a882dc80-c288-11e9-83e0-b352e43bdf28.png)

For semantic segmentation, their method is ok, though ENet for example performs better on average and is much faster.

![image](https://user-images.githubusercontent.com/8659132/63287643-d700b780-c288-11e9-9d40-5bcaf695a744.png)

On the other hand, for instance segmentation, their method is much faster than the others while still performing well. Not SOTA on performance, but considering the real-time constraint, it's much better.

# Comments

- Most instance segmentation methods tend to be sluggish and overly complicated. This approach is much more elegant in my opinion.
- If they removed the aggressive down/up sampling, I wonder if they would beat MaskRCNN and PANet.
- I'm not sure what's the point of upsampling the semantic map given that we already have the instance map.


GROVER: Self-supervised Message Passing Transformer on Large-scale Molecular Data
Rong, Yu and Bian, Yatao and Xu, Tingyang and Xie, Weiyang and Wei, Ying and Huang, Wenbing and Huang, Junzhou
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

Large-scale transformers on unsupervised text data have been wildly successful in recent years; arguably, the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) large-scale unlabeled data to the problem of learning informative embeddings of molecular graphs. 

Labeling is a problem in much of machine learning - it's costly, and narrowly defined in terms of a certain task - but that problem is even more exacerbated when it comes to labeling properties of molecules, since they typically require wetlab chemistry to empirically measure. Given that, and also given the fact that we often want to predict new properties - like effectiveness against a new targetable drug receptor - that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere. 

There are two main conceptual parts to this paper and its method - named GROVER, in true-to-ML-form tortured acronym style. The first is the actual architecture of their model itself, which combines both a message-passing Graph Neural Network to aggregate local information, and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is: 

https://i.imgur.com/JY4vRdd.png
- There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and edge embedding, for four embeddings total. I'll describe the one for nodes, and the parallel for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes
- In the NodeTransformer version, a message passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weight-transforming them, then aggregating again) until each node has a representation that has "absorbed" information from a few hops out of its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, and so each node ends up with three different vectors for these three things.
- Multi-headed attention is then performed over these node representations, in the normal way, where all keys and queries are dot-product-ed together, and put into a softmax to calculate a weighted average over the values
- We now have node-level representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put into a MLP layer and Layer Norm before finally outputting a node-based node and edge representation. This is then joined by an edge-based node and edge representation from the parallel stack. These are aggregated on a full-graph level to predict graph-level properties
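
My reading of the node-side stack can be sketched in a few lines. This is a deliberately simplified illustration, not GROVER's actual architecture: it uses a single attention head, ignores edge features, uses a plain sum over neighbors with a ReLU and a residual connection, and all weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def message_passing(h, adj, W, hops=2):
    """Sum neighbor states, apply a weight matrix and ReLU, keep a residual; repeat."""
    for _ in range(hops):
        h = h + np.maximum(0.0, adj @ h @ W)
    return h

def node_attention(x, adj, Wq, Wk, Wv):
    # Separate message-passing networks produce queries, keys, and values.
    q, k, v = (message_passing(x, adj, W) for W in (Wq, Wk, Wv))
    weights = softmax(q @ k.T / np.sqrt(q.shape[1]))   # (N, N): every node attends to all nodes
    return weights @ v, weights

# Toy molecule graph: a 4-node chain 0-1-2-3 with 8-dimensional node features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(3))
out, weights = node_attention(x, adj, Wq, Wk, Wv)
print(out.shape, weights.sum(axis=1))
```

The message passing only mixes information along bonds, while the attention step lets nodes 0 and 3, which are three hops apart, exchange information directly.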

https://i.imgur.com/NNl6v4Y.png

The other component of the GROVER model is the way this architecture is actually trained - without explicit supervised labels. The authors use two tasks - one local, and one global. The local task constructs labels based on local contextual properties of a given atom - for example, the atom here has one double-bonded Nitrogen and one single-bonded Oxygen in its local environment - and tries to predict those labels given the representations of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different motifs or functional groups in the molecule, and encodes those into an 85-dimensional multi-hot vector that is being predicted on a graph level. 

https://i.imgur.com/jzbYchA.png

With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are fine-tuned on small amounts of labeled data. The results are pretty impressive - it achieves new SOTA performance by relatively large amounts on all tasks, even relative to existing semi-supervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graph-aggregation step before a transformer (the alternative being just doing a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly in downstream tasks) also performs worse. One thing I wish they'd directly ablated was the value-add of the local (also referred to as "contextual") and global semi-supervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly.