ShortScience.org - Making Science Accessible!

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a recurrent neural network architecture in which some of the recurrent weights dynamically change during the forward pass, using a Hebbian-like rule. These weights correspond to the matrices $A(t)$ in the figure below:

![Fast weights RNN figure](http://i.imgur.com/DCznSf4.png)

These weights $A(t)$ are referred to as *fast weights*. Comparatively, the recurrent weights $W$ are referred to as slow weights, since they are only changing due to normal training and are otherwise kept constant at test time.

More specifically, the proposed fast weights RNN computes a series of hidden states $h(t)$ over time steps $t$, but, unlike regular RNNs, the transition from $h(t)$ to $h(t+1)$ consists of multiple ($S$) recurrent layers $h_1(t+1), \dots, h_{S-1}(t+1), h_S(t+1)$, defined as follows:

$$h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))$$

where $f$ is an element-wise non-linearity such as the ReLU activation. The next hidden state $h(t+1)$ is simply defined as the last "inner loop" hidden state $h_S(t+1)$, before moving to the next time step. 

As for the fast weights $A(t)$, they too change between time steps, using the Hebbian-like rule:

$$A(t+1) = \lambda A(t) + \eta h(t) h(t)^T$$

where $\lambda$ acts as a decay rate (to partially forget some of what's in the past) and $\eta$ as the fast weights' "learning rate" (not to be confused with the learning rate used during backprop). Thus, the role played by the fast weights is to rapidly adjust to the recent hidden states and remember the recent past.
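
To make the two updates concrete, here is a minimal NumPy sketch of one time step, written from the equations above rather than from the authors' code. Layer normalization is left out, and the way the inner loop is started (from $f(W h(t) + C x(t))$) is my assumption, since the equations above only give the recurrence from $h_s$ to $h_{s+1}$. The sizes and the random weights are placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fast_weights_step(h, x, A, W, C, S=1, lam=0.9, eta=0.5):
    """One step t -> t+1. h is h(t), x is x(t), A is A(t)."""
    slow_input = W @ h + C @ x      # stays fixed during the inner loop
    h_s = relu(slow_input)          # ASSUMPTION: starting point of the inner loop
    for _ in range(S):
        # h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))
        h_s = relu(slow_input + A @ h_s)
    # A(t+1) = lambda A(t) + eta h(t) h(t)^T
    A_next = lam * A + eta * np.outer(h, h)
    return h_s, A_next

# Toy run with placeholder sizes and weights (not values from the paper).
rng = np.random.default_rng(0)
d_h, d_x, T = 4, 3, 5
W = 0.05 * np.eye(d_h)                       # scaled identity, IRNN-style
C = rng.normal(scale=0.1, size=(d_h, d_x))
h, A = np.zeros(d_h), np.zeros((d_h, d_h))
for t in range(T):
    h, A = fast_weights_step(h, rng.normal(size=d_x), A, W, C)
```

Note that $A$ is a sum of outer products $h h^T$, so it stays symmetric throughout.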

In fact, the authors show an explicit relation between these fast weights and memory-augmented architectures that have recently been popular. Indeed, by recursively applying and expanding the equation for the fast weights, one obtains

$$A(t) = \eta \sum_{\tau = 1}^{\tau = t-1}\lambda^{t-\tau-1} h(\tau) h(\tau)^T$$

*(note the difference with Equation 3 of the paper... I think there was a typo)* which implies that when computing the $A(t) h_s(t+1)$ term in the expression to go from $h_s(t+1)$ to $h_{s+1}(t+1)$, this term actually corresponds to

$$A(t) h_s(t+1) = \eta \sum_{\tau =1}^{\tau = t-1} \lambda^{t-\tau-1} h(\tau) (h(\tau)^T h_s(t+1))$$

i.e. $A(t) h_s(t+1)$ is a weighted sum of all previous hidden states $h(\tau)$, with each hidden state weighted by an "attention weight" $h(\tau)^T h_s(t+1)$ (further scaled by the decay factor $\eta \lambda^{t-\tau-1}$). The difference with many recent memory-augmented architectures is thus that the attention weights aren't computed using a softmax non-linearity.
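
The equivalence between the explicit matrix $A(t)$ and the attention-like sum over stored hidden states can be checked numerically. The sketch below uses random vectors as stand-ins for $h(1), \dots, h(t-1)$ and for the query $h_s(t+1)$, and assumes $A(1) = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, lam, eta = 5, 6, 0.9, 0.5
past = [rng.normal(size=d) for _ in range(t - 1)]   # stand-ins for h(1), ..., h(t-1)
query = rng.normal(size=d)                          # stand-in for h_s(t+1)

# Explicit view: build A(t) with the recursive update, starting from A(1) = 0.
A = np.zeros((d, d))
for h in past:
    A = lam * A + eta * np.outer(h, h)

# Memory view: decayed, unnormalized attention over the stored hidden states.
attended = np.zeros(d)
for tau, h in enumerate(past, start=1):
    attention = h @ query                           # h(tau)^T h_s(t+1)
    attended += eta * lam ** (t - tau - 1) * attention * h

matches = bool(np.allclose(A @ query, attended))
print(matches)
```

The memory view never materializes the $d \times d$ matrix; it only needs the list of past hidden states.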

Experimentally, they find it beneficial to use [layer normalization](https://arxiv.org/abs/1607.06450). Good values for $\eta$ and $\lambda$ seem to be 0.5 and 0.9 respectively. I'm not 100% sure, but I also understand that using $S=1$, i.e. using the fast weights only once per time step, was usually found to be optimal. Also see Figure 3 for the architecture used on the image classification datasets, which is slightly more involved.

The authors present a series of 4 experiments, comparing with regular RNNs (IRNNs, which are RNNs with ReLU units and whose recurrent weights are initialized to a scaled identity matrix) and LSTMs (as well as an associative LSTM for a synthetic associative retrieval task and ConvNets for the two image datasets). Generally, experiments illustrate that the fast weights RNN tends to train faster (in number of updates) and better than the other recurrent architectures. Surprisingly, the fast weights RNN can even be competitive with a ConvNet on the two image classification benchmarks, where the RNN traverses glimpses from the image using a fixed policy.

**My two cents**

This is a very thought-provoking paper which, based on the comparison with LSTMs, suggests that fast weights RNNs might be a very good alternative. I'd be quite curious to see what would happen if one were to replace LSTMs with them in the myriad of papers using LSTMs (e.g. all the Seq2Seq work). Intuitively, LSTMs seem to be able to do more than just attend to the recent past. But, for a given task, if one were to observe that fast weights RNNs are competitive with LSTMs, it would suggest that the LSTM isn't doing something that much more complex. So it would be interesting to determine on which tasks the extra capacity of an LSTM is actually valuable and exploitable. Hopefully the authors will release some code, to facilitate this exploration.

The discussion at the end of Section 3 on how exploiting the "memory augmented" view of fast weights is useful to allow the use of minibatches is interesting. However, it also suggests that computation in the fast weights RNN scales quadratically with the sequence length (since in this view, the RNN technically must attend to all previous hidden states, since the beginning of the sequence). This is something to keep in mind, if one were to consider applying this to very long sequences (i.e. much longer than the hidden state dimensionality).

Also, I don't quite get the argument that the "memory augmented" view of fast weights is more amenable to mini-batch training. I understand that having an explicit weight matrix $A(t)$ for each minibatch sequence complicates things. However, in the memory augmented view, we also have a "memory matrix" that is different for each sequence, and yet we can handle that fine. The problem I can imagine is that storing a *sequence of arbitrary weight matrices* for each sequence might be storage demanding (and thus perhaps make it impossible to store a forward/backward pass for more than one sequence at a time), while the implicit memory matrix only requires appending a new row at each time step. Perhaps the argument to be made here is more that there's already mini-batch compatible code out there for dealing with the use of a memory matrix of stored previous memory states.

This work bears some (partial) resemblance to other recent work, which may serve as food for thought here. The use of possibly multiple computation layers between time steps reminds me of [Adaptive Computation Time (ACT) RNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/Graves16). Also, expressing a backpropable architecture that involves updates to weights (here, Hebbian-like updates) reminds me of recent work that does backprop through the updates of a gradient descent procedure (for instance as in [this work](http://www.shortscience.org/paper?bibtexKey=conf/icml/MaclaurinDA15)).

Finally, while I was familiar with the notion of fast weights from the work on [Using Fast Weights to Improve Persistent Contrastive Divergence](http://people.ee.duke.edu/~lcarin/FastGibbsMixing.pdf), I didn't realize that this concept dated as far back as the late 80s. So, for young researchers out there looking for inspiration for research ideas, this paper confirms that looking at the older neural network literature for inspiration is probably a very good strategy :-)

To sum up, this is really nice work, and I'm looking forward to the NIPS 2016 oral presentation of it!

[link] Summary by Liew Jun Hao 9 years ago

#### Introduction
Most recent semantic segmentation algorithms rely (explicitly or implicitly) on FCN. However, the large receptive field and many pooling layers lead to low spatial resolution in the deep layers. On top of that, the lack of an explicit pixelwise grouping mechanism often produces spatially fragmented and inconsistent results. To address this, the authors propose Convolutional Random Walk Networks (RWNs), which diffuse the FCN potentials in a random walk fashion based on learned pixelwise affinities to enforce the spatial consistency of the segmentation. One main contribution is that the RWN needs only 131 more parameters than the DeepLab architecture and yet outperforms DeepLab by 1.5% on the Pascal SBD dataset.

##### __1. Review of random graph walks__
In graph theory, an undirected graph is defined as $G=(V,E)$ where $V$ and $E$ are the sets of vertices and edges respectively. A random walk in a graph is characterized by the transition probabilities between vertices. Let $W$ be an $n \times n$ symmetric *affinity* matrix where $W_{ij}$ encodes the similarity of nodes $i$ and $j$ (usually with Gaussian affinities). Then, the random walk transition matrix $A$ is defined as $A = D^{-1}W$ where $D$ is an $n \times n$ diagonal *degree* matrix with $D_{ii} = \sum_j W_{ij}$. Let $y_t$ denote the node distribution at time $t$; the distribution after one step of the random walk process is $y_{t+1}=Ay_{t}$. The random walk process can be iterated until convergence.
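
As a small illustration of the definitions above, the following NumPy sketch builds Gaussian affinities for five made-up 1-D points, row-normalizes them into $A = D^{-1}W$, and applies $y_{t+1} = A y_t$ a few times. The points, the bandwidth and the number of steps are arbitrary choices for the example.

```python
import numpy as np

def transition_matrix(W):
    """Row-normalize an affinity matrix: A = D^{-1} W with D_ii = sum_j W_ij."""
    degree = W.sum(axis=1)
    return W / degree[:, None]

# Five toy 1-D points forming two well separated groups (made-up data).
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
sigma = 0.5
W = np.exp(-((points[:, None] - points[None, :]) ** 2) / sigma**2)  # Gaussian affinities
A = transition_matrix(W)

# Start from an indicator on node 0 and apply y_{t+1} = A y_t three times.
y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
for _ in range(3):
    y = A @ y
```

Because each row of $A$ sums to one, every step replaces a node's value by an affinity-weighted average of its neighbours' values, so the signal spreads within the first group and barely reaches the second one.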

##### __2. Overall architecture__
The overall architecture consists of 3 branches:
* semantic segmentation branch (which is FCN)
* pixel-level affinity branch (to learn affinities)
* random walk layer (diffuse FCN potentials based on learned affinities)

![RWN](http://i.imgur.com/au5PoY2.png)

##### __A) Semantic segmentation branch__
The authors employ the DeepLab-LargeFOV FCN architecture as the semantic segmentation branch. As a result, the resolution of the $fc8$ activations is 8 times lower than that of the original image. Let $f \in \mathbb{R}^{n \times n \times m}$ denote the $fc8$ activations, where $n$ refers to the height/width of the (downsampled) image and $m$ denotes the feature dimension.

##### __B) Pixelwise affinity branch__
Hand-crafted affinities are usually Gaussian, i.e. $\exp\left(-\frac{(x-y)^2}{\sigma^2}\right)$, where $x$ and $y$ are usually pixel intensities while $\sigma$ controls the smoothness. In this work, the authors argue that learned affinities work better than hand-crafted color affinities. Apart from RGB features, $conv1\texttt{_}1$ (64 dimensional) and $conv1\texttt{_}2$ (64 dimensional) are also employed to build the affinities. In particular, the 3 features are first downsampled by a factor of 8 to match the resolution of $fc8$ and concatenated to form a matrix of size $n \times n \times k$ where $k=131$ (since 3+64+64=131). Then, the $L1$ pairwise distance is computed for __each__ dimension to form a __sparse__ matrix, $F \in \mathbb{R}^{n^2 \times n^2 \times 131}$ (the sparsity is due to the fact that the distance is computed only for pixel pairs within a radius of $R$). A $1 \times 1 \times 1$ $conv$ is attached (the kernel therefore has dimension 131, which accounts for the only additional learned parameters in this work) followed by an $\exp$ layer, forming a sparse affinity matrix, $W \in \mathbb{R}^{n^2 \times n^2 \times 1}$. A Euclidean loss layer is attached to optimize w.r.t. the ground truth pixel affinities obtained from semantic segmentation annotations.
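
The affinity computation described above can be sketched in NumPy as follows. This is a dense toy version for readability (the summary describes a sparse matrix), the features and the 131 weights are random placeholders rather than learned values, and measuring the radius $R$ as a Euclidean distance in pixel units is my assumption.

```python
import numpy as np

def learned_affinity(features, weights, radius):
    """features: (n, n, k) array; weights: (k,) kernel of the 1x1 conv."""
    n, _, k = features.shape
    flat = features.reshape(n * n, k)
    rows, cols = np.divmod(np.arange(n * n), n)           # pixel coordinates
    sq_dist = (rows[:, None] - rows[None, :]) ** 2 + (cols[:, None] - cols[None, :]) ** 2
    within_radius = sq_dist <= radius**2                   # ASSUMPTION: Euclidean radius
    F = np.abs(flat[:, None, :] - flat[None, :, :])       # per-dimension L1 distances
    W = np.exp(F @ weights)                                # 1x1 conv over k, then exp
    return np.where(within_radius, W, 0.0)                 # pairs outside R get no affinity

rng = np.random.default_rng(0)
n, k = 4, 131                                              # k = 3 + 64 + 64
features = rng.random((n, n, k))                           # placeholder for RGB + conv1_1 + conv1_2
weights = -0.05 * rng.random(k)                            # placeholder for the 131 learned parameters
W = learned_affinity(features, weights, radius=1.5)
```

With negative weights, larger feature differences give smaller affinities, which is the behaviour one would expect the training to recover.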

##### __C) Random walk layer__
The random walk layer diffuses the $fc8$ potentials from semantic segmentation branch using the learned pixelwise affinity $W$. First, the random walk transition matrix $A$ is computed by row-normalizing $W$. The diffused segmentation prediction is therefore $\hat{y}=A^tf$ to simulate $t$ random walk steps. The random walk layer is finally attached to a softmax layer (with cross-entropy loss) and trained end-to-end.

##### 3. Discussion
* Although RWN demonstrates the improvement of the coarse prediction, post-processing such as Dense-CRF or Graph Cuts is still required. 
* The authors showed that the learned affinity is better than the hand-crafted color affinities. This is probably due to the finding that $conv1\texttt{_}2$ features help improve the prediction.
* The authors observed that a single random walk step is optimal.
* For the pixelwise affinity branches, only $conv1\texttt{_}1$, $conv1\texttt{_}2$ and RGB cues are used due to their same spatial dimension as the original image. Intuitively, only low level features are required to ensure that higher level features (from later layers) won't diffuse across boundaries (which is encoded in earlier layers).

#### Conclusion
The authors proposed an RWN that diffuses the higher level (more abstract) features based on __learned__ pixelwise affinities (lower level cues) in a random walk fashion.

[link] Summary by Liew Jun Hao 9 years ago

#### Introduction
This [paper](https://github.com/lucastheis/ride) introduces the *recurrent image density estimator* (RIDE), a generative model that combines a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural images. In this work, the authors use *spatial* LSTMs (SLSTM) to capture the semantics in the form of hidden states; these hidden vectors are then fed into a *factorized mixture of conditional Gaussian scale mixtures* (MCGSM) to predict the state of the corresponding pixels.

##### __1. Spatial long short-term memory (SLSTM)__
This is a straightforward extension of the multidimensional RNN in order to capture long range interactions. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ be the intensity of the pixel at location $ij$. At each location $ij$, each LSTM unit performs the following operations:

$\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j-1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i-1,j} \odot \mathbf{f}^r_{ij} $ 

$\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$

$\begin{pmatrix} 
\mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{f}_{ij}^r\\ \mathbf{f}_{ij}^c
\end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma\\ \sigma  \end{pmatrix} T_{\mathbf{A,b}}  \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j-1} \\ \mathbf{h}_{i-1,j} \end{pmatrix} $

where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are memory units and hidden units respectively, and $T_{\mathbf{A,b}}$ is an affine transformation with parameters $\mathbf{A}$ and $\mathbf{b}$. Note that there are 2 different forget gates $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$ for the 2 preceding memory states $\mathbf{c}_{i,j-1}$ and $\mathbf{c}_{i-1,j}$. Also note that $\mathbf{x}_{<ij}$ here denotes a *causal neighborhood* of the pixel, obtained by applying the Markov assumption.
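
The cell equations above translate fairly directly into NumPy. The sketch below is not taken from the RIDE repository: the causal neighborhood is reduced to just the left and upper pixel (a hypothetical choice to keep the example small), the boundary states are zero, and the parameters are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_cell(x_causal, h_left, h_up, c_left, c_up, A, b):
    """One SLSTM update at location ij; 'left' is (i, j-1), 'up' is (i-1, j)."""
    n = h_left.shape[0]
    z = A @ np.concatenate([x_causal, h_left, h_up]) + b   # affine map T_{A,b}
    g = np.tanh(z[0:n])
    o = sigmoid(z[n:2 * n])
    i = sigmoid(z[2 * n:3 * n])
    f_r = sigmoid(z[3 * n:4 * n])
    f_c = sigmoid(z[4 * n:5 * n])
    c = g * i + c_left * f_c + c_up * f_r
    h = np.tanh(c * o)
    return h, c

def slstm_scan(image, A, b, n_hidden):
    """Visit pixels row by row; index 0 of the padded arrays is the zero boundary."""
    rows, cols = image.shape
    H = np.zeros((rows + 1, cols + 1, n_hidden))
    C = np.zeros_like(H)
    padded = np.pad(image, ((1, 0), (1, 0)))
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # HYPOTHETICAL causal neighborhood: left and upper pixel only.
            x_causal = np.array([padded[i, j - 1], padded[i - 1, j]])
            H[i, j], C[i, j] = slstm_cell(
                x_causal, H[i, j - 1], H[i - 1, j], C[i, j - 1], C[i - 1, j], A, b)
    return H[1:, 1:]

rng = np.random.default_rng(0)
n_hidden = 3
A = rng.normal(scale=0.5, size=(5 * n_hidden, 2 + 2 * n_hidden))
b = np.zeros(5 * n_hidden)
image = rng.random((3, 4))
H = slstm_scan(image, A, b, n_hidden)
```

The hidden state at $ij$ never sees $x_{ij}$ itself, which is what allows it to be used for predicting that pixel.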

![ride_1](http://i.imgur.com/W8ugGvl.png)
As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through feedforward connections, there is an indirect connection to a much larger region (red) via recurrent connections.

##### __2. Factorized mixtures of conditional Gaussian scale mixtures__
A generative model can usually be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta})$ using the chain rule. One way to improve the representational power of a model is to introduce a different set of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta} \}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying shared parameters leads to a drastic increase in the number of parameters. Therefore, the authors apply 2 commonly used assumptions:
1. __Markov  assumption__: $\mathbf{x}_{<ij}$ is limited to small neighborhood around $x_{ij}$ (causal neighborhood)
2. __Stationarity and shift invariance__: the same set of $\mathbf{\theta}_{ij}$ is used for every location ${ij}$, which corresponds to the recurrent structure in an RNN.

Therefore, the hidden vector from the SLSTM can be fed into the MCGSM to predict the state of the corresponding pixel, i.e. $p(x_{ij} | \textbf{x}_{<ij}) = p(x_{ij} | \textbf{h}_{ij})$.

The conditional distribution in the MCGSM is represented as a mixture of experts:

$p(x_{ij} | \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s | \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) p (x_{ij} | \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$.

where the first and second terms correspond to the gate and the experts respectively. To further reduce the number of parameters, the authors proposed using a *factorized* MCGSM in order to use larger neighborhoods and more mixture components. (*__Remarks__: I am not too sure about the exact training of MCGSM, but as far as I understand, the MCGSM is first trained end-to-end with the SLSTM using SGD with momentum and then finetuned using L-BFGS after each epoch by fixing the parameters of the SLSTM.*)

* For training:

```
for n in range(num_epochs):
	for b in range(0, inputs.shape[0] - batch_size + 1, batch_size):
		# compute gradients
		f, df = f_df(params, b)

		loss.append(f / log(2.) / self.num_channels)

		# update SLSTM parameters
		for l in train_layers:
			for key in params['slstm'][l]:
				diff['slstm'][l][key] = momentum * diff['slstm'][l][key] - df['slstm'][l][key]
				params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key]

		# update MCGSM parameters
		diff['mcgsm'] = momentum * diff['mcgsm'] - df['mcgsm']
		params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm']
```

* Finetuning (part of the code)

```
for l in range(self.num_layers):
	self.slstm[l] = SLSTM(
		num_rows=hiddens.shape[1],
		num_cols=hiddens.shape[2],
		num_channels=hiddens.shape[3],
		num_hiddens=self.num_hiddens,
		batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]),
		nonlinearity=self.nonlinearity,
		extended=self.extended,
		slstm=self.slstm[l],
		verbosity=self.verbosity)

	hiddens = self.slstm[l].forward(hiddens)

# finetune with early stopping based on validation performance
return self.mcgsm.train(
	hiddens_train, outputs_train,
	hiddens_valid, outputs_valid,
	parameters={
		'verbosity': self.verbosity,
		'train_means': train_means,
		'max_iter': max_iter})
```

[link] Summary by Shagun Sodhani 9 years ago

#### Introduction

* Method to visualize high-dimensional data points in 2/3 dimensional space.
* Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation.
* Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
* t-SNE approach converts data into a matrix of pairwise similarities and visualizes this matrix.
* Based on SNE (Stochastic Neighbor Embedding)
* [Link to paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)

#### SNE

* Given a set of datapoints $x_1, x_2, \dots, x_n$, $p_{i|j}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$ with variance $\sigma_i^2$ (the paper writes this quantity as $p_{j|i}$). Calculation of $\sigma_i$ is described later.
* Similarly, define $q_{i|j}$ as the conditional probability corresponding to the low-dimensional representations $y_i$ and $y_j$ (corresponding to $x_i$ and $x_j$). The variance of the Gaussian in this case is set to be $1/\sqrt{2}$.
* The argument is that if $y_i$ and $y_j$ capture the similarity between $x_i$ and $x_j$, $p_{i|j}$ and $q_{i|j}$ should be equal. So the objective function to be minimized is the sum over $i$ of the Kullback-Leibler (KL) divergences between $P_i$ and $Q_i$, where $P_i$ ($Q_i$) represents the conditional probability distribution given $x_i$ ($y_i$).
* Since KL Divergence is not symmetric, the objective function focuses on retaining the local structure.
* Users specify a value called perplexity (measure of effective number of neighbors). A binary search is performed to find $\sigma_i$ which produces the $P_i$ having same perplexity as specified by the user.
* Gradient Descent (with momentum) is used to minimize objective function and Gaussian noise is added in early stages to perform simulated annealing.
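
The perplexity-driven search for $\sigma_i$ described above can be sketched as follows. This is a minimal NumPy version on random data, not the reference implementation: the search bounds, the number of bisection steps and the bisection on a log scale are my choices. Row `i` of `P` holds the distribution $P_i$, i.e. `P[i, j]` is the probability that $x_i$ picks $x_j$.

```python
import numpy as np

def conditional_probs(sq_dists_i, i, sigma):
    """Distribution P_i over neighbors of x_i under a Gaussian with std sigma."""
    logits = -sq_dists_i / (2.0 * sigma**2)
    logits[i] = -np.inf                  # a point never picks itself
    logits -= logits.max()               # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    nz = p[p > 0]
    entropy_bits = -(nz * np.log2(nz)).sum()
    return 2.0 ** entropy_bits

def find_sigma(sq_dists_i, i, target, lo=1e-4, hi=1e4, iters=60):
    """Binary search (on a log scale) for the sigma_i giving the target perplexity."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists_i, i, mid)) > target:
            hi = mid                     # too many effective neighbors: shrink sigma
        else:
            lo = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))             # toy high-dimensional data
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
target = 5.0
sigmas = np.array([find_sigma(sq_dists[i], i, target) for i in range(len(X))])
P = np.vstack([conditional_probs(sq_dists[i], i, sigmas[i]) for i in range(len(X))])
```

The search works because the perplexity of $P_i$ grows monotonically with $\sigma_i$, from 1 (only the nearest neighbor) towards $n-1$ (all neighbors equally likely).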

#### t-SNE (t-Distributed SNE)

##### Symmetric SNE

* A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized.
* Symmetric because the joint probabilities satisfy $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$.
* More robust to outliers and has a simpler gradient expression.

##### Crowding Problem

* When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from moderately distant datapoints and gaps can not form between natural clusters.
* One way around this problem is to use UNI-SNE but optimization of the cost function, in that case, is difficult.

##### t-SNE

* Instead of Gaussian, use a heavy-tailed distribution (like Student-t distribution) to convert distances into probability scores in low dimensions. This way moderate distance in high-dimensional space can be modeled by larger distance in low-dimensional space.
* Student-t distribution is an infinite mixture of Gaussians and density for a point under this distribution can be computed very fast. 
* The cost function is easy to optimize.
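
For reference, the low-dimensional similarities and the gradient used by t-SNE can be written in a few lines of NumPy. The formulas are the ones given in the t-SNE paper ($q_{ij} \propto (1 + \|y_i - y_j\|^2)^{-1}$ and $\partial C / \partial y_i = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$); the matrix `P` below is a random symmetric stand-in for the real joint probabilities, and the step size is arbitrary.

```python
import numpy as np

def student_t_q(Y):
    """Joint probabilities Q in the embedding, using a Student-t kernel (1 d.o.f.)."""
    diff = Y[:, None, :] - Y[None, :, :]
    kernel = 1.0 / (1.0 + (diff ** 2).sum(-1))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum(), kernel

def kl_divergence(P, Q):
    mask = P > 0
    return (P[mask] * np.log(P[mask] / Q[mask])).sum()

def tsne_gradient(P, Y):
    Q, kernel = student_t_q(Y)
    coeff = (P - Q) * kernel
    diff = Y[:, None, :] - Y[None, :, :]
    return 4.0 * (coeff[:, :, None] * diff).sum(axis=1)

rng = np.random.default_rng(0)
n = 8
R = rng.random((n, n))
R = R + R.T
np.fill_diagonal(R, 0.0)
P = R / R.sum()                          # stand-in for the symmetric joint P
Y = rng.normal(scale=1e-2, size=(n, 2))  # small random initial embedding

Q, _ = student_t_q(Y)
grad = tsne_gradient(P, Y)
Y_next = Y - 10.0 * grad                 # one plain gradient step (no momentum)
```

The heavy tail is visible in the kernel itself: at distance 3 the Student-t kernel is $1/(1+9) = 0.1$, while a Gaussian kernel $\exp(-d^2)$ would give roughly $10^{-4}$.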

##### Optimization Tricks

###### Early Compression

* At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another. 
* Implemented as an L2-penalty term proportional to the sum of the squared distances of datapoints from the origin.

###### Early Exaggeration

* Scale all the $p_{i|j}$'s so that large $q_{i|j}$'s are obtained, with the effect that natural clusters in the data form tight, widely separated clusters as a lot of empty space is created in the low-dimensional space.

##### t-SNE on large datasets

* Space and time complexity is quadratic in the number of datapoints so infeasible to apply on large datasets.
* Select a random subset of points (called landmark points) to display.
* For each landmark point, define a random walk starting at that landmark point and terminating at any other landmark point.
* $p_{i|j}$ is defined as fraction of random walks starting at $x_i$ and finishing at $x_j$ (both these points are landmark points). This way, $p_{i|j}$ is not sensitive to "short-circuits" in the graph (due to noisy data points).

#### Advantages of t-SNE

* Gaussian kernel employed by t-SNE (in high-dimensional) defines a soft border between the local and global structure of the data.
* Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
* The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
* Random walk version of t-SNE takes care of "short-circuit" problem.

#### Limitations of t-SNE

* It is unclear how t-SNE would perform on general **Dimensionality Reduction** for more than 3 dimensions. For such higher (than 3) dimensions, Student-t distribution with more degrees of freedom should be more appropriate.
* t-SNE reduces the dimensionality of data mainly based on local properties of the data which means it would fail in data which has intrinsically high dimensional structure (**curse of dimensionality**).
* The cost function for t-SNE is not convex requiring several optimization parameters to be chosen which would affect the constructed solution.

[link] Summary by Shagun Sodhani 9 years ago

### Introduction

* *Curriculum Learning* - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks.
* Motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
* [Link](http://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf) to the paper.

### Contributions of the paper

* Explore cases that show that curriculum learning benefits machine learning.
* Offer hypotheses about when and why this happens.
* Explore relation of curriculum learning with other machine learning approaches.

### Experiments with convex criteria

* Training a perceptron where some input data is irrelevant (not predictive of the target class).
* Difficulty can be defined in terms of the number of irrelevant samples or margin from the separating hyperplane.
* Curriculum learning model outperforms no-curriculum based approach.
* Surprisingly, in the case of difficulty defined in terms of the number of irrelevant examples, the anti-curriculum strategy also outperforms no-curriculum strategy.

### Experiments on shape recognition with datasets having different variability in shapes

* Standard (target) dataset - Images of rectangles, ellipses, and triangles.
* Easy dataset - Images of squares, circles, and equilateral triangles.
* Start performing gradient descent on easy dataset and switch to target data set at a particular epoch (called *switch epoch*).
* For no-curriculum learning, the first epoch is the *switch epoch*.
* As *switch epoch* increases, the classification error comes down with the best performance when *switch epoch* is half the total number of epochs.
* Paper does not report results for higher values of *switch epoch*.
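
The switch-epoch schedule above amounts to a one-line rule. The sketch below numbers epochs from 1 and assumes that the switch epoch itself is already trained on the target dataset, which matches the statement that no-curriculum corresponds to switching at the first epoch; the function names are mine.

```python
def dataset_for_epoch(epoch, switch_epoch):
    """Return which dataset to train on at a given epoch (epochs start at 1)."""
    return "easy" if epoch < switch_epoch else "target"

def curriculum_schedule(num_epochs, switch_epoch):
    return [dataset_for_epoch(e, switch_epoch) for e in range(1, num_epochs + 1)]

no_curriculum = curriculum_schedule(num_epochs=4, switch_epoch=1)
half_way = curriculum_schedule(num_epochs=4, switch_epoch=3)
print(no_curriculum)
print(half_way)
```

With 4 epochs, `switch_epoch=3` spends the first half of training on the easy dataset and the second half on the target dataset.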

### Experiments on language modeling

* Standard data set is the set of all possible windows of the text of size 5 from Wikipedia where all words in the window appear in 20000 most frequent words.
* Easy dataset considers only those windows where all words appear in 5000 most frequent words in vocabulary.
* Each word from the vocabulary is embedded into a *d* dimensional feature space using a matrix **W** (to be learnt).
* The model predicts the score of next word, given a window of words.
* Expected value of ranking loss function is minimized to learn **W**.
* Curriculum Learning-based model overtakes the other model soon after switching to the target vocabulary, indicating that curriculum-based model quickly learns new words.

### Curriculum as a continuation method

* Continuation methods start with a smoothed objective function and gradually move to less smoothed function.
* Useful in the case where the objective function is non-convex.
* Consider a family of cost functions $C_\lambda (\theta)$ such that $C_0(\theta)$ can be easily optimized and $C_1(\theta)$ is the actual objective function.
* Start with $C_0 (\theta)$ and increase $\lambda$, keeping $\theta$ at a local minimum of $C_\lambda (\theta)$.
* Idea is to move $\theta$ towards a dominant (if not global) minimum of $C_1(\theta)$.
* Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective.
* The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).
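
A toy illustration of the continuation idea, with a one-dimensional cost family of my own choosing (not from the paper): $C_\lambda(\theta) = \theta^2 + 2\lambda \sin(3\theta)$, so that $C_0$ is convex and $C_1$ has several local minima. Plain gradient descent on $C_1$ is compared with descent that increases $\lambda$ in stages while keeping $\theta$ at a local minimum of the current $C_\lambda$.

```python
import math

def cost(theta, lam):
    """Toy family: C_0 is the convex theta^2, C_1 adds a sinusoidal ripple."""
    return theta**2 + lam * 2.0 * math.sin(3.0 * theta)

def grad(theta, lam):
    return 2.0 * theta + lam * 6.0 * math.cos(3.0 * theta)

def descend(theta, lam, steps=500, lr=0.01):
    for _ in range(steps):
        theta -= lr * grad(theta, lam)
    return theta

theta_start = 3.0

# Baseline: optimize the actual objective C_1 directly.
theta_naive = descend(theta_start, 1.0, steps=2000)

# Continuation: lambda = 0.0, 0.1, ..., 1.0, warm-starting each stage.
theta_cont = theta_start
for k in range(11):
    theta_cont = descend(theta_cont, k / 10)

print(theta_naive, cost(theta_naive, 1.0))
print(theta_cont, cost(theta_cont, 1.0))
```

On this particular function the direct run stops in a local minimum close to its starting point, while the staged run first moves to the minimum of $C_0$ and then tracks it to a lower minimum of $C_1$. This is only an illustration; nothing guarantees such behaviour in general.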

### Advantages of Curriculum Learning

* Faster training in the online setting as learner does not try to learn difficult examples when it is not ready.
* Guiding training towards better local minima in parameter space, specifically useful for non-convex methods.

### Relation to other machine learning approaches

* **Unsupervised preprocessing** - Both have a regularizing effect and lower the generalization error for the same training error.
* **Active learning** - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy.
* **Boosting Algorithms** - Difficult examples are gradually emphasised though the curriculum starts with a focus on easier examples and the training criteria do not change.
* **Transfer learning** and **Life-long learning** - Initial tasks are used to guide the optimisation problem.

### Criticism

* Curriculum Learning is not well understood, making it difficult to define the curriculum.
* In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.