#### Introduction

* Method to visualize high-dimensional data points in 2 or 3 dimensional space.
* Data visualization techniques like Chernoff faces and graph approaches provide only a representation and not an interpretation.
* Dimensionality reduction techniques fail to retain both the local and the global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
* The t-SNE approach converts the data into a matrix of pairwise similarities and visualizes this matrix.
* Based on SNE (Stochastic Neighbor Embedding).
* [Link to paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)

#### SNE

* Given a set of datapoints $x_1, x_2, ..., x_n$, $p_{i|j}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$ (with variance $\sigma_i^2$). The calculation of $\sigma_i$ is described later.
* Similarly, define $q_{i|j}$ as the conditional probability corresponding to the low-dimensional representations $y_i$ and $y_j$ (of $x_i$ and $x_j$). The variance of the Gaussian in this case is set to $1/\sqrt{2}$.
* The argument is that if $y_i$ and $y_j$ capture the similarity between $x_i$ and $x_j$, then $p_{i|j}$ and $q_{i|j}$ should be equal. So the objective function to be minimized is the Kullback-Leibler (KL) divergence between $P_i$ and $Q_i$, where $P_i$ ($Q_i$) is the conditional probability distribution given $x_i$ ($y_i$).
* Since the KL divergence is not symmetric, the objective function focuses on retaining the local structure.
* Users specify a value called perplexity (a measure of the effective number of neighbors). A binary search is performed to find the $\sigma_i$ which produces the $P_i$ having the same perplexity as specified by the user (a minimal sketch of this search is included at the end of these notes).
* Gradient descent (with momentum) is used to minimize the objective function, and Gaussian noise is added in the early stages to perform simulated annealing.

#### t-SNE (t-Distributed SNE)

##### Symmetric SNE

* A single KL divergence between P (the joint probability distribution in the high-dimensional space) and Q (the joint probability distribution in the low-dimensional space) is minimized.
* Symmetric because $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$.
* More robust to outliers and has a simpler gradient expression.

##### Crowding Problem

* When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from the moderately distant datapoints, and gaps cannot form between natural clusters.
* One way around this problem is to use UNI-SNE, but optimization of the cost function in that case is difficult.

##### t-SNE

* Instead of a Gaussian, use a heavy-tailed distribution (like the Student-t distribution) to convert distances into probability scores in the low-dimensional space. This way, a moderate distance in the high-dimensional space can be modeled by a larger distance in the low-dimensional space.
* The Student-t distribution is an infinite mixture of Gaussians, and the density of a point under this distribution can be computed very fast.
* The cost function is easy to optimize.

##### Optimization Tricks

###### Early Compression

* At the start of the optimization, force the datapoints (in the low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another.
* Implemented as an L2 penalty term proportional to the sum of the squared distances of the datapoints from the origin (see the sketch below).
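As a rough illustration of the two ideas above, here is a minimal NumPy sketch (not the authors' implementation) of the Student-t similarities $q_{ij}$ and the resulting t-SNE gradient, with the early-compression penalty folded in. The function name `tsne_grad`, the `compression` weight, and the assumption that a symmetric joint distribution `P` is already available are all choices made for this sketch.

```python
import numpy as np

def tsne_grad(Y, P, compression=0.0):
    """Y: (n, d) low-dimensional embedding, P: (n, n) symmetric joint probabilities p_ij."""
    # Pairwise squared Euclidean distances between the embedded points.
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    # Heavy-tailed Student-t kernel with one degree of freedom.
    num = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()                  # joint probabilities q_ij
    Q = np.maximum(Q, 1e-12)
    # Gradient of KL(P || Q): dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    W = (P - Q) * num
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    # Early compression: an L2 penalty lambda * sum_i ||y_i||^2 pulls points toward the origin.
    grad += 2.0 * compression * Y
    return grad, Q
```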
###### Early Exaggeration

* Scale up all the $p_{i|j}$'s so that larger $q_{i|j}$'s are obtained, with the effect that the natural clusters in the data form tight, widely separated clusters, as a lot of empty space is created in the low-dimensional space (see the optimization-loop sketch at the end of these notes).

##### t-SNE on large datasets

* Space and time complexity is quadratic in the number of datapoints, so it is infeasible to apply t-SNE to large datasets directly.
* Select a random subset of points (called landmark points) to display.
* For each landmark point, define a random walk starting at that landmark point and terminating at any other landmark point.
* $p_{i|j}$ is defined as the fraction of random walks starting at $x_i$ that finish at $x_j$ (both of these points being landmark points). This way, $p_{i|j}$ is not sensitive to "short-circuits" in the graph (due to noisy datapoints).

#### Advantages of t-SNE

* The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data.
* Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
* The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
* The random walk version of t-SNE takes care of the "short-circuit" problem.

#### Limitations of t-SNE

* It is unclear how t-SNE would perform on the general **dimensionality reduction** task for more than 3 dimensions. For such higher (than 3) target dimensions, a Student-t distribution with more degrees of freedom should be more appropriate.
* t-SNE reduces the dimensionality of the data mainly based on local properties of the data, which means it would fail on data that has an intrinsically high-dimensional structure (**curse of dimensionality**).
* The cost function for t-SNE is not convex, requiring several optimization parameters to be chosen, which affect the constructed solution.
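#### Illustrative Sketches

The SNE section above mentions a binary search for the bandwidth $\sigma_i$ that makes the conditional distribution $P_i$ match a user-specified perplexity, $Perp(P_i) = 2^{H(P_i)}$. A minimal sketch of that search is given below; the function names, the search bracket, and the iteration limits are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def conditional_probs(sq_dist_i, sigma):
    """Conditional probabilities for one datapoint, given its squared distances to all other points."""
    p = np.exp(-sq_dist_i / (2.0 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(sq_dist_i, target_perplexity, tol=1e-5, max_iter=100):
    # Assumed search bracket for sigma_i; in practice it depends on the data scale.
    lo, hi = 1e-10, 1e3
    sigma = 0.5 * (lo + hi)
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = conditional_probs(sq_dist_i, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))   # H(P_i) in bits
        perplexity = 2.0 ** entropy                 # Perp(P_i) = 2^{H(P_i)}
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma    # distribution too flat -> shrink the bandwidth
        else:
            lo = sigma    # distribution too peaked -> widen the bandwidth
    return sigma
```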
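Putting the pieces together, here is a simplified sketch of the optimization loop referred to in these notes: gradient descent with momentum, early exaggeration of the $p_{ij}$'s during the first iterations, and early compression via the penalty weight. It reuses the hypothetical `tsne_grad` from the earlier sketch, and the hyperparameter values are illustrative rather than the exact schedule used in the paper.

```python
import numpy as np

def fit_tsne(P, n_iter=1000, lr=100.0, momentum=0.8,
             exaggeration=4.0, exaggeration_iters=50, compression_iters=50):
    """P: (n, n) symmetric joint probabilities computed in the high-dimensional space."""
    n = P.shape[0]
    Y = 1e-4 * np.random.randn(n, 2)      # small random initial embedding
    velocity = np.zeros_like(Y)
    for it in range(n_iter):
        # Early exaggeration: scale up the p_ij's for the first iterations so
        # that tight, widely separated clusters can form.
        P_eff = P * exaggeration if it < exaggeration_iters else P
        # Early compression: keep the L2 penalty only at the start of the optimization.
        penalty = 0.01 if it < compression_iters else 0.0
        grad, _ = tsne_grad(Y, P_eff, compression=penalty)  # sketch shown earlier
        # Gradient descent with momentum.
        velocity = momentum * velocity - lr * grad
        Y += velocity
    return Y
```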