#### Introduction

* Method to visualize high-dimensional data points in 2 or 3 dimensional space.
* Data visualization techniques like Chernoff faces and graph approaches provide only a representation and not an interpretation.
* Dimensionality reduction techniques fail to retain both the local and the global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
* The t-SNE approach converts the data into a matrix of pairwise similarities and visualizes this matrix.
* Based on SNE (Stochastic Neighbor Embedding).
* [Link to paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)

#### SNE

* Given a set of datapoints $x_1, x_2, ..., x_n$, $p_{i|j}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$ (with variance $\sigma_i^2$). The calculation of $\sigma_i$ is described later.
* Similarly, define $q_{i|j}$ as the conditional probability corresponding to the low-dimensional representations $y_i$ and $y_j$ (of $x_i$ and $x_j$). The variance of the Gaussian in this case is set to $1/\sqrt{2}$.
* The argument is that if $y_i$ and $y_j$ capture the similarity between $x_i$ and $x_j$, then $p_{i|j}$ and $q_{i|j}$ should be equal. So the objective function to be minimized is the Kullback-Leibler (KL) divergence between $P_i$ and $Q_i$, where $P_i$ ($Q_i$) is the conditional probability distribution given $x_i$ ($y_i$).
* Since the KL divergence is not symmetric, the objective function focuses on retaining the local structure.
* Users specify a value called perplexity (a measure of the effective number of neighbors). A binary search is performed to find the $\sigma_i$ which produces the $P_i$ having the same perplexity as specified by the user (a minimal sketch of this search is included at the end of these notes).
* Gradient descent (with momentum) is used to minimize the objective function, and Gaussian noise is added in the early stages to perform simulated annealing.

#### t-SNE (t-Distributed SNE)

##### Symmetric SNE

* A single KL divergence between P (the joint probability distribution in the high-dimensional space) and Q (the joint probability distribution in the low-dimensional space) is minimized.
* Symmetric because $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$.
* More robust to outliers and has a simpler gradient expression.

##### Crowding Problem

* When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from the moderately distant datapoints, and gaps cannot form between natural clusters.
* One way around this problem is to use UNI-SNE, but optimization of the cost function in that case is difficult.

##### t-SNE

* Instead of a Gaussian, use a heavy-tailed distribution (like the Student-t distribution) to convert distances into probability scores in the low-dimensional space. This way, a moderate distance in the high-dimensional space can be modeled by a larger distance in the low-dimensional space.
* The Student-t distribution is an infinite mixture of Gaussians, and the density of a point under this distribution can be computed very fast.
* The cost function is easy to optimize.

##### Optimization Tricks

###### Early Compression

* At the start of the optimization, force the datapoints (in the low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another.
* Implemented as an L2 penalty term proportional to the sum of the squared distances of the datapoints from the origin (see the sketch below).
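As a rough illustration of the two ideas above, here is a minimal NumPy sketch (not the authors' implementation) of the Student-t similarities $q_{ij}$ and the resulting t-SNE gradient, with the early-compression penalty folded in. The function name `tsne_grad`, the `compression` weight, and the assumption that a symmetric joint distribution `P` is already available are all choices made for this sketch.

```python
import numpy as np

def tsne_grad(Y, P, compression=0.0):
    """Y: (n, d) low-dimensional embedding, P: (n, n) symmetric joint probabilities p_ij."""
    # Pairwise squared Euclidean distances between the embedded points.
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    # Heavy-tailed Student-t kernel with one degree of freedom.
    num = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()                  # joint probabilities q_ij
    Q = np.maximum(Q, 1e-12)
    # Gradient of KL(P || Q): dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    W = (P - Q) * num
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    # Early compression: an L2 penalty lambda * sum_i ||y_i||^2 pulls points toward the origin.
    grad += 2.0 * compression * Y
    return grad, Q
```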
###### Early Exaggeration

* Scale up all the $p_{i|j}$'s so that larger $q_{i|j}$'s are obtained, with the effect that the natural clusters in the data form tight, widely separated clusters, as a lot of empty space is created in the low-dimensional space (see the optimization-loop sketch at the end of these notes).

##### t-SNE on large datasets

* Space and time complexity is quadratic in the number of datapoints, so it is infeasible to apply t-SNE to large datasets directly.
* Select a random subset of points (called landmark points) to display.
* For each landmark point, define a random walk starting at that landmark point and terminating at any other landmark point.
* $p_{i|j}$ is defined as the fraction of random walks starting at $x_i$ that finish at $x_j$ (both of these points being landmark points). This way, $p_{i|j}$ is not sensitive to "short-circuits" in the graph (due to noisy datapoints).

#### Advantages of t-SNE

* The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data.
* Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
* The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
* The random walk version of t-SNE takes care of the "short-circuit" problem.

#### Limitations of t-SNE

* It is unclear how t-SNE would perform on the general **dimensionality reduction** task for more than 3 dimensions. For such higher (than 3) target dimensions, a Student-t distribution with more degrees of freedom should be more appropriate.
* t-SNE reduces the dimensionality of the data mainly based on local properties of the data, which means it would fail on data that has an intrinsically high-dimensional structure (**curse of dimensionality**).
* The cost function for t-SNE is not convex, requiring several optimization parameters to be chosen, which affect the constructed solution.
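#### Illustrative Sketches

The SNE section above mentions a binary search for the bandwidth $\sigma_i$ that makes the conditional distribution $P_i$ match a user-specified perplexity, $Perp(P_i) = 2^{H(P_i)}$. A minimal sketch of that search is given below; the function names, the search bracket, and the iteration limits are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def conditional_probs(sq_dist_i, sigma):
    """Conditional probabilities for one datapoint, given its squared distances to all other points."""
    p = np.exp(-sq_dist_i / (2.0 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(sq_dist_i, target_perplexity, tol=1e-5, max_iter=100):
    # Assumed search bracket for sigma_i; in practice it depends on the data scale.
    lo, hi = 1e-10, 1e3
    sigma = 0.5 * (lo + hi)
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = conditional_probs(sq_dist_i, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))   # H(P_i) in bits
        perplexity = 2.0 ** entropy                 # Perp(P_i) = 2^{H(P_i)}
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma    # distribution too flat -> shrink the bandwidth
        else:
            lo = sigma    # distribution too peaked -> widen the bandwidth
    return sigma
```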
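Putting the pieces together, here is a simplified sketch of the optimization loop referred to in these notes: gradient descent with momentum, early exaggeration of the $p_{ij}$'s during the first iterations, and early compression via the penalty weight. It reuses the hypothetical `tsne_grad` from the earlier sketch, and the hyperparameter values are illustrative rather than the exact schedule used in the paper.

```python
import numpy as np

def fit_tsne(P, n_iter=1000, lr=100.0, momentum=0.8,
             exaggeration=4.0, exaggeration_iters=50, compression_iters=50):
    """P: (n, n) symmetric joint probabilities computed in the high-dimensional space."""
    n = P.shape[0]
    Y = 1e-4 * np.random.randn(n, 2)      # small random initial embedding
    velocity = np.zeros_like(Y)
    for it in range(n_iter):
        # Early exaggeration: scale up the p_ij's for the first iterations so
        # that tight, widely separated clusters can form.
        P_eff = P * exaggeration if it < exaggeration_iters else P
        # Early compression: keep the L2 penalty only at the start of the optimization.
        penalty = 0.01 if it < compression_iters else 0.0
        grad, _ = tsne_grad(Y, P_eff, compression=penalty)  # sketch shown earlier
        # Gradient descent with momentum.
        velocity = momentum * velocity - lr * grad
        Y += velocity
    return Y
```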