Visualizing Large-scale and High-dimensional Data
Jian Tang, Jingzhou Liu, Ming Zhang, Qiaozhu Mei
arXiv e-Print archive - 2016 via Local arXiv
Keywords:
cs.LG, cs.HC
First published: 2016/02/01
Abstract: We study the problem of visualizing large-scale and high-dimensional data in
a low-dimensional (typically 2D or 3D) space. Much success has been reported
recently by techniques that first compute a similarity structure of the data
points and then project them into a low-dimensional space with the structure
preserved. These two steps suffer from considerable computational costs,
preventing the state-of-the-art methods such as the t-SNE from scaling to
large-scale and high-dimensional data (e.g., millions of data points and
hundreds of dimensions). We propose the LargeVis, a technique that first
constructs an accurately approximated K-nearest neighbor graph from the data
and then layouts the graph in the low-dimensional space. Comparing to t-SNE,
LargeVis significantly reduces the computational cost of the graph construction
step and employs a principled probabilistic model for the visualization step,
the objective of which can be effectively optimized through asynchronous
stochastic gradient descent with a linear time complexity. The whole procedure
thus easily scales to millions of high-dimensional data points. Experimental
results on real-world data sets demonstrate that the LargeVis outperforms the
state-of-the-art methods in both efficiency and effectiveness. The
hyper-parameters of LargeVis are also much more stable over different data
sets.
#### Introduction
* LargeVis - a technique to visualize large-scale and high-dimensional data in low-dimensional space.
* The problem relates to both the information visualization and the machine learning (and data mining) domains.
* [Link to the paper](https://arxiv.org/abs/1602.00370)
#### t-SNE
* State-of-the-art method for the visualization problem.
* Starts by constructing a similarity structure from the data and then projects the structure into a 2- or 3-dimensional space.
* An accelerated version (Barnes-Hut t-SNE) has been proposed that uses a K-Nearest Neighbor (KNN) graph as the similarity structure (a minimal usage sketch follows this list).
* Limitations
* Computational cost of constructing KNN graph for high-dimensional data.
 * Efficiency of graph visualization techniques breaks down for large data ($O(N \log N)$ complexity).
* Parameters are sensitive to the dataset.
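
A minimal sketch of running the accelerated (Barnes-Hut) t-SNE via scikit-learn; the library call is real, but the data, parameter values, and variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.manifold import TSNE

# Illustrative data: 10,000 points in 100 dimensions (placeholder, not a paper dataset).
X = np.random.RandomState(0).randn(10_000, 100)

# Barnes-Hut t-SNE: the layout step is O(N log N), but building the
# similarity structure and tuning perplexity/learning rate remain costly.
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200.0,
            method="barnes_hut", random_state=0)
Y = tsne.fit_transform(X)  # (10000, 2) low-dimensional embedding
```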
#### LargeVis
* Constructs a KNN graph in a more efficient manner than t-SNE.
* Uses a principled probabilistic model for graph visualization.
* Assume the input consists of $N$ data points in a $d$-dimensional space.
##### KNN Graph Construction
* Random projection trees are used for nearest-neighbor search in the high-dimensional space, with Euclidean distance as the distance metric.
* A tree is constructed by recursively partitioning the entire space; the nodes of the tree correspond to subspaces.
* For every non-leaf node of the tree, a random hyperplane is selected that splits the subspace corresponding to that node into two children.
* Splitting continues until the number of points in a subspace falls below a threshold.
* Once the tree is constructed, each data point is assigned to a leaf node, and the points in the subspace of that leaf node are the candidate nearest neighbors of the data point.
* For high accuracy, a large number of trees should be constructed (which increases the computational cost).
* The paper counters this bottleneck by using the idea "a neighbor of my neighbor is also my neighbor" to increase the accuracy of the constructed graph.
* Basically, construct an approximate KNN graph using a few random projection trees and then, for each node, also search among its neighbors' neighbors (a sketch of this refinement step follows this list).
* Edges are assigned weights just like in t-SNE.
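
A minimal sketch of the neighbor-exploring refinement, assuming candidate neighbor lists (e.g. from the random projection trees) are already available; the function name, data layout, and single-round structure are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def refine_knn(X, candidates, K):
    """One round of 'a neighbor of my neighbor is also my neighbor' refinement.

    X          : (N, d) data matrix
    candidates : list of candidate-neighbor index lists (e.g. from RP trees)
    K          : number of nearest neighbors to keep per point
    """
    N = X.shape[0]
    refined = []
    for i in range(N):
        # Expand the candidate set with the neighbors of the current candidates.
        pool = set(candidates[i])
        for j in candidates[i]:
            pool.update(candidates[j])
        pool.discard(i)
        pool = np.fromiter(pool, dtype=int)
        # Keep the K closest candidates under Euclidean distance.
        dists = np.linalg.norm(X[pool] - X[i], axis=1)
        refined.append(pool[np.argsort(dists)[:K]])
    return refined
```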
##### Probabilistic Model For Graph Visualization
* Given a pair of vertices, the probability of observing a (binary) edge between them is defined as a function of the distance between the projections of the two vertices in the low-dimensional space.
* The probability of observing an edge with weight $w$ is the $w$-th power of the probability of observing a binary edge.
* Given a weighted graph $G$, the likelihood of the graph can be modeled as the likelihood of the observed edges plus that of the negative edges (vertex pairs without an edge).
* To simplify the objective function, a few negative edges are sampled for each vertex from a noise distribution.
* The positive edges are sampled with probability proportional to their weights and then treated as binary edges.
* The resulting objective function (written out after this list) can be optimized using asynchronous stochastic gradient descent, which is very effective on sparse graphs.
* The overall complexity is $O(sMN)$, where $s$ is the dimension of the low-dimensional space, $M$ is the number of negative samples, and $N$ is the number of vertices.
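
Written out (a sketch based on the bullets above; the specific form of $f$ shown is one common choice, and $\gamma$ denotes the unified weight given to negative edges):

$$
p(e_{ij}=1) = f\big(\lVert \vec{y}_i - \vec{y}_j \rVert\big), \qquad
p(e_{ij}=w_{ij}) = p(e_{ij}=1)^{w_{ij}}, \qquad
\text{e.g. } f(x) = \frac{1}{1+x^2},
$$

and with $M$ negative edges per sampled positive edge, drawn from a noise distribution $P_n(j)$, the objective becomes

$$
O = \sum_{(i,j) \in E} w_{ij} \Big[ \log p(e_{ij}=1)
+ \sum_{k=1}^{M} \mathbb{E}_{j_k \sim P_n(j)} \, \gamma \log\big(1 - p(e_{ij_k}=1)\big) \Big].
$$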
#### Experiments
* Data is first preprocessed with embedding learning techniques like SkipGram and LINE and brought down to a 100-dimensional space.
* One limitation is that the data is preprocessed to reduce the number of dimensions to 100, which seems to go against the claim of scaling to hundreds of dimensions.
* KNN construction is fastest for LargeVis followed by random projection trees, NN Descent, and vantage point trees (used in t-SNE).
* LargeVis requires very few iterations to create highly accurate KNN graphs.
* A KNN classifier on the low-dimensional embeddings is used to measure the quality of the visualization quantitatively (a minimal evaluation sketch follows this list).
* Compared to t-SNE, LargeVis is much more stable with respect to learning rate across datasets.
* LargeVis benefits from its linear complexity (as compared to t-SNE's log-linear complexity) and, for the default learning rate, outperforms t-SNE on larger datasets.
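
A minimal sketch of such a KNN-classifier evaluation with scikit-learn; the data, split ratio, and value of k are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(Y, labels, k=5, test_size=0.3, seed=0):
    """Accuracy of a KNN classifier trained on the low-dimensional embedding Y."""
    Y_train, Y_test, y_train, y_test = train_test_split(
        Y, labels, test_size=test_size, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k).fit(Y_train, y_train)
    return clf.score(Y_test, y_test)

# Illustrative usage with random placeholders (not the paper's datasets):
Y = np.random.RandomState(0).randn(1000, 2)            # 2D embedding
labels = np.random.RandomState(1).randint(0, 10, 1000)  # class labels
print(knn_accuracy(Y, labels, k=5))
```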