Visualizing Large-scale and High-dimensional Data on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

Visualizing Large-scale and High-dimensional Data
Jian Tang and Jingzhou Liu and Ming Zhang and Qiaozhu Mei
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.HC
more

Summaries/Notes 1

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* LargeVis - a technique to visualize large-scale and high-dimensional data in low-dimensional space.
* Problem relates to both information visualization and machine learning (and data mining) domain.
* [Link to the paper](https://arxiv.org/abs/1602.00370)

#### t-SNE

* State of the art method for visualization problem.
* Start by constructing a similarity structure from the data and then project the structure to 2/3 dimensional space.
* An accelerated version proposed that uses a K-nearest Neighbor (KNN) graph as the similarity structure.
* Limitations
* Computational cost of constructing KNN graph for high-dimensional data.
* Efficiency of graph visualization techniques breaks down for large data ($O(NlogN)$ complexity).
* Parameters are sensitive to the dataset.

#### LargeVis

* Constructs KNN graph (in a more efficient manner as compared to t-SNE).
* Uses a principled probabilistic model for graph visualization.
* Let us say the input is in the form of $N$ datapoints in $d$ dimensional space.

##### KNN Graph Construction

* Random projection tree used for nearest-neighborhood search in the high-dimensional space with euclidean distance as metric of distance.
* Tree is constructed by partitioning the entire space and the nodes in the tree correspond to subspaces.
* For every non-leaf node of the tree, select a random hyperplane that splits the subspace (corresponding to the leaf node) into two children.
* This is done till the number of nodes in the subspace reaches a threshold.
* Once the tree is constructed, each data point gets assigned a leaf node and points in the subspace of the leaf node are the candidate nearest neighbors of that datapoint.
* For high accuracy, a large number of trees should be constructed (which increases the computational cost).
* The paper counters this bottleneck by using the idea "a neighbor of my neighbor is also my neighbor" to increase the accuracy of the constructed graph.
* Basically construct an approximate KNN graph using random projection trees and then for each node, search its neighbor's neighbors as well.
* Edges are assigned weights just like in t-SNE.

##### Probabilistic Model For Graph Visualization

* Given a pair of vertices, the probability of observing an edge between them is given as a function of the distance between the projection of the pair of vertices in the lower dimensional space.
* The probability of observing an edge with weight $w$ is given as $w_{th}$ power of probability of observing an edge.
* Given a weighted graph, $G$, the likelihood of the graph can be modeled as the likelihood of observed edges plus the likelihood of negative edges (vertex pairs without edges).
* To simplify the objective function, some negative edges are sampled corresponding to each vertex using a noisy distribution.
* The edges are sampled with probability proportional to their weight and then treated as binary edges.
* The resulting objective function can be optimized using asynchronous stochastic gradient descent (very effective on sparse graphs).
* The overall complexity is $O(sMN)$, $s$ is the dimension of lower dimensional space, $M$ is the number of negative samples and $N$ is the number of vertices.

#### Experiments

* Data is first preprocessed with embedding learning techniques like SkipGram and LINE and brought down to 100-dimensional space.
* One limitation is that the data is preprocessed to reduce the number of dimensions to 100. This seems to go against the claim of scaling for hundreds of dimensions.
* KNN construction is fastest for LargeVis followed by random projection trees, NN Descent, and vantage point trees (used in t-SNE).
* LargeVis requires very few iterations to create highly accurate KNN graphs.
* KNN Classifier is used to measure the accuracy of graph visualization quantitatively.
* Compared to t-SNE, LargeVis is much more stable with respect to learning rate across datasets.
* LargeVis benefits from its linear complexity (as compared to t-SNE's log linear complexity) and for default learning rate, outperforms t-SNE for larger datasets.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private