Welcome to ShortScience.org! |
[link]
This paper presents a method to extract motion (dynamic) and skeleton / camera-view (static) representations from the video of a person represented as a 2D joints skeleton. This decomposition allows transferring the motion to different skeletons (retargeting) and many more. It does so by utilizing deep neural networks. https://i.imgur.com/J5jBzcs.png The architecture consists of motion and skeleton / camera-view encoders that decompose an input sequence of 2D joint positions into latent spaces and a decoder that reconstructs a sequence from such components. The motion vector varies in length, while skeleton and camera view representations are fixed. https://i.imgur.com/QaDksg1.png This is achieved by the nature of the network design. Specifically, motion encoder uses 1D convolutions with strides, thus output dimensions are proportionally related to the input. On the other hand, the static encoder uses global average pooling in the final layer to produce a fixed-size latent representation: https://i.imgur.com/Cf7TVKA.png More detailed design of the encoders and decoder is shown below: https://i.imgur.com/cpaveFm.png **Dataset**. Adobe Mixamo is used to obtain sequences of poses of different 3D characters. It allows creating multiple samples where different characters (with different skeleton structure) perform the same motions. These 3D video clips are then projected into 2D by selecting arbitrary view angles and distance to the object. Thus, we can easily create multiple pairs of 2D image sequences of characters (same or different) performing various actions (same or different) from various views. **Loss functions** used to for training (refer the paper for the detailed formulas): - *Cross Reconstruction Loss* It is a sum of two other losses. The first one is the reconstruction loss where the network tries to reconstruct original input. The second one is cross reconstruction loss where the network tries to reconstruct the sequence where a different character performs the exact same action as the input. It is best shown in the Figure below: https://i.imgur.com/ewZOAox.png - *Triplet Loss* This loss aims to bring latent spaces of similar motions closer together, while separate apart the ones that are different. It takes two triplets, where each contains two samples that share the same (or very similar) motion and one with different. The same concept is applied to the static latent space. - Foot velocity loss This loss helps to remove the foot skating phenomenon - hands and feet exhibit larger errors that the other keypoints. https://i.imgur.com/DclJEde.png where $V_{global}$ and $V_{joint_n}$ extract the global and local ($n$th joint) velocities from the reconstructed output $\hat{p}_{ij}$, respectively, and map them back to the image units, and $V_{orig_n}$ returns the original global velocity of the $n$th joint from the ground truth, $p_{ij}$ **Normalization** - subtract the root position from all joint locations in every frame - subtract the mean joint position and divide by the standard deviation (averaged over the entire dataset) - per-frame global velocity is not touched **Data Augmentation** applied during training: - temporal clipping during the batch creation process - scaling - same as to use different camera distance to the object - flipping symmetrical joints - dropping joints to simulate behavior of a real keypoint detector as they often miss some joints - adding real video data to the training and use reprojection loss in case no labels are given **Results and Evaluation** (to be continued) ... While the summary becomes too long to be a called a summary it is worth mentioning that there are several applications possible with this approach: - performance cloning - make any 2D skeleton repeat particular motions - motion retrieval - search videos that contain the particular target motion |
[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $W$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So $$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$ Let's consider an example matrix where of three customers (as rows) are associated with three movies (the columns) by a rating value. $$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$ We can decompose this into two matrices with $r = 1$. First lets do this without any non-negative constraint using an SVD reshaping matrices based on removing eigenvalues: $$ W = \left[\begin{array}{c c c} -0.656 \\\ -0.652 \\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20\\\\ \end{array}\right] $$ We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$): $$ W = \left[\begin{array}{c c c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \\\\ \end{array}\right] $$ Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. $$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\ 4.33 & 4.08 & 2.09 \\\ 2.52 & 2.37 & 1.21 \\\ \end{array}\right] $$ If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better` #### Paper Contribution This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. The multiplicative approach is discussed in contrast to an additive gradient decent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$. ### Still a draft |
[link]
This recent paper, a collaboration involving some of the authors of MAML, proposes an intriguing application of techniques developed in the field of meta learning to the problem of unsupervised learning - specifically, the problem of developing representations without labeled data, which can then be used to learn quickly from a small amount of labeled data. As a reminder, the idea behind meta learning is that you train models on multiple different tasks, using only a small amount of data from each task, and update the model based on the test set performance of the model. The conceptual advance proposed by this paper is to adopt the broad strokes of the meta learning framework, but apply it to unsupervised data, i.e. data with no pre-defined supervised tasks. The goal of such a project is, as so often is the case with unsupervised learning, to learn representations, specifically, representations we believe might be useful over a whole distribution of supervised tasks. However, to apply traditional meta learning techniques, we need that aforementioned distribution of tasks, and we’ve defined our problem as being over unsupervised data. How exactly are we supposed to construct the former out of the latter? This may seem a little circular, or strange, or definitionally impossible: how can we generate supervised tasks without supervised labels? https://i.imgur.com/YaU1y1k.png The artificial tasks created by this paper are rooted in mechanically straightforward operations, but conceptually interesting ones all the same: it uses an off the shelf unsupervised learning algorithm to generate a fixed-width vector embedding of your input data (say, images), and then generates multiple different clusterings of the embedded data, and then uses those cluster IDs as labels in a faux-supervised task. It manages to get multiple different tasks, rather than just one - remember, the premise of meta learning is in models learned over multiple tasks - by randomly up and down-scaling dimensions of the embedding before clustering is applied. Different scalings of dimensions means different points close to one another, which means the partition of the dataset into different clusters. With this distribution of “supervised” tasks in hand, the paper simply applies previously proposed meta learning techniques - like MAML, which learns a model which can be quickly fine tuned on a new task, or prototypical networks, which learn an embedding space in which observations from the same class, across many possible class definitions are close to one another. https://i.imgur.com/BRcg6n7.png An interesting note from the evaluation is that this method - which is somewhat amusingly dubbed “CACTUs” - performs best relative to alternative baselines in cases where the true underlying class distribution on which the model is meta-trained is the most different from the underlying class distribution on which the model is tested. Intuitively, this makes reasonable sense: meta learning is designed to trade off knowledge of any given specific task against the flexibility to be performant on a new class division, and so it gets the most value from trade off where a genuinely dissimilar class split is seen during testing. One other quick thing I’d like to note is the set of implicit assumptions this model builds on, in the way it creates its unsupervised tasks. First, it leverages the smoothness assumptions of classes - that is, it assumes that the kinds of classes we might want our model to eventually perform on are close together, in some idealized conceptual space. While not a perfect assumption (there’s a reason we don’t use KNN over embeddings for all of our ML tasks) it does have a general reasonableness behind it, since rarely are the kinds of classes very conceptually heterogenous. Second, it assumes that a truly unsupervised learning method can learn a representation that, despite being itself sub-optimal as a basis for supervised tasks, is a well-enough designed feature space for the general heuristic of “nearby things are likely of the same class” to at least approximately hold. I find this set of assumptions interesting because they are so simplifying that it’s a bit of a surprise that they actually work: even if the “classes” we meta-train our model on are defined with simple Euclidean rules, optimizing to be able to perform that separation using little data does indeed seem to transfer to the general problem of “separating real world, messier-in-embedding-space classes using little data”. |
[link]
A common critique of deep learning is its brittleness off-distribution, combined with its tendency to give confident predictions for off-distribution inputs, as is seen in the case of adversarial examples. In response to this critique, a number of different methods have cropped up in recent years, that try to capture a model's uncertainty as well as its overall prediction. This paper tries to do a broad evaluation of uncertainty methods, and, particularly, to test how they perform on out of distribution data, including both data that is perturbed from its original values, and fully OOD data from ground-truth categories never seen during training. Ideally, we would want an uncertainty method that is less confident in its predictions as data is made more dissimilar from the distribution that the model is trained on. Some metrics the paper uses for capturing this are: - Brier Score (The difference between predicted score and ground truth 0/1 label, averaged over all examples) - Negative Log Likelihood - Expected Calibration Error (Within a given bucket, this is calculated as the difference between accuracy to ground truth labels, and the average predicted score in that bucket, capturing that you'd ideally want to have a lower predicted score in cases where you have low accuracy, and vice versa) - Entropy - For labels that are fully out of distribution, and don't map to any of the model's categories, you can't directly calculate ground truth accuracy, but you can ideally ask for a model that has high entropy (close to uniform) probabilities over the classes it knows about when the image is drawn from an entirely different class The authors test over image datasets small (MNIST) and large (ImageNet and CIFAR10), as well as a categorical ad-click-prediction dataset. They came up with some interesting findings. https://i.imgur.com/EVnjS1R.png 1. More fully principled Bayesian estimation of posteriors over parameters, in the form of Stochastic Variational Inference, works well on MNIST, but quite poorly on either categorical data or higher dimensional image datasets https://i.imgur.com/3emTYNP.png 2. Temperature scaling, which basically performs a second supervised calibration using a hold-out set to push your probabilities towards true probabilities, performs well in-distribution but collapses fairly quickly off-distribution (which sort of makes sense given that it too is just another supervised method that can do poorly when off-distribution) 3. In general, ensemble methods, where you train different models on different subsets of the data and take their variance as uncertainty, perform the best across the bigger image models as well as the ad click model, likely because SVI (along with many other Bayesian methods) is too computationally intensive to get to work well on higher-dimensional data 4. Overall, none of the methods worked particularly well, and even the best-performing ones were often confidently wrong off-distribution I think it's fair to say that we're far from where we wish we were when it comes to models that "know when they don't know," and this paper does a good job of highlighting that in specific fashion.
1 Comments
|
[link]
In this paper, the authors raise a very important point for instance based image retrieval. For a task like an image recognition features extracted from higher layer of deep networks works really well in general, but for task like instance based image retrieval features extracted from higher layers don't prove to be that useful, so the authors suggest that we take features from lower layer and on those features, apply [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf). On top of the VLAD encoding as part of post processing, we perform steps like intra-normalisation and then apply PCA and reduce the encoding to a size of 128 Dimension. The authors have performed their experiments using [Googlenet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG-16](https://arxiv.org/pdf/1409.1556v6.pdf), and they tried Inception 3a, Inception 4a and Inception 4e on GoogleNet and conv4_2, conv5_1 and conv5_2 on VGG-16. The above mentioned layers has almost similar performance on the dataset they have used. The performance metric used by the authors is Mean Average Precision(MAP). |