Welcome to ShortScience.org! 
[link]
This paper proposes a 3D human pose estimation in video method based on the dilated temporal convolutions applied on 2D keypoints (input to the network). 2D keypoints can be obtained using any person keypoint detector, but Mask RCNN with ResNet101 backbone, pretrained on COCO and finetuned on 2D projections from Human3.6M, is used in the paper. https://i.imgur.com/CdQONiN.png The poses are presented as 2D keypoint coordinates in contrast to using heatmaps (i.e. Gaussian operation applied at the keypoint 2D location). Thus, 1D convolutions over the time series are applied, instead of 2D convolutions over heatmaps. The model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses ( concatenated $(x,y)$ coordinates of the joints in each frame) as input and transforms them through temporal convolutions. https://i.imgur.com/tCZvt6M.png The `Slice` layer in the residual connection performs padding (or slicing) the sequence with replicas of boundary frames (to both left and right) to match the dimensions with the main block as zeropadding is not used in the convolution operations. 3D pose estimation is a difficult task particularly due to the limited data available online. Therefore, the authors propose semisupervised approach of training the 2D>3D pose estimation by exploiting unlabeled video. Specifically, 2D keypoints are detected in the unlabeled video with any keypoint detector, then 3D keypoints are predicted from them and these 3D points are reprojected back to 2D (camera intrinsic parameters are required). This is idea similar to cycle consistency in the [CycleGAN](https://junyanz.github.io/CycleGAN/), for instance. https://i.imgur.com/CBHxFOd.png In the semisupervised part (bottom part of the image above) training penalizes when the reprojected 2D keypoints are far from the original input. Weighted mean perjoint position error (WMPJPE) loss, weighted by the inverse of the depth to the object (since far objects should contribute less to the training than close ones) is used as the optimization goal. The two networks (`supervised` above, `semisupervised` below) have the same architecture but do not share any weights. They are jointly optimized where `semisupervised` part serves as a regularizer. They communicate through the path aiming to make sure that the mean bone length of the above and below branches match. The interesting tendency is observed from the MPJPE analysis with different amounts of supervised and unsupervised data available. Basically, the `semisupervised` approach becomes more effective when less labeled data is available. https://i.imgur.com/bHpVcSi.png Additionally, the error is reduced when the ground truth keypoints are used. This means that a robust and accurate 2D keypoint detector is essential for the accurate 3D pose estimation in this setting. https://i.imgur.com/rhhTDfo.png 
[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server. Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by:  Sending a copy of the current model to each device/client  Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device  Sending that gradient back to the central server  Averaging those gradients and applying them all at once to a central model The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be average with other clients models. Specifically, their algorithm, FedAvg, works by:  Dividing the data on a given device into batches of size B  Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server  Repeating this for E epochs Conceptually, this should work perfectly well in the world where data from each batch is IID  independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors experiment empirically ask the question whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily nonIID (in their case: where different "devices" had nonoverlapping sets of digits). https://i.imgur.com/xq9vi8S.png They show that, in both the IID and nonIID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communciation than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases. 
[link]
Largescale transformers on unsupervised text data have been wildly successful in recent years; arguably, the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) largescale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning  it's costly, and narrowly defined in terms of a certain task  but that problem is even more exacerbated when it comes to labeling properties of molecules, since they typically require wetlab chemistry to empirically measure. Given that, and also given the fact that we often want to predict new properties  like effectiveness against a new targetable drug receptor  that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere. There are two main conceptual parts to this paper and its method  named GROVER, in truetoMLform tortured acronym style. The first is the actual architecture of their model itself, which combines both a messagepassing Graph Neural Network to aggregate local information, and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is: https://i.imgur.com/JY4vRdd.png  There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and edge embedding, for four embeddings total. I'll describe the one for nodes, and the parallel for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes  In the NodeTransformer version, a message passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weighttransforming them, then aggregating again) until each node has a representation that has "absorbed" in information from a few hops out of its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, and so each nodes end up with three different vectors for these three things.  Multiheaded attention is then performed over these node representations, in the normal way, where all keys and queries are dotproducted together, and put into a softmax to calculate a weighted average over the values  We now have nodelevel representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put into a MLP layer and Layer Norm before finally outputting a nodebased node and edge representation. This is then joined by an edgebased node and edge representation from the parallel stack. These are aggregated on a fullgraph level to predict graphlevel properties https://i.imgur.com/NNl6v4Y.png The other component of the GROVER model is the way this architecture is actually trained  without explicit supervised labels. The authors use two tasks  one local, and one global. The local task constructs labels based on local contextual properties of a given atom  for example, the atom here has one doublebonded Nitrogen and one singlebonded Oxygen in its local environment  and tries to predict those labels given the representations of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different modifs or functional groups in the molecule, and encodes those into an 85long onehot vector that is being predicted on a graph level. https://i.imgur.com/jzbYchA.png With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are finetuned on small amounts of labeled data. The results are pretty impressive  it achieves new SOTA performance by relatively large amounts on all tasks, even relative to exist semisupervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graphaggregation step before a transformer (the alternative being just doing a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly in downstream tasks) also performs worse. One thing I wish they'd directly ablated was the valueadd of the local (also referred to as "contextual") and global semisupervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly. 
[link]
[code](https://github.com/openai/improvedgan), [demo](http://infinitechamber35121.herokuapp.com/cifarminibatch/1/?), [related](http://www.inference.vc/understandingminibatchdiscriminationingans/) ### Feature matching problem: overtraining on the current discriminator solution: ￼$E_{x \sim p_{\text{data}}}f(x)  E_{z \sim p_{z}(z)}f(G(z))_{2}^{2}$ were f(x) activations intermediate layer in discriminator ### Minibatch discrimination problem: generator to collapse to a single point solution: for each sample i, concatenate to $f(x_i)$ features $b$ measuring its distance to other samples j (i and j are both real or generated samples in same batch): $\sum_j \exp(M_{i, b}  M_{j, b}_{L_1})$ ￼ this generates visually appealing samples very quickly ### Historical averaging problem: SGD fails by going into extended orbits solution: parameters revert to the mean $ \theta  \frac{1}{t} \sum_{i=1}^t \theta[i] ^2$ ￼ ### Onesided label smoothing problem: discriminator vulnerability to adversarial examples solution: discriminator target for positive samples is 0.9 instead of 1 ### Virtual batch normalization problem: using BN cause output of examples in batch to be dependent solution: use reference batch chosen once at start of training and each sample is normalized using itself and the reference. It's expensive so used only on generation ### Assessment of image quality problem: MTurk not reliable solution: use inception model p(yx) to compute ￼$\exp(\mathbb{E}_x \text{KL}(p(y  x)  p(y)))$ on 50K generated images x ### Semisupervised learning use the discriminator to also classify on K labels when known and use all real samples (labels and unlabeled) in the discrimination task ￼$D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case use feature matching but not minibatch discrimination. It also improves the quality of generated images.
3 Comments

[link]
In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions. A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure: 1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$ 2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$ 3. A function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ can be drawn from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$ Applying this procedure to regression, means that the resulting function vector $\pmb{\mathrm{f}}$ shall be drawn in a way that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$ yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noisefree observations with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{1}(\pmb{\mathrm{f}}  \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}')  \pmb{\Sigma}(\pmb{X}, \pmb{x}')$ with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being a vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$. Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$ resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noisefree observationshttps://i.imgur.com/BWvsB7T.png (no variance at training points). In the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and provide thus a way to include prior knowledge e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$, 1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$, 2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and 3. apply an optimization algorithm. It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between datafit and penalty is performed automatically. Gaussian Processes provide a very flexible way for finding a suitable regression model. However, they require the high computational complexity $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to nonGaussian likelihoods remains complicated. 