Welcome to ShortScience.org! |
[link]
SGD is a widely used optimization method for training the parameters of some model f on some given task. Since the convergence of SGD is related to the variance of the stochastic gradient estimate, there's been a lot of work on trying to come up with such stochastic estimates with smaller variance. This paper does it using an importance sampling (IS) Monte Carlo estimate of the gradient, and learning the proposal distribution $q$ of the IS estimate. The proposal distribution $q$ is parametrized in some way, and is trained to minimize the variance of the gradient estimate. It is trained simultaneously while the model $f$ that SGD (i.e. the SGD that uses IS to get its gradient) is training. To make this whole story more recursive, the proposal distribution $q$ is also trained with SGD :-) This makes sense, since one expects the best proposal to depend on the value of the parameters of model $f$, so the best proposal $q$ should vary as $f$ is trained. One application of this idea is in optimizing a classification model over a distribution that is imbalanced class-wise (e.g. there are classes with much fewer examples). In this case, the proposal distribution determines how frequently we sample examples from each class (conditioned on the class, training examples are chosen uniformly). #### My two cents This is a really cool idea. I particularly like the application to training on an imbalanced classification problem. People have mostly been using heuristics to tackle this problem, such as initially sampling each class equally as often, and then fine-tuning/calibrating the model using the real class proportions. This approach instead proposes a really elegant, coherent, solution to this problem. I would have liked to see a comparison with that aforementioned heuristic (for mainly selfish reasons :-) ). They instead compare with an importance sampling approach with proposal that assigns the same probability to each class, which is a reasonable alternative (though I don't know if it's used as often as the more heuristic approach). There are other applications, to matrix factorization and reinforcement learning, that are presented in the paper and seem neat, though I haven't gone through those as much. Overall, one of my favorite paper this year: it's original, tackles a problem for which I've always hated the heuristic solution I'm using now, proposes an elegant solution to it, and is applicable even more widely than that setting. |
[link]
Tree-based ML models are becoming increasingly popular, but in the explanation space for these type of models is woefully lacking explanations on a local level. Local level explanations can give a clearer picture on specific use-cases and help pin point exact areas where the ML model maybe lacking in accuracy. **Idea**: We need a local explanation system for trees, that is not based on simple decision path, but rather weighs each feature in comparison to every other feature to gain better insight on the model's inner workings. **Solution**: This paper outlines a new methodology using SHAP relative values, to weigh pairs of features to get a better local explanation of a tree-based model. The paper also outlines how we can garner global level explanations from several local explanations, using the relative score for a large sample space. The paper also walks us through existing methodologies for local explanation, and why these are biased toward tree depth as opposed to actual feature importance. The proposed explanation model titled TreeExplainer exposes methods to compute optimal local explanation, garner global understanding from local explanations, and capture feature interaction within a tree based model. This method assigns Shapley interaction values to pairs of features essentially ranking the features so as to understand which features have a higher impact on overall outcomes, and analyze feature interaction. |
[link]
If you read modern (that is, 2018-2020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of lineage of the technique within this domain. This 2015 paper, by Duvenaud et al, is the earliest one I can find. It focuses the entire paper on comparing differentiable, message-passing networks to the state of the art standard at the time, circular fingerprints (more on that in a bit). I really appreciated this approach, which, rather than trying to claim an unrealistic level of novelty, goes into detail on the prior approach, and carves out specific areas of difference. At a high level, the authors' claim is: our model is, in its simplest case, a more flexible and super version of existing work. The unspoken corollary, which ended up being proven true, is that the flexibility of the neural network structure makes it easy to go beyond this initial level of simplicity. Circular Fingerprinting (or, more properly, Extended-Connectivity Circular Fingerprints), is a fascinating algorithm that captures many of the elements of convolution: shared weights, a hierarchy of kernels that match patterns at different scales, and a clever way of aggregating information across an arbitrary number of input nodes. Mechanistically, Circular Fingerprints work by: 1) Taking each atom, and creating a concatenated vector of its basic features, along with the basic features of each atom it's bonded to (with bonded neighbors quasi-randomly) 2) Calculating next-level features by applying some number of hash functions (roughly equivalent to convolutional kernels) to the neighborhood feature vector at the lower level to produce an integer 3) For each feature, setting the value of the fingerprint vector to 1 at the index implied by the integer in step (2) 4) Iterating this process at progressively higher layers, using the hash The effect of this is to assign each index of the vector to an binary feature (modulo hash collisions), where that feature is activated if an exact match is found to a structure within a given atom. Its main downside is that (a) its "kernel" equivalents are fixed and not trainable, since they're just random hashes, and (b) its features represent *exact* matches to lower-level feature patterns, which means you can't have one feature activated to different degrees by variations on a pattern it's identifying. https://i.imgur.com/V8FpfVE.png Duvenaud et al present their alternative in terms of keeping a similar structure, but swapping out fixed and binary components for trainable (because differentiable) and continuous ones. Instead of concatenating a random sorting of atom neighbors to enforce invariance to sorting, they simply sum feature vectors across neighbors, which is also an order-invariantoperation. Instead of applying hash functions, they apply parametrized kernel functions, with the same parameters used across all aggregated neighborhood vectors . This will no longer look for exact matches, but will activate to the extent a structure within an atom matches against a kernel pattern. Then, these features are put into a softmax, which instead setting an index of a vector to a sharp one value, activates different feature indices in the final vector to differing degrees. The final fingerprint is simply the sum of these softmax feature activations for each atom. The authors do a few tests to confirm their substitution is working well, including starting out with a random network (to better approximate the random hash functions), comparing distances between atoms according to either the circular or neural footprint (which had a high correlation), and confirming that that performs similarly to circular fingerprints on a set of supervised learning tasks on modules. When they trained weights to be better than random on three such supervised tasks, they found that their model was comparable or better than circular fingerprints on all three (to break that down: it was basically equivalent on one, and notably better on the other two, according to mean squared error) This really is the simplest possible version of a message-passing or graph convolutional network (it doesn't use edge features, it doesn't calculate features of a neighbor-connection according to the features of each node, etc), but it's really satisfying to see it laid out as a next-step alternative that offered value just by stepping away from exact-match feature dynamics and non-random functions, even without all the sophisticated additions that would later be added to such models.
2 Comments
|
[link]
Main purpose: * This work proposes a software-based resolution augmentation method which is more agile and simpler to implement than hardware engineering solutions. * The paper examines three deep learning single image super resolution techniques on pCLE images * A video-registration based method is proposed to estimate ground truth HR pCLE images (this can be assumed as the main objective of the paper) Highlights: * The papers emphasise that this is the first work to address the image resolution problem in pCLE image acquisitions * The paper introduces useful information on how pCLE devices work * Strong related work * Clear story * Comprehensive evaluation Main Idea: * Use video-registration based techniques to estimate the HR images (real ground truth HR image is not available) * Simulate LR images from estimate HR images with help of Voronoi diagram and Delaunay-based linear interpolation. * Train an Exemplar-based SR model (EBSR -- DL-based approach) to learn the mapping between simulated LR and estimate HR images. Methodology Details * To estimate the HR images, a video-registration based mosaicking techniques (by the same authors in MIA 2006) is used which fuses a collection of input images by averaging the temporal information. * Since mosaicking generates single large filed-of-view mosaic image from LR images, the mosaic-to-image diffeomorphic spatial transformation is used which results from the mosaicking process to propagate and crop the fused information from the mosaic back into each input LR image space. * At this point, the authors observe that the misalignment between input LR images (used in the video-registration based mosaicking technique) and estimate HR cause training problem for the EBSR model. So, they treat the HR images as realistic and chose to simulate LR images from them!!!! * Simulated LR images by obtained using the Voronoi diagram (averaging the Voronoi cell on HR image) + additive noise on estimate HR images. * Finally, they build to experimental datasets 1) LR_org and HR and 2) LR_synth and HR and train three CNN SR models on these twor datasets. * They train FSRCNN, EDSR, SRGAN * The networks are trained using L1+SSIM loss functions Experiment Notes: * SSIM and GCF are used to quantitatively assess the performance of the models. * A composite score is also used to take SSIM and GCF into account jointly * In the ideal case, when the models are trained and etsted on simulated LR and HR images, the quantitative results are convincing. * "From this experiment, it is possible to conclude that the proposed solution is capable of performing SR reconstruction when the models are trained on synthetic data with no domain gap at test time" * When models are trained and tested on original LR and estimate HR images, the performance is not reasonable * When the models are trained on simulated LR images and tested on original LR images, the results become better compared to the previous case, * For a solid conclusion, and MOS study was carried out. The models are trained on simulated LR images. |
[link]
This is follow-up work to the ResNets paper. It studies the propagation formulations behind the connections of deep residual networks and performs ablation experiments. A residual block can be represented with the equations $y_l = h(x_l) + F(x_l, W_l); x_{l+1} = f(y_l)$. $x_l$ is the input to the l-th unit and $x_{l+1}$ is the output of the l-th unit. In the original ResNets paper, $h(x_l) = x_l$, $f$ is ReLu, and F consists of 2-3 convolutional layers (bottleneck architecture) with BN and ReLU in between. In this paper, they propose a residual block with both $h(x)$ and $f(x)$ as identity mappings, which trains faster and performs better than their earlier baseline. Main contributions: - Identity skip connections work much better than other multiplicative interactions that they experiment with: - Scaling $(h(x) = \lambda x)$: Gradients can explode or vanish depending on whether modulating scalar \lambda > 1 or < 1. - Gating ($1-g(x)$ for skip connection and $g(x)$ for function F): For gradients to propagate freely, $g(x)$ should approach 1, but F gets suppressed, hence suboptimal. This is similar to highway networks. $g(x)$ is a 1x1 convolutional layer. - Gating (shortcut-only): Setting high biases pushes initial $g(x)$ towards identity mapping, and test error is much closer to baseline. - 1x1 convolutional shortcut: These work well for shallower networks (~34 layers), but training error becomes high for deeper networks, probably because they impede gradient propagation. - Experiments on activations. - BN after addition messes up information flow, and performs considerably worse. - ReLU before addition forces the signal to be non-negative, so the signal is monotonically increasing, while ideally a residual function should be free to take values in (-inf, inf). - BN + ReLU pre-activation works best. This also prevents overfitting, due to BN's regularizing effect. Input signals to all weight layers are normalized. ## Strengths - Thorough set of experiments to show that identity shortcut connections are easiest for the network to learn. Activation of any deeper unit can be written as the sum of the activation of a shallower unit and a residual function. This also implies that gradients can be directly propagated to shallower units. This is in contrast to usual feedforward networks, where gradients are essentially a series of matrix-vector products, that may vanish, as networks grow deeper. - Improved accuracies than their previous ResNets paper. ## Weaknesses / Notes - Residual units are useful and share the same core idea that worked in LSTM units. Even though stacked non-linear layers are capable of asymptotically approximating any arbitrary function, it is clear from recent work that residual functions are much easier to approximate than the complete function. The [latest Inception paper](http://arxiv.org/abs/1602.07261) also reports that training is accelerated and performance is improved by using identity skip connections across Inception modules. - It seems like the degradation problem, which serves as motivation for residual units, exists in the first place for non-idempotent activation functions such as sigmoid, hyperbolic tan. This merits further investigation, especially with recent work on function-preserving transformations such as [Network Morphism](http://arxiv.org/abs/1603.01670), which expands the Net2Net idea to sigmoid, tanh, by using parameterized activations, initialized to identity mappings. |