ShortScience.org Latest Summaries
https://shortscience.org
60Sat, 28 May 2022 01:00:02 +00002105.05837journals/corr/2105.058374When Does Contrastive Visual Representation Learning Work?CodyWildThis is a mildly silly paper to summarize, since there isn't really a new mechanism to understand, but rather a number of straightforward (and interesting!) empirical results that are also quite well-explained in the paper itself. That said, for the sake of a tiny bit more brevity than the paper itself provides, I'll try to pull out some of the conclusions I found the most interesting here.
The general goal of this paper is to better understand the contours of when self-supervised representati...
https://shortscience.org/paper?bibtexKey=journals/corr/2105.05837#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2105.05837#decodyngWed, 24 Nov 2021 06:03:37 +00001911.05507journals/corr/abs-1911-055074Compressive Transformers for Long-Range Sequence ModellingCodyWildThis paper is an interesting extension of earlier work, in the TransformerXL paper, that sought to give Transformers access to a "memory" beyond the scope of the subsequence where full self-attention was being performed. This was done by caching the activations from prior subsequences, and making them available to the subsequence currently being calculated in a "read-only" way, with gradients not propagated backwards. This had the effect of (1) reducing the maximum memory size compared to simply...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1911-05507#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1911-05507#decodyngMon, 22 Nov 2021 06:34:58 +00002101.03961journals/corr/2101.039614Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient SparsityCodyWildThe idea of the Switch Transformer is to have more parameters available for a network to use, but to only use a small subset of those parameters for each example that's run through the network. This is achieved through a routing scheme, whereby a weighting layer is applied to each token and produces a set of logits/softmax weights over the set of possible experts. The token is then sent to the expert that was given the highest weight. The network is implemented such that different experts can ac...
https://shortscience.org/paper?bibtexKey=journals/corr/2101.03961#decodyng
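The routing step described in the Switch Transformer summary above can be sketched in a few lines. This is a minimal pure-Python sketch under stated assumptions, not the paper's implementation: `route_top1`, `switch_layer`, the router weight matrix, and the toy experts are hypothetical names, and scaling the expert output by the router probability is an assumption that the summary itself does not state.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for a single token.

    router_weights holds one weight vector per expert (hypothetical layout).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def switch_layer(tokens, router_weights, experts):
    """Send each token to its highest-weighted expert only."""
    outputs = []
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        expert_out = experts[idx](token)
        # Assumption: the expert output is scaled by the router probability.
        outputs.append([gate * v for v in expert_out])
    return outputs
```

Only one expert function is called per token, so adding experts grows the parameter count without growing the per-token compute.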
https://shortscience.org/paper?bibtexKey=journals/corr/2101.03961#decodyngFri, 19 Nov 2021 07:12:52 +00001807.11626journals/corr/abs-1807-116264MnasNet: Platform-Aware Neural Architecture Search for MobileCodyWildWhen machine learning models need to run on personal devices, that implies a very particular set of constraints: models need to be fairly small and low-latency when run on a limited-compute device, without much loss in accuracy. A number of human-designed architectures have been engineered to try to solve for these constraints (depthwise convolutions, inverted residual bottlenecks), but this paper's goal is to use Neural Architecture Search (NAS) to explicitly optimize the architecture against l...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1807-11626#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1807-11626#decodyngWed, 17 Nov 2021 02:23:03 +00002010.13321journals/corr/abs-2010-133214View-Invariant, Occlusion-Robust Probabilistic Embedding for Human PoseCodyWildThe goal of this paper is to learn a model that embeds 2D keypoints (the locations of specific key body parts in 2D space) representing a particular pose into a vector embedding where nearby points in embedding space are also nearby in 3D space. This sort of model is useful because the same 3D pose can generate a wide variety of 2D pose projections, and it can be useful to learn which apparently-distinct representations actually map to the same 3D pose.
To do this, the basic approach used by th...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-13321#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-13321#decodyngTue, 16 Nov 2021 02:15:03 +00001602.05629MahMoo16Communication4Communication-Efficient Learning of Deep Networks from Decentralized DataCodyWildFederated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server.
Historically...
https://shortscience.org/paper?bibtexKey=MahMoo16Communication#decodyng
https://shortscience.org/paper?bibtexKey=MahMoo16Communication#decodyngMon, 15 Nov 2021 07:19:12 +00002110.15349journals/corr/2110.153494Learning to Ground Multi-Agent Communication with AutoencodersCodyWildIn certain classes of multi-agent cooperation games, it's useful for agents to be able to coordinate on future actions, which is an obvious use case for having a communication channel between the two players. However, prior work in multi-agent RL has shown that it's surprisingly hard to train agents that (1) consistently learn to use a communication channel in a way that is informative rather than random, and (2) if they do use communication, can come to a common grounding on the meaning of symb...
https://shortscience.org/paper?bibtexKey=journals/corr/2110.15349#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2110.15349#decodyngSat, 13 Nov 2021 06:50:28 +00002104.11178journals/corr/2104.111784VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and TextCodyWildThis strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model.
The basic premise is:
- Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, a similar strategy but with waveform patches. For text, the normal per-token embeddin...
https://shortscience.org/paper?bibtexKey=journals/corr/2104.11178#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2104.11178#decodyngFri, 12 Nov 2021 06:26:48 +00001801.04381journals/corr/1801.043814Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and SegmentationCodyWildThis work expands on prior techniques for designing models that can both be stored using fewer parameters, and also execute using fewer operations and less memory, both of which are key desiderata for having trained machine learning models be usable on phones and other personal devices.
The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of Depthwise and Pointwise convolutions, which separate the procedures of "pull information fr...
https://shortscience.org/paper?bibtexKey=journals/corr/1801.04381#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/1801.04381#decodyngThu, 11 Nov 2021 06:30:30 +00002003.10555DBLP:journals/corr/abs-2003-105554{ELECTRA:} Pre-training Text Encoders as Discriminators Rather Than GeneratorsCodyWildI'm a little embarrassed that I'm only just now reading what seems like a fairly important paper from a year and a half ago, but, in my defense, March 2020 was not the best time for keeping up with the literature in a disciplined way.
Anyhow, musings aside: this paper proposes an alternative training procedure for large language models, which the authors claim results in models that reach strong performance more efficiently than previous BERT, XLNet, or RoBERTa baselines. As some background con...
https://shortscience.org/paper?bibtexKey=DBLP:journals/corr/abs-2003-10555#decodyng
https://shortscience.org/paper?bibtexKey=DBLP:journals/corr/abs-2003-10555#decodyngTue, 09 Nov 2021 03:53:27 +00002103.03206journals/corr/2103.032063Perceiver: General Perception with Iterative AttentionCodyWildThis new architecture out of DeepMind applies information extraction and bottlenecks to a traditional Transformer base to get a model that can theoretically apply self-attention to meaningfully larger input sizes than earlier architectures allowed.
Currently, self-attention models are quite powerful and capable, but because attention is quadratic-in-sequence-length in both time, and, often more saliently, memory, it's infeasible to use on long sequences without some modification. This...
https://shortscience.org/paper?bibtexKey=journals/corr/2103.03206#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2103.03206#decodyngSun, 07 Nov 2021 03:18:35 +00002006.03236journals/corr/abs-2006-032364Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language ProcessingCodyWildThis was an amusingly-timed paper for me to read, because just yesterday I was listening to a different paper summary where the presenter offhandedly mentioned the idea of compressing the sequence length in Transformers through subsequent layers (the way a ConvNet does pooling to a smaller spatial dimension in the course of learning), and it made me wonder why I hadn't heard much about that as an approach. And, lo, I came on this paper in my list the next day, which does exactly that.
As a ref...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-03236#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-03236#decodyngFri, 05 Nov 2021 05:30:06 +00002011.12948journals/corr/2011.129483Nerfies: Deformable Neural Radiance FieldsCodyWildThis summary builds substantially on my summary of NERFs, so if you haven't yet read that, I recommend doing so first!
The idea of a NERF is to learn a neural network that represents a 3D scene, and from which you can, once the model is trained, sample an image of that scene from any desired angle. This involves structuring your neural network as a function that predicts the RGB color and density/opacity for a given point in 3D space (x, y, z), from a given viewing angle (theta, phi). With such a...
https://shortscience.org/paper?bibtexKey=journals/corr/2011.12948#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2011.12948#decodyngThu, 04 Nov 2021 03:08:28 +00002003.08934mildenhall2020representing4NeRF: Representing Scenes as Neural Radiance Fields for View SynthesisCodyWildThis summary builds extensively on my prior summary of SIRENs, so if you haven't read that summary or the underlying paper yet, I'd recommend doing that first!
At a high level, the idea of SIRENs is to use a neural network to learn a compressed, continuous representation of an image, where the neural network encodes a mapping from (x, y) to the pixel value at that location, and the image can be reconstructed (or, potentially, expanded in size) by sampling from that function across the full ran...
https://shortscience.org/paper?bibtexKey=mildenhall2020representing#decodyng
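The "image as a function from (x, y) to pixel value" idea recapped above can be sketched as follows. This is a toy under stated assumptions rather than a trained SIREN: `sine_network` uses made-up fixed weights in place of learned ones, the frequency scale `omega` and the [-1, 1] coordinate range are assumptions, and `render` is a hypothetical helper name.

```python
import math

def sine_network(x, y, omega=30.0):
    """Toy stand-in for a trained network: (x, y) in [-1, 1] -> value in [0, 1].

    The weights are arbitrary constants; a real model would learn them.
    """
    hidden = [
        math.sin(omega * (0.3 * x + 0.1 * y)),
        math.sin(omega * (-0.2 * x + 0.4 * y)),
    ]
    out = 0.5 * hidden[0] + 0.5 * hidden[1]  # lies in [-1, 1]
    return 0.5 * (out + 1.0)                 # rescaled to [0, 1]

def render(f, width, height):
    """Sample a continuous image function on a width x height grid."""
    image = []
    for row in range(height):
        y = -1.0 + 2.0 * row / (height - 1) if height > 1 else 0.0
        line = []
        for col in range(width):
            x = -1.0 + 2.0 * col / (width - 1) if width > 1 else 0.0
            line.append(f(x, y))
        image.append(line)
    return image
```

Because the representation is continuous, the same function can be sampled at 4x3 or 16x12 without retraining, which is the "expanded in size" property the summary mentions.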
https://shortscience.org/paper?bibtexKey=mildenhall2020representing#decodyngWed, 03 Nov 2021 01:19:18 +00002006.09661sitzmann2020implicit3Implicit Neural Representations with Periodic Activation FunctionsCodyWild[First off, full credit that this summary is essentially a distilled-for-my-own-understanding compression of Yannic Kilcher's excellent video on the topic]
I'm interested in learning more about Neural Radiance Fields (or NERFs), a recent technique for learning a representation of a scene that lets you generate multiple views from it, and a paper referenced as a useful prerequisite for that technique was SIRENs, or Sinusoidal Representation Networks. In my view, the most complex part of unders...
https://shortscience.org/paper?bibtexKey=sitzmann2020implicit#decodyng
https://shortscience.org/paper?bibtexKey=sitzmann2020implicit#decodyngTue, 02 Nov 2021 06:55:52 +00002110.15149journals/corr/2110.151492Diversity-Driven Combination for Grammatical Error CorrectionLeshem ChoshenModel combination/ensembling:
Average ensembling is practical - but naive.
Combine considering each network's strengths, much better!
Moreover, let's make the networks diverse so they will have different strengths.
Wenjuan Han & Hwee Tou Ng (no twitters?)
#enough2skim #NLProc
The basic idea is quite simple:
Given some models, why would we want the average? We want to rely on each one (or group) when it is more likely to be the correct one.
This was actually introduced in our previous work (as a...
https://shortscience.org/paper?bibtexKey=journals/corr/2110.15149#borgr
https://shortscience.org/paper?bibtexKey=journals/corr/2110.15149#borgrMon, 01 Nov 2021 06:33:12 +00002108.10763journals/corr/2108.107632ComSum: Commit Messages Summarization and Meaning PreservationLeshem ChoshenHuge 𝙘𝙤𝙢𝙢𝙞𝙩 𝙨𝙪𝙢𝙢𝙖𝙧𝙞𝙯𝙖𝙩𝙞𝙤𝙣 dataset
The dataset cleans tons of open source projects to have only ones with high quality committing habits
(e.g. large active projects with commits that are of significant length etc.)
We present some ways to evaluate that the meaning was kept while summarizing, so you can go beyond ROUGE
We provide a strict split that keeps some (thousand+-) repositories totally out of the training, so you can check in domai...
https://shortscience.org/paper?bibtexKey=journals/corr/2108.10763#borgr
https://shortscience.org/paper?bibtexKey=journals/corr/2108.10763#borgrSun, 24 Oct 2021 09:32:17 +00002102.09475journals/corr/2102.094753Gifsplanation via Latent Shift: A Simple Autoencoder Approach to Counterfactual Generation for Chest X-raysJoseph Paul Cohen**Background:** The goal of this work is to indicate image features which are relevant to the prediction of a neural network and convey that information to the user by displaying a counterfactual image animation.
**The Latent Shift Method:** This method works on any pretrained encoder/decoder and classifier which is differentiable. No special considerations are needed during model training. With this approach they want the exact opposite of an adversarial attack but it is using the same idea. T...
https://shortscience.org/paper?bibtexKey=journals/corr/2102.09475#joecohen
https://shortscience.org/paper?bibtexKey=journals/corr/2102.09475#joecohenFri, 02 Jul 2021 17:19:32 +0000journals/prl/BailoRJPBK182Efficient adaptive non-maximal suppression algorithms for homogeneous spatial keypoint distributionOleksandr BailoKeypoint detection is an important step in various tasks such as SLAM, panorama stitching, camera calibration, and more. Efficient keypoint detectors, FAST (Features from Accelerated Segment Test) for example, detect keypoints where a relatively high brightness change is observed in relation to surrounding pixels. Most probably, the keypoints would be located on edges, as shown below:
Let's consider another image shown below. Here, while the detector is capable of detecting many keyp...
https://shortscience.org/paper?bibtexKey=journals/prl/BailoRJPBK18#ukrdailo
https://shortscience.org/paper?bibtexKey=journals/prl/BailoRJPBK18#ukrdailoSun, 07 Feb 2021 10:58:53 +000010.1038/s41586-019-1923-72Improved protein structure prediction using potentials from deep learningCodyWildIn January of this year (2020), DeepMind released a model called AlphaFold, which uses convolutional networks atop sequence-based and evolutionary features to predict protein folding structure. In particular, their model was designed to predict a distribution for how far away each pair of amino acids will be from one another in the final folded structure. Given such a trained model, you can score a candidate structure according to how likely it is under the model, and - if your process for gener...
https://shortscience.org/paper?bibtexKey=10.1038/s41586-019-1923-7#decodyng
https://shortscience.org/paper?bibtexKey=10.1038/s41586-019-1923-7#decodyngTue, 01 Dec 2020 02:28:52 +00002007.12223journals/corr/abs-2007-122233The Lottery Ticket Hypothesis for Pre-trained BERT NetworksCodyWildThis is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticket-esque properties with respect to their role as initializations for downstream transfer tasks.
As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks could be pruned to remove low-magnitude weights (according to a particular iterative pruning strate...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-12223#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-12223#decodyngMon, 30 Nov 2020 01:54:47 +00001905.10295journals/corr/abs-1905-102952Learning to learn via Self-CritiqueMikhail Meskhi### Key points
- Instead of just focusing on supervised learning, a self-critique and adapt network provides an unsupervised learning approach to improving overall generalization. It does this via transductive learning, by learning a label-free loss function from the validation set to improve the base model.
- The SCA framework helps a learning algorithm be more robust by learning more relevant features and improving during the training phase.
### Ideas
1. Combine deep learning models with SC...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1905-10295#michaelmmeskhi
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1905-10295#michaelmmeskhiSat, 28 Nov 2020 21:58:53 +00002006.07589journals/corr/abs-2006-075892Adversarial Self-Supervised Contrastive LearningCodyWildThis is a nice, compact paper testing a straightforward idea: can we use the contrastive loss structure so widespread in unsupervised learning as a framework for generating and training against adversarial examples? In the context of the adversarial examples literature, adversarial training - or, training against examples that were adversarially generated so as to maximize the loss of the model you're training - is the primary strategy used to train robust models (robust here in the sense of not be...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-07589#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-07589#decodyngSat, 28 Nov 2020 21:00:26 +00002007.00224journals/corr/2007.002242Debiased Contrastive LearningCodyWildThe premise of contrastive loss is that we want to push together the representations of objects that are similar, and push dissimilar representations farther apart. However, in an unlabeled setting, we don't generally have class labels to tell which images (or objects in general) are supposed to be similar or dissimilar along the axes that matter to us, so we use the shortcut of defining some transformation on a given anchor frame that gets us a frame we're confident is related enough to that an...
https://shortscience.org/paper?bibtexKey=journals/corr/2007.00224#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2007.00224#decodyngFri, 27 Nov 2020 21:00:39 +00002007.02835journals/corr/abs-2007-028353GROVER: Self-supervised Message Passing Transformer on Large-scale Molecular DataCodyWildLarge-scale transformers on unsupervised text data have been wildly successful in recent years; arguably, the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) large-scale unlabeled data to the problem of learning informative embeddings of molecular graphs.
Labeli...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-02835#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-02835#decodyngThu, 26 Nov 2020 20:44:45 +00002004.02860journals/corr/abs-2004-028602Weakly-Supervised Reinforcement Learning for Controllable BehaviorCodyWildI tried my best, but I'm really confused by the central methodology of this paper. Here are the things I do understand:
1. The goal of the method is to learn disentangled representations, and, specifically, to learn representations that correspond to factors of variation in the environment that are selected by humans. That means, we ask humans whether a given image is higher or lower on a particular relevant axis, and aggregate those rankings into a vector, where a particular index of the vect...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2004-02860#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2004-02860#decodyngThu, 26 Nov 2020 04:48:23 +00002002.11328yang2020rethinking2Rethinking Bias-Variance Trade-off for Generalization of Neural NetworksCodyWildThis is a really cool paper that posits a relatively simple explanation for the strange phenomenon known as double descent - both the fact of seeing it in the first place, and the difficulty in robustly causing it to appear. In the classical wisdom of statistics, increasing model complexity too far will lead to an increase in variance, and thus an increase in test error (or "test risk" or "empirical risk"), leading to a U-shaped test error curve as a function of model complexity. Double descent is t...
https://shortscience.org/paper?bibtexKey=yang2020rethinking#decodyng
https://shortscience.org/paper?bibtexKey=yang2020rethinking#decodyngTue, 24 Nov 2020 05:26:23 +00002006.15134journals/corr/2006.151343Critic Regularized RegressionCodyWildOffline reinforcement learning is a potentially high-value thing for the machine learning community to learn to do well, because there are many applications where it'd be useful to generate a learnt policy for responding to a dynamic environment, but where it'd be too unsafe or expensive to learn in an on-policy or online way, where we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data - collected from a hum...
https://shortscience.org/paper?bibtexKey=journals/corr/2006.15134#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2006.15134#decodyngMon, 23 Nov 2020 05:52:49 +00002006.06936journals/corr/abs-2006-069364Does Unsupervised Architecture Representation Learning Help Neural Architecture Search?CodyWildThis paper is ultimately relatively straightforward, for all that it's embedded in the somewhat new-to-me literature around graph-based Neural Architecture Search - the problem of iterating through options to find a graph representing an optimized architecture. The authors want to understand whether in this problem, as in many others in deep learning, we can benefit from building our supervised models off of representations learned during an unsupervised pretraining step. In this case, the unsup...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-06936#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-06936#decodyngSun, 22 Nov 2020 02:10:17 +00002006.12433journals/corr/2006.124333What shapes feature representations? Exploring datasets, architectures, and trainingCodyWildThis is a nice little empirical paper that does some investigation into which features get learned during the course of neural network training. To look at this, it uses a notion of "decodability", defined as the accuracy to which you can train a linear model to predict a given conceptual feature on top of the activations/learned features at a particular layer. This idea captures the amount of information about a conceptual feature that can be extracted from a given set of activations.
They wo...
https://shortscience.org/paper?bibtexKey=journals/corr/2006.12433#decodyng
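The "decodability" measure defined above (how accurately a linear model can predict a feature from frozen activations) can be sketched with a small logistic-regression probe. This is a minimal standard-library sketch under stated assumptions: `train_linear_probe` and `decodability` are hypothetical names, the labels are assumed binary, and the learning rate and step count are arbitrary choices rather than anything from the paper.

```python
import math

def train_linear_probe(activations, labels, lr=0.5, steps=2000):
    """Fit logistic regression on frozen activations with batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decodability(activations, labels, w, b):
    """Accuracy of the linear probe: the fraction of examples it classifies correctly."""
    correct = 0
    for x, y in zip(activations, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)
```

In practice the accuracy would be measured on held-out examples, so that the score reflects information present in the activations and not the probe memorizing its training set.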
https://shortscience.org/paper?bibtexKey=journals/corr/2006.12433#decodyngSat, 21 Nov 2020 04:57:58 +00002007.01293ren2020unlabeled3Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised LearningCodyWildThis paper argues that, in semi-supervised learning, it's suboptimal to use the same weight for all examples (as happens implicitly, when the unsupervised component of the loss for each example is just added together directly). Instead, it tries to learn weights for each specific data example, through a meta-learning-esque process.
The form of semi-supervised learning being discussed here is label-based consistency loss, where a labeled image is augmented and run through the current version of ...
https://shortscience.org/paper?bibtexKey=ren2020unlabeled#decodyng
https://shortscience.org/paper?bibtexKey=ren2020unlabeled#decodyngFri, 20 Nov 2020 04:05:54 +00002007.14062journals/corr/abs-2007-140623Big Bird: Transformers for Longer SequencesCodyWildTransformers - powered by self-attention mechanisms - have been a paradigm shift in NLP, and are now the standard choice for training large language models. However, while transformers do have many benefits in terms of computational constraints - most saliently, that attention between tokens can be computed in parallel, rather than needing to be evaluated sequentially as in an RNN - a major downside is their memory (and, secondarily, computational) requirements. The baseline form of self-attent...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-14062#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-14062#decodyngThu, 19 Nov 2020 02:32:44 +00002006.07710journals/corr/abs-2006-077103The Pitfalls of Simplicity Bias in Neural NetworksCodyWildThis is an interesting paper that makes a fairly radical claim, and I haven't fully decided whether what they find is an interesting-but-rare corner case, or a more fundamental weakness in the design of neural nets. The claim is: neural nets prefer learning simple features, even if there exist complex features that are equally or more predictive, and even if that means learning a classifier with a smaller margin - where margin means "the distance between the decision boundary and the nearest-by ...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-07710#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-07710#decodyngSun, 15 Nov 2020 22:46:11 +00002010.11924journals/corr/abs-2010-119242In Search of Robust Measures of GeneralizationCodyWildGeneralization is, if not the central, then at least one of the central mysteries of deep learning. We are somehow able to train high-capacity, overparametrized models, that empirically have the capacity to fit to random data - meaning that they have the capacity to memorize the labeled data we give them - and which yet still manage to train functions that generalize to test data. People have tried to come up with generalization bounds - that is, bounds on the expected test error of a mo...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-11924#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-11924#decodyngSat, 14 Nov 2020 22:31:16 +00002006.06882journals/corr/abs-2006-068823Rethinking Pre-training and Self-trainingCodyWild Occasionally, I come across results in machine learning that I'm glad exist, even if I don't fully understand them, precisely because they remind me how little we know about the complicated information architectures we're building, and what kinds of signal they can productively use. This is one such result.
The paper tests a method called self-training, and compares it against the more common standard of pre-training. Pre-training works by first training your model on a different dataset, in ...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-06882#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-06882#decodyngSat, 14 Nov 2020 05:00:22 +00002010.02302journals/corr/abs-2010-023022Latent World Models For Intrinsically Motivated ExplorationCodyWildThe thing I think is happening here:
It proposes a self-supervised learning scheme (which...seems fairly basic, but okay) to generate encodings. It then trains a Latent World Model, which takes in the current state encoding, the action, and the belief state (I think just the prior RNN state?) and predicts a next state. The intrinsic reward is the difference between this and the actual encoding of the next step. (This is dependent on a particular action and resulting next obs, it seems). I don'...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-02302#decodyng
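The intrinsic reward described in the Latent World Models summary above (the gap between the predicted and actual next-state encodings) can be sketched as follows. This is a rough sketch under stated assumptions: the "difference" is taken to be a squared Euclidean distance, the world model is a hypothetical linear map, and the belief state mentioned in the summary is left out for brevity.

```python
def predict_next_latent(latent, action, weights):
    """Hypothetical linear world model: next latent = weights @ [latent; action]."""
    inputs = list(latent) + list(action)
    return [sum(w * v for w, v in zip(row, inputs)) for row in weights]

def intrinsic_reward(predicted_next, actual_next):
    """Assumed form of the reward: squared error between the two encodings."""
    return sum((p - a) ** 2 for p, a in zip(predicted_next, actual_next))
```

Transitions the world model already predicts well earn close to zero reward, so the agent is pushed toward states whose encodings it cannot yet anticipate.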
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-02302#decodyngThu, 12 Nov 2020 05:26:18 +00001911.09071journals/corr/abs-1911-090713Exploring the Origins and Prevalence of Texture Bias in Convolutional Neural NetworksCodyWildWhen humans classify images, we tend to use high-level information about the shape and position of the object. However, when convolutional neural networks classify images, they tend to use low-level, or textural, information more than high-level shape information. This paper tries to understand what factors lead to higher shape bias or texture bias.
To investigate this, the authors look at three datasets with disagreeing shape and texture labels. The first is GST, or Geirhos Style Transfer. I...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1911-09071#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1911-09071#decodyngWed, 11 Nov 2020 07:08:22 +00002008.11687journals/corr/abs-2008-116873What is being transferred in transfer learning?CodyWildThis is an interesting - and refreshing - paper, in that, instead of trying to go all-in on a particular theoretical point, the authors instead run a battery of empirical investigations, all centered around the question of how to explain what happens to make transfer learning work. The experiments don't all line up to support a single point, but they do illustrate different interesting facets of the transfer process.
- An initial experiment tries to understand how much of the performance of fi...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2008-11687#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2008-11687#decodyngTue, 10 Nov 2020 06:58:27 +00002010.12050journals/corr/abs-2010-120503Contrastive Learning with Adversarial ExamplesCodyWildContrastive learning works by performing augmentations on a batch of images, and training a network to match the representations of the two augmented parts of a pair together, and push the representations of images not in a pair farther apart. Historically, these algorithms have benefitted from using stronger augmentations, which has the effect of making the two positive elements in a pair more visually distinct from one another. This paper tries to build on that success, and, beyond just using ...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-12050#decodyng
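The contrastive objective described above (pull the two augmented views of an image together, push views of other images apart) can be sketched as an InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the loss from this specific paper: `contrastive_loss` is a hypothetical name, similarity is assumed to be cosine, the temperature value is arbitrary, and only one direction (view A against all of view B) is scored.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(views_a, views_b, temperature=0.5):
    """Mean cross-entropy of picking the true partner views_b[i] for views_a[i].

    views_a[i] and views_b[i] are representations of two augmentations of
    image i; every views_b[j] with j != i is treated as a negative.
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / n
```

The loss is low when each representation is most similar to its own pair and high when it is more similar to another image in the batch, which is why harder positives (stronger augmentations or adversarial perturbations) give the network more to learn from.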
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2010-12050#decodyngMon, 09 Nov 2020 02:03:47 +00002004.11362journals/corr/2004.113623Supervised Contrastive LearningCodyWildThis was a really cool-to-me paper that asked whether contrastive losses, of the kind that have found widespread success in semi-supervised domains, can add value in a supervised setting as well. In a semi-supervised context, contrastive loss works by pushing together the representations of an "anchor" data example with an augmented version of itself (which is taken as a positive or target, because the image is understood to not be substantively changed by being augmented), and pushing the repre...
https://shortscience.org/paper?bibtexKey=journals/corr/2004.11362#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2004.11362#decodyngSat, 07 Nov 2020 23:30:17 +00002006.10455journals/corr/abs-2006-104552What Do Neural Networks Learn When Trained With Random Labels?CodyWildThis is another paper that was a bit of a personal-growth test for me to try to parse, since it's definitely heavier on analytical theory than I'm used to, but I think I've been able to get something from it, even though I'll be the first to say I didn't understand it entirely.
The question of this paper is: why does it seem to be the case that training a neural network on a data distribution - but with your supervised labels randomly sampled - seems to afford some level of advantage when fine...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-10455#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-10455#decodyngSat, 07 Nov 2020 00:15:03 +00002007.13916journals/corr/abs-2007-139163Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset BiasesCodyWildIn the past year or so, contrastive learning has experienced widespread success, and has risen to be a dominant problem framing within self-supervised learning. The basic idea of contrastive learning is that, instead of needing human-generated labels to generate a supervised task, you instead assume that there exists some automated operation you can perform to a data element to generate another data element that, while different, should be considered still fundamentally the same, or at least mor...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-13916#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2007-13916#decodyngFri, 06 Nov 2020 04:39:42 +00002002.00632journals/corr/abs-2002-006323Effective Diversity in Population-Based Reinforcement LearningCodyWildA central problem in the domain of reinforcement learning is how to incentivize exploration and diversity of experience, since RL agents can typically only learn from states they go to, and it can often be the case that states with high reward don't have an obvious trail of high-reward states leading to them, meaning that algorithms that are naively optimizing for reward will be relatively unlikely to discover them. One potential way to promote exploration is to train an ensemble of agents, and ...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2002-00632#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2002-00632#decodyngWed, 04 Nov 2020 00:44:40 +00002007.08794journals/corr/2007.087943Discovering Reinforcement Learning AlgorithmsCodyWildThis work attempts to use meta-learning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model and outputs two values - a scalar and a vector - that are used to update the agent's model. I'm not going to go too deep into meta-learning here, but, at a high level, meta learning methods optimize parameters governing an agent's le...
https://shortscience.org/paper?bibtexKey=journals/corr/2007.08794#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2007.08794#decodyngTue, 03 Nov 2020 05:29:13 +00002006.04635journals/corr/abs-2006-046353Learning to Play No-Press Diplomacy with Best Response Policy IterationCodyWildThis paper focuses on an effort by a DeepMind team to train an agent that can play the game Diplomacy - a complex, multiplayer game where players play as countries controlling units, trying to take over the map of Europe. Some relevant factors of this game, for the purposes of this paper, are:
1) All players move at the same time, which means you need to model your opponent's current move, and play a move that succeeds in expectation over that predicted move distribution. This also means that,...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-04635#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2006-04635#decodyngMon, 02 Nov 2020 06:15:17 +000010.1101/2020.02.07.9388523Tumor Phylogeny Topology Inference via Deep LearningGavin GrayA very simple (but impractical) discrete model of subclonal evolution would include the following events:
* Division of a cell to create two cells:
* **Mutation** at a location in the genome of the new cells
* Cell death at a new timestep
* Cell survival at a new timestep
Because measurements of mutations are usually taken at one time point, this is taken to be at the end of a time series of these events, where a tiny subset of cells are observed and a **genotype matrix** $A$ is produce...
https://shortscience.org/paper?bibtexKey=10.1101/2020.02.07.938852#gngdb
https://shortscience.org/paper?bibtexKey=10.1101/2020.02.07.938852#gngdbWed, 16 Sep 2020 15:59:52 +00001805.08296journals/corr/1805.082962Data-Efficient Hierarchical Reinforcement LearningFelipe Martins# Keypoints
- Proposes the HIerarchical Reinforcement learning with Off-policy correction (**HIRO**) algorithm.
- Does not require careful task-specific design.
- Generic goal representation to make it broadly applicable, without any manual design of goal spaces, primitives, or controllable dimensions.
- Use of off-policy experience using a novel off-policy correction.
- A two-level hierarchy architecture
- A higher-level controller outputs a goal for the lower-level controller every **c** ti...
https://shortscience.org/paper?bibtexKey=journals/corr/1805.08296#felipemartins
https://shortscience.org/paper?bibtexKey=journals/corr/1805.08296#felipemartinsTue, 01 Sep 2020 00:38:54 +000010.1109/isbi45749.2020.90986862Bayesian Skip-Autoencoders for Unsupervised Hyperintense Anomaly Detection in High Resolution Brain MriFriedrich-Maximilian WeberlingThe reconstruction of high-fidelity resolution brain MR images is especially challenging because of the highly complex brain structure. Most promising approaches for this task are autoencoders and generative models such as Variational Autoencoders (VAE) or Generative Adversarial Networks (GAN). In Unsupervised Anomaly Detection (UAD), these architectures are only trained with images of healthy brain anatomy and not with images containing anomalies such as lesions. Therefore, processing an anomal...
https://shortscience.org/paper?bibtexKey=10.1109/isbi45749.2020.9098686#fweberling1995
https://shortscience.org/paper?bibtexKey=10.1109/isbi45749.2020.9098686#fweberling1995Mon, 31 Aug 2020 09:18:08 +00001809.01999journals/corr/1809.019992Recurrent World Models Facilitate Policy EvolutionPaul Barde## General Framework
The take-home message is that the challenge of Reinforcement Learning for environments with high-dimensional and partial observations is learning a good representation of the environment. This means learning a sensory features extractor V to deal with the highly dimensional observation (pixels for example). But also learning a temporal representation M of the environment dynamics to deal with the partial observability. If provided with such representations, learning a contr...
https://shortscience.org/paper?bibtexKey=journals/corr/1809.01999#muntermulehitch
https://shortscience.org/paper?bibtexKey=journals/corr/1809.01999#muntermulehitchMon, 27 Jul 2020 13:05:14 +00001907.03976journals/corr/1907.039763Better-than-Demonstrator Imitation Learning via Automatically-Ranked DemonstrationsPaul Barde## General Framework
Extends T-REX (see [summary]()) so that preferences (rankings) over demonstrations are generated automatically (back to the common IL/IRL setting where we only have access to a set of unlabeled demonstrations). Also derives some theoretical requirements and guarantees for better-than-demonstrator performance.
## Motivations
* Preferences over demonstrations may be difficult to obtain in practice.
* There is no theoretical understanding of the requirements that lead to out...
https://shortscience.org/paper?bibtexKey=journals/corr/1907.03976#muntermulehitch
https://shortscience.org/paper?bibtexKey=journals/corr/1907.03976#muntermulehitchMon, 27 Jul 2020 02:22:27 +00001904.06387journals/corr/1904.063872Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from ObservationsPaul Barde## General Framework
Only access to a finite set of **ranked demonstrations**. The demonstrations only contain **observations** and **do not need to be optimal** but must be (approximately) ranked from worst to best.
The **reward learning part is off-line** but not the policy learning part (requires interactions with the environment).
In a nutshell: learns a reward model that looks at observations. The reward model is trained to predict if a demonstration's ranking is greater than another on...
https://shortscience.org/paper?bibtexKey=journals/corr/1904.06387#muntermulehitch
https://shortscience.org/paper?bibtexKey=journals/corr/1904.06387#muntermulehitchMon, 27 Jul 2020 02:18:47 +000010.15607/rss.2016.xii.0292Planning for Autonomous Cars that Leverage Effects on Human ActionsPaul Barde## General Framework
*wording: car = the autonomous car, driver = the other car it is interacting with*
Builds a model of an **autonomous car's influence over the behavior of an interacting driver** (human or simulated) that the autonomous car can leverage to plan more efficiently. The driver is modeled by the policy that maximizes his defined objective. In brief, a **linear reward function is learned off-line with IRL on human demonstrations** and the modeled policy takes the actions that max...
https://shortscience.org/paper?bibtexKey=10.15607/rss.2016.xii.029#muntermulehitch
https://shortscience.org/paper?bibtexKey=10.15607/rss.2016.xii.029#muntermulehitchMon, 27 Jul 2020 02:14:17 +00001406.5979journals/corr/1406.59792Reinforcement and Imitation Learning via Interactive No-Regret LearningPaul Barde## General Framework
Really **similar to DAgger** (see [summary]()) but considers **cost-sensitive classification** ("some mistakes are worse than others": you should be more careful in imitating that particular action of the expert if failing to do so incurs a large cost-to-go). By doing so they improve from DAgger's bound of $\epsilon_{class}uT$ where $u$ is the difference in cost-to-go (between the expert and one error followed by expert policy) to $\epsilon_{class}T$ where $\epsilon_{cla...
https://shortscience.org/paper?bibtexKey=journals/corr/1406.5979#muntermulehitch
https://shortscience.org/paper?bibtexKey=journals/corr/1406.5979#muntermulehitchMon, 27 Jul 2020 02:08:30 +00001011.0686journals/corr/1011.06862A Reduction of Imitation Learning and Structured Prediction to No-Regret Online LearningPaul Barde## General Framework
The imitation learning problem is here cast as a classification problem: label the state with the corresponding expert action. With this, you can see structured prediction (predict next label knowing your previous prediction) as a degenerate IL problem. They make the **reduction assumption** that you can make the probability of mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in the cost-to-g...
https://shortscience.org/paper?bibtexKey=journals/corr/1011.0686#muntermulehitch
https://shortscience.org/paper?bibtexKey=journals/corr/1011.0686#muntermulehitchMon, 27 Jul 2020 01:53:35 +00001611.03530journals/corr/1611.035302Understanding deep learning requires rethinking generalizationANIRUDH NJ## Summary
The broad goal of this paper is to understand how a neural network learns the underlying distribution of the input data and the properties of the network that describe its generalization power.
Previous literature tries to use statistical measures like Rademacher complexity, uniform stability and VC dimension to explain the generalization error of the model. These methods explain generalization in terms of the number of parameters in the model along with the applied regularizat...
https://shortscience.org/paper?bibtexKey=journals/corr/1611.03530#anirudhnj
https://shortscience.org/paper?bibtexKey=journals/corr/1611.03530#anirudhnjFri, 26 Jun 2020 15:33:03 +0000journals/af/Maymin112Markets are efficient if and only if P = NPquaxtonIs the market efficient? This is perhaps the most prevalent question in all of finance. While this paper does not aim to answer that question, it does frame it in an information-theoretic context. Mainly, Maymin shows that at least the weak form of the efficient market hypothesis (EMH) holds if and only if P = NP.
First, he defines what efficient market means:
"The weakest form of the EMH states that future prices cannot be predicted by analyzing prices from the past. Therefore, technical ana...
https://shortscience.org/paper?bibtexKey=journals/af/Maymin11#jyang772
https://shortscience.org/paper?bibtexKey=journals/af/Maymin11#jyang772Thu, 04 Jun 2020 02:53:53 +0000conf/iclr/RendaFC203Comparing Rewinding and Fine-tuning in Neural Network PruningCodyWildThis is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has bee...
https://shortscience.org/paper?bibtexKey=conf/iclr/RendaFC20#decodyng
https://shortscience.org/paper?bibtexKey=conf/iclr/RendaFC20#decodyngFri, 15 May 2020 03:18:21 +00002004.13649journals/corr/2004.136492Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from PixelsCodyWildOne of the most notable flaws of modern model-free reinforcement learning is its sample inefficiency; where humans can learn a new task with relatively few examples, models that learn policies or value functions directly from raw data need huge amounts of data to train properly. Because the model isn't given any semantic features, it has to learn a meaningful representation from raw pixels using only the (often sparse, often noisy) signal of reward. Some past approaches have tried learning repres...
https://shortscience.org/paper?bibtexKey=journals/corr/2004.13649#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2004.13649#decodyngSun, 10 May 2020 05:46:18 +00001903.11981journals/corr/abs-1903-119813Regularizing Trajectory Optimization with Denoising AutoencodersRobert MüllerThe typical model-based reinforcement learning (RL) loop consists of collecting data, training a model of the environment, and using the model to do model predictive control (MPC). If, however, the model is wrong, for example for state-action pairs that have been barely visited, the dynamics model might be very wrong and the MPC fails as the imagined model and the reality no longer align. Boney et al. propose to tackle this with a denoising autoencoder for trajectory regularization according to the fam...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1903-11981#robertmueller
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1903-11981#robertmuellerThu, 07 May 2020 08:08:00 +00001912.05500journals/corr/abs-1912-055002What Can Learned Intrinsic Rewards Capture?CodyWildThis paper out of DeepMind is an interesting synthesis of ideas out of the research areas of meta learning and intrinsic rewards. The hope for intrinsic reward structures in reinforcement learning - things like uncertainty reduction or curiosity - is that they can incentivize behavior like information-gathering and exploration, which aren't incentivized by the explicit reward in the short run, but which can lead to higher total reward in the long run. So far, intrinsic rewards have mostly been ...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1912-05500#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1912-05500#decodyngTue, 05 May 2020 06:22:03 +0000conf/icml/FinnAL172Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksAndrea Walter Ruggerini## TL;DR
The paper presents a model-agnostic strategy to perform few-shot learning taking advantage of prior knowledge acquired during multitask learning. Such prior knowledge derives from priors acquired about generalized model parameters (e.g. weights or hyperparameters) during the Model Agnostic Meta-Learning (MAML) algorithm. The strategy can be applied to any algorithm trained with gradient descent (not only neural networks) being more general and perhaps effective than transfer learnin...
https://shortscience.org/paper?bibtexKey=conf/icml/FinnAL17#andreaw
https://shortscience.org/paper?bibtexKey=conf/icml/FinnAL17#andreawSun, 03 May 2020 14:29:05 +00002001.04451journals/corr/2001.044512Reformer: The Efficient TransformerCodyWildThe Transformer architecture - which uses a structure entirely based on key-value attention mechanisms to process sequences such as text - has taken over the worlds of language modeling and NLP in the past three years. However, Transformers at the scale used for large language models have huge computational and memory requirements.
This is largely driven by the fact that information at every step in the sequence (or, in the so-far-generated sequence during generation) is used to inform the rep...
https://shortscience.org/paper?bibtexKey=journals/corr/2001.04451#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/2001.04451#decodyngSun, 03 May 2020 05:14:23 +00001909.11655journals/corr/abs-1909-116552Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical SpaceCodyWildI found this paper a bit difficult to fully understand. Its premise, as far as I can follow, is that we may want to use genetic algorithms (GA), where we make modifications to elements in a population, and keep elements around at a rate proportional to some set of their desirable properties. In particular we might want to use this approach for constructing molecules that have properties (or predicted properties) we want. However, a downside of GA is that it's easy to end up in local minima, where...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1909-11655#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1909-11655#decodyngFri, 01 May 2020 05:38:46 +0000conf/nips/KumarFSTL193Stabilizing Off-Policy Q-Learning via Bootstrapping Error ReductionRobert MüllerKumar et al. propose an algorithm to learn in batch reinforcement learning (RL), a setting where an agent learns purely from a fixed batch of data, $B$, without any interactions with the environment. The data in the batch is collected according to a batch policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which constrains the newly learned policy to pl...
https://shortscience.org/paper?bibtexKey=conf/nips/KumarFSTL19#robertmueller
https://shortscience.org/paper?bibtexKey=conf/nips/KumarFSTL19#robertmuellerThu, 30 Apr 2020 13:31:29 +000010.1101/2020.03.03.9721332AI-aided design of novel targeted covalent inhibitors against SARS-CoV-2CodyWildThis preprint is a bit rambling, and I don't know that I fully followed what it was doing, but here's my best guess:
- We think it's probably the case that SARS-COV2 (COVID19) uses a protease (enzyme involved in its reproduction) that isn't available and co-optable in the human body, and is also quite similar to the comparable protease protein in the original SARS virus. Therefore, it is hoped that we might be able to take inhibitors that bind to SARS, and modify them in small ways to make t...
https://shortscience.org/paper?bibtexKey=10.1101/2020.03.03.972133#decodyng
https://shortscience.org/paper?bibtexKey=10.1101/2020.03.03.972133#decodyngThu, 30 Apr 2020 04:36:33 +00002003.03123journals/corr/abs-2003-031232Directional Message Passing for Molecular GraphsCodyWildThis paper, presented this week at ICLR 2020, builds on existing applications of message-passing Graph Neural Networks (GNN) for molecular modeling (specifically: for predicting quantum properties of molecules), and extends them by introducing a way to represent angles between atoms, rather than just distances between them, as current methods are limited to.
The basic version of GNNs on molecule data works by creating features attached to atoms at each level (starting at level 0 with the eleme...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2003-03123#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-2003-03123#decodyngWed, 29 Apr 2020 03:42:52 +00001911.11361journals/corr/abs-1911-113613Behavior Regularized Offline Reinforcement LearningRobert MüllerWu et al. provide a framework (behavior regularized actor critic (BRAC)) which they use to empirically study the impact of different design choices in batch reinforcement learning (RL). Specific instantiations of the framework include BCQ, KL-Control and BEAR.
Pure off-policy rl describes the problem of learning a policy purely from a batch $B$ of one step transitions collected with a behavior policy $\pi_b$. The setting allows for no further interactions with the environment. This learning re...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1911-11361#robertmueller
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1911-11361#robertmuellerMon, 27 Apr 2020 13:02:23 +00001908.06760journals/corr/abs-1908-067602Self-Attention Based Molecule Representation for Predicting Drug-Target InteractionCodyWildIn the last three years, Transformers, or models based entirely on attention for aggregating information from across multiple places in a sequence, have taken over the world of NLP. In this paper, the authors propose using a Transformer to learn a molecular representation, and then building a model to predict drug/target interaction on top of that learned representation. A drug/target interaction model takes in two inputs - a protein involved in a disease pathway, and a (typically small) molecul...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1908-06760#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1908-06760#decodyngSun, 26 Apr 2020 06:39:30 +0000journals/iacr/BellareRRS092Format-Preserving EncryptionquaxtonFormat-preserving encryption is a deterministic encryption scheme that encrypts plaintext of some specified format into ciphertext of the same format. This has a lot of practical use cases such as storing SSN or credit card information, without having to change the underlying schematics of the database or application that stores the data. The protected data is indistinguishable in format from unprotected data, and still enables some analytics over it, such as with masking (i.e., displaying last four digits ...
https://shortscience.org/paper?bibtexKey=journals/iacr/BellareRRS09#jyang772
https://shortscience.org/paper?bibtexKey=journals/iacr/BellareRRS09#jyang772Thu, 23 Apr 2020 22:05:16 +0000conf/ac/Rasmussen034Gaussian Processes in Machine LearningFriedrich-Maximilian WeberlingIn this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions.
A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covari...
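As a rough illustration of how the mean and covariance functions turn into an ordinary Gaussian once a finite set of inputs is fixed, the sketch below builds a mean vector and covariance matrix and draws a few functions from the resulting prior. The zero mean, the squared-exponential kernel, and the names `mean_fn` and `rbf_kernel` are assumptions chosen for the example, not choices taken from the tutorial.

```python
import numpy as np

def mean_fn(x):
    # Assumed zero mean function m(x) = 0 (a common default).
    return np.zeros(len(x))

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    # Assumed squared-exponential covariance function k(x, x').
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

# Evaluating m and k on a finite set of inputs gives the mean vector and
# covariance matrix of a plain multivariate Gaussian.
x = np.linspace(-3.0, 3.0, 25)
mu = mean_fn(x)
K = rbf_kernel(x, x)

# Draw three functions from the GP prior; the small jitter on the diagonal
# keeps the covariance matrix numerically positive definite.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, K + 1e-8 * np.eye(len(x)), size=3)
```

Each row of `samples` is one function from the prior evaluated at the 25 inputs; a shorter `length_scale` would give wigglier draws.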
https://shortscience.org/paper?bibtexKey=conf/ac/Rasmussen03#fweberling1995
https://shortscience.org/paper?bibtexKey=conf/ac/Rasmussen03#fweberling1995Tue, 21 Apr 2020 20:05:41 +00001903.08254journals/corr/abs-1903-082543Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context VariablesRobert MüllerRakelly et al. propose a method to do off-policy meta reinforcement learning (rl). The method achieves a 20-100x improvement on sample efficiency compared to on-policy meta rl like MAML+TRPO.
The key difficulty for offline meta RL arises from the meta-learning assumption that meta-training and meta-test time match. However, at test time the policy has to explore and therefore sees on-policy data, in contrast to the off-policy data that should be used at meta-training. The key contrib...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1903-08254#robertmueller
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1903-08254#robertmuellerTue, 21 Apr 2020 08:39:21 +000010.1093/bioinformatics/bty5732Predicting protein–protein interactions through sequence-based deep learningCodyWildMost of the interesting mechanics within living things are mediated by interactions between proteins, making it important and useful to have good predictive models of whether proteins will interact with one another, for validating possible interaction graph structures.
Prior methods for this problem - which takes as its input sequence representations of two proteins, and outputs a probability of interaction - have pursued different ideas for how to combine information from the two proteins. On...
https://shortscience.org/paper?bibtexKey=10.1093/bioinformatics/bty573#decodyng
https://shortscience.org/paper?bibtexKey=10.1093/bioinformatics/bty573#decodyngTue, 21 Apr 2020 06:36:31 +00001906.05374journals/corr/1906.053743Meta-Learning via Learned LossRobert MüllerBechtle et al. propose meta learning via learned loss ($ML^3$) and derive and empirically evaluate the framework on classification, regression, model-based and model-free reinforcement learning tasks.
The problem is formalized as learning parameters $\Phi$ of a meta loss function $M_{\Phi}$ that computes loss values $L_{learned} = M_{\Phi}(y, f_{\theta}(x))$. Following the outer-inner loop meta algorithm design the learned loss $L_{learned}$ is used to update the parameters of the learner in the...
https://shortscience.org/paper?bibtexKey=journals/corr/1906.05374#robertmueller
https://shortscience.org/paper?bibtexKey=journals/corr/1906.05374#robertmuellerMon, 20 Apr 2020 16:28:20 +00001802.04364journals/corr/abs-1802-043642Junction Tree Variational Autoencoder for Molecular Graph GenerationCodyWildPrior to this paper, most methods that used machine learning to generate molecular blueprints did so using SMILES representations - a string format with characters representing different atoms and bond types. This preference came about because ML had existing methods for generating strings that could be built on for generating SMILES (a particular syntax of string). However, an arguably more accurate and fundamental way of representing molecules is as graphs (with atoms as nodes and bonds as edg...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1802-04364#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1802-04364#decodyngMon, 20 Apr 2020 04:48:28 +00001705.10843journals/corr/GuimaraesSFA172Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation ModelsCodyWildThis paper's proposed method, the cleverly named ORGAN, combines techniques from GANs and reinforcement learning to generate candidate molecular sequences that incentivize desirable properties while still remaining plausibly on-distribution.
Prior papers I've read on molecular generation have by and large used approaches based in maximum likelihood estimation (MLE) - where you construct some distribution over molecular representations, and maximize the probability of your true data under that ...
https://shortscience.org/paper?bibtexKey=journals/corr/GuimaraesSFA17#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/GuimaraesSFA17#decodyngSat, 18 Apr 2020 04:57:12 +0000journals/jcheminf/OlivecronaBEC172Molecular de-novo design through deep reinforcement learningCodyWildOver the past few days, I've been reading about different generative neural networks being tried out for molecular generation. So far this has mostly focused on latent variable space models like autoencoders, but today I shifted attention to a different approach rooted in reinforcement learning. The goal of most of these methods is 1) to build a generative model that can sample plausible molecular structures, but more saliently 2) specifically generate molecules optimized to exhibit some propert...
https://shortscience.org/paper?bibtexKey=journals/jcheminf/OlivecronaBEC17#decodyng
https://shortscience.org/paper?bibtexKey=journals/jcheminf/OlivecronaBEC17#decodyngFri, 17 Apr 2020 06:00:27 +00001908.09791journals/corr/abs-1908-097912Once for All: Train One Network and Specialize it for Efficient Deploymentameroyer**Summary**: The goal of this work is to propose a "Once-for-all" (OFA) network: a large network which is trained such that its subnetworks (subsets of the network with smaller width, convolutional kernel sizes, shallower units) are also trained towards the target task. This allows adapting the architecture to a given budget at inference time while preserving performance.
**Elastic Parameters.**
The goal is to train a large architecture that contains several well-trained subnetworks with dif...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1908-09791#ameroyer
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1908-09791#ameroyerThu, 16 Apr 2020 17:48:55 +00001610.02415journals/corr/Gomez-Bombarelli163Automatic chemical design using a data-driven continuous representation of moleculesCodyWildI'll admit that I found this paper a bit of a letdown to read, relative to expectations rooted in its high citation count, and my general excitement and interest to see how deep learning could be brought to bear on molecular design. But before a critique, let's first walk through the mechanics of how the authors' approach works.
The method proposed is basically a very straightforward Variational Auto Encoder, or VAE. It takes in a textual SMILES string representation of a molecular structure,...
https://shortscience.org/paper?bibtexKey=journals/corr/Gomez-Bombarelli16#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/Gomez-Bombarelli16#decodyngWed, 15 Apr 2020 03:11:44 +0000journals/iacr/BrakerskiV112Efficient Fully Homomorphic Encryption from (Standard) LWEquaxtonBrakerski and Vaikuntanathan introduce a fully homomorphic encryption scheme (FHE) based solely on the decisional learning with errors (LWE) security assumption, moving away from the relatively obscure mathematics of ideal lattices. They introduce relinearization and modulus switching techniques for dimensionality reduction and for removing the “squashing” step of Craig Gentry’s FHE scheme. BV11 and other similar schemes are commonly referred to as “Second generation FHE” schemes.
R...
https://shortscience.org/paper?bibtexKey=journals/iacr/BrakerskiV11#jyang772
https://shortscience.org/paper?bibtexKey=journals/iacr/BrakerskiV11#jyang772Mon, 13 Apr 2020 02:16:23 +00001704.01212journals/corr/GilmerSRVD174Neural Message Passing for Quantum ChemistryCodyWildIn the years before this paper came out in 2017, a number of different graph convolution architectures - which use weight-sharing and order-invariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph - had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing ...
https://shortscience.org/paper?bibtexKey=journals/corr/GilmerSRVD17#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/GilmerSRVD17#decodyngFri, 10 Apr 2020 06:05:16 +00001708.09259journals/corr/1708.092592Efficient Convolutional Network Learning using Parametric Log based Dual-Tree Wavelet ScatterNethanoch kremerScatterNets incorporate geometric knowledge of images to produce discriminative and invariant (translation and rotation) features, i.e. edge information. The first layers of a CNN produce the same outcome. So why not replace those first layers with an equivalent, fixed structure and let the optimizer find the best weights for the CNN with its leading edge removed?
The main motivations of the idea of replacing the first convolutional, ReLU and pooling layers of the CNN with a two-layer parametric log-b...
https://shortscience.org/paper?bibtexKey=journals/corr/1708.09259#hanochkremer
https://shortscience.org/paper?bibtexKey=journals/corr/1708.09259#hanochkremerThu, 09 Apr 2020 12:05:38 +000010.1111/j.1467-9965.1991.tb00002.x3Universal PortfoliosquaxtonCover's Universal Portfolio is an information-theoretic portfolio optimization algorithm that utilizes constantly rebalanced portfolios (CRP). A CRP is one in which the distribution of wealth among stocks in the portfolio remains the same from period to period. Universal Portfolio strictly performs rebalancing based on historical pricing, making no assumptions about the underlying distribution of the prices.
The wealth achieved by a CRP over n periods is:
$S_n(b,x^n) = \displaystyle \prod_{n}...
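As a minimal sketch of the CRP wealth computation, assuming the standard definition in which each period multiplies wealth by the portfolio-weighted sum of that period's price relatives. The function name `crp_wealth` and the toy price relatives are illustrative and not taken from the paper.

```python
from math import prod

def crp_wealth(b, price_relatives):
    """Wealth multiple achieved by a constantly rebalanced portfolio.

    b: portfolio weights (non-negative, summing to 1).
    price_relatives: one vector per period; entry j is the ratio of
        closing to opening price of stock j in that period.
    """
    if min(b) < 0 or abs(sum(b) - 1.0) > 1e-9:
        raise ValueError("b must be a point on the probability simplex")
    # Rebalancing back to b every period makes the per-period growth
    # factor the dot product of b with that period's price relatives.
    return prod(sum(w * x for w, x in zip(b, x_i)) for x_i in price_relatives)

# Toy market: stock 1 doubles then halves, stock 2 does the opposite.
x = [[2.0, 0.5], [0.5, 2.0]] * 2       # four periods
print(crp_wealth([1.0, 0.0], x))       # 1.0: holding only stock 1 goes nowhere
print(crp_wealth([0.5, 0.5], x))       # 2.44140625: rebalancing gains 25% per period
```

The toy market shows why rebalancing matters: neither stock grows over the four periods, yet the evenly split CRP does.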
https://shortscience.org/paper?bibtexKey=10.1111/j.1467-9965.1991.tb00002.x#jyang772
https://shortscience.org/paper?bibtexKey=10.1111/j.1467-9965.1991.tb00002.x#jyang772Wed, 08 Apr 2020 23:17:22 +00001611.03199journals/corr/Altae-TranRPP162Low Data Drug Discovery with One-shot LearningCodyWildThe goal of one-shot learning tasks is to design a learning structure that can perform a new task (or, more canonically, add a new class to an existing task) using only one or a small number of examples of the new task or class. So, as an example: you'd want to be able to take one positive and one negative example of a given task and correctly classify subsequent points as either positive or negative. A common way of achieving this, and the way that the paper builds on, is to learn a parametrized f...
https://shortscience.org/paper?bibtexKey=journals/corr/Altae-TranRPP16#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/Altae-TranRPP16#decodyngWed, 08 Apr 2020 05:11:54 +00001703.00564journals/corr/WuRFGGPLP172MoleculeNet: A Benchmark for Molecular Machine LearningCodyWildThis is a paper released by the creators of the DeepChem library/framework, explaining the efforts they've put into facilitating straightforward and reproducible testing of new methods. They advocate for consistency between tests on three main axes.
1. On the most basic level, that methods evaluate on the same datasets
2. That they use canonical train/test splits
3. That they use canonical metrics.
To that end, they've integrated a framework they call "MoleculeNet" into DeepChem, containing ...
https://shortscience.org/paper?bibtexKey=journals/corr/WuRFGGPLP17#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/WuRFGGPLP17#decodyngTue, 07 Apr 2020 04:15:48 +00001509.09292journals/corr/DuvenaudMAGHAA153Convolutional Networks on Graphs for Learning Molecular FingerprintsCodyWildIf you read modern (that is, 2018-2020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of the lineage of the technique within this domain.
This 2015 paper, by Duvenaud et al, is the earliest one I can find. It focuses the entire paper on comparing differentiable, message-passing networks to the ...
https://shortscience.org/paper?bibtexKey=journals/corr/DuvenaudMAGHAA15#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/DuvenaudMAGHAA15#decodyngMon, 06 Apr 2020 16:05:21 +00001603.00856journals/corr/KearnesMBPR163Molecular Graph Convolutions: Moving Beyond FingerprintsCodyWildThis paper was published after the 2015 Duvenaud et al paper proposing a differentiable alternative to circular fingerprints of molecules: substituting out exact-match random hash functions to identify molecular structures with learned convolutional-esque kernels. As far as I can tell, the Duvenaud paper was the first to propose something we might today recognize as graph convolutions on atoms. I hoped this paper would build on that one, but it seems to be coming from a conceptually different di...
https://shortscience.org/paper?bibtexKey=journals/corr/KearnesMBPR16#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/KearnesMBPR16#decodyngMon, 06 Apr 2020 06:30:03 +00001608.04844journals/corr/1608.048442Boosting Docking-based Virtual Screening with Deep LearningCodyWildMy objective in reading this paper was to gain another perspective on, and thus a more well-grounded view of, machine learning scoring functions for docking-based prediction of ligand/protein binding affinity. As quick background context, these models are useful because many therapeutic compounds act by binding to a target protein, and it can be valuable to prioritize doing wet lab testing on compounds that are predicted to have a stronger binding affinity. Docking systems work by predicting the...
https://shortscience.org/paper?bibtexKey=journals/corr/1608.04844#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/1608.04844#decodyngSat, 04 Apr 2020 05:03:25 +00001910.02845journals/corr/1910.028453Combining docking pose rank and structure with deep learning improves protein-ligand binding mode predictionCodyWildThis paper focuses on the application of deep learning to the docking problem within rational drug design. The overall objective of drug design or discovery is to build predictive models of how well a candidate compound (or "ligand") will bind with a target protein, to help inform the decision of what compounds are promising enough to be worth testing in a wet lab. Protein binding prediction is important because many small-molecule drugs, which are designed to be small enough to get through cell...
https://shortscience.org/paper?bibtexKey=journals/corr/1910.02845#decodyng
https://shortscience.org/paper?bibtexKey=journals/corr/1910.02845#decodyngFri, 03 Apr 2020 05:28:05 +00001910.01708journals/corr/1910.017083Benchmarking Batch Deep Reinforcement Learning AlgorithmsRobert MüllerThe authors propose a unified setting to evaluate the performance of batch reinforcement learning algorithms. The proposed benchmark is discrete and based on the popular Atari Domain. The authors review and benchmark several current batch RL algorithms against a newly introduced version of BCQ (Batch Constrained Deep Q Learning) for discrete environments.
Note in line 5 that the policy chooses actions with a restricted argmax operation, eliminating actions that do not have enough support in the...
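A restricted argmax of this kind might look like the sketch below. It assumes that support is measured by the probability a learned behavior model assigns to each action, relative to the most likely action. The function name, the argument names and the default threshold `tau` are illustrative choices, not values confirmed by this summary.

```python
def restricted_argmax(q_values, behavior_probs, tau=0.3):
    """Pick the highest-valued action among sufficiently supported ones.

    q_values: estimated Q(s, a) for every discrete action a.
    behavior_probs: estimated probability of each action under the
        policy that generated the batch (assumed to come from a
        separately trained model).
    tau: relative-probability threshold in [0, 1).
    """
    max_prob = max(behavior_probs)
    # Keep actions whose probability, relative to the most likely
    # action, exceeds tau. The most likely action always survives.
    allowed = [a for a, p in enumerate(behavior_probs) if p / max_prob > tau]
    return max(allowed, key=lambda a: q_values[a])

q = [1.0, 5.0, 2.0]
probs = [0.60, 0.02, 0.38]
print(restricted_argmax(q, probs, tau=0.3))  # 2: action 1 is too rare in the batch
print(restricted_argmax(q, probs, tau=0.0))  # 1: no restriction, plain argmax
```

With `tau` close to 1 the rule approaches behavior cloning of the most likely action, and with `tau = 0` it reduces to ordinary greedy Q-learning action selection.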
https://shortscience.org/paper?bibtexKey=journals/corr/1910.01708#robertmueller
https://shortscience.org/paper?bibtexKey=journals/corr/1910.01708#robertmuellerFri, 27 Mar 2020 14:40:38 +0000conf/icml/FujimotoMP193Off-Policy Deep Reinforcement Learning without ExplorationRobert MüllerInteracting with the environment sometimes comes at a high cost, for example in high-stakes scenarios like health care or teaching. Thus, instead of learning online, we might want to learn from a fixed buffer $B$ of transitions, which is filled in advance from a behavior policy.
The authors show that several so-called off-policy algorithms, like DQN and DDPG, fail dramatically in this pure off-policy setting.
They attribute this to the extrapolation error, which occurs in the update of a value es...
https://shortscience.org/paper?bibtexKey=conf/icml/FujimotoMP19#robertmueller
https://shortscience.org/paper?bibtexKey=conf/icml/FujimotoMP19#robertmuellerWed, 25 Mar 2020 10:07:55 +00002003.05856journals/corr/2003.058565Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual LearningMassimo Cacciadisclaimer: I'm the first author of the paper
## TL;DR
We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol,
i.e. sequentially learning a stream of tasks and testing our models' capacity to remember them all.
We think it's time for a new approach to Continual Learning (CL), coined OSAKA, which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning.
main modifications we propose:
- bring CL cl...
https://shortscience.org/paper?bibtexKey=journals/corr/2003.05856#mcaccia
https://shortscience.org/paper?bibtexKey=journals/corr/2003.05856#mcacciaThu, 19 Mar 2020 16:41:59 +00001905.12558journals/corr/1905.125583Limitations of the Empirical Fisher Approximation for Natural Gradient DescentRobert MüllerIn this very well written paper, the authors analyse the relation between the Fisher $F(\theta) = \sum_n \mathbb{E}_{p_{\theta}(y \vert x_n)}[\nabla_{\theta} \log(p_{\theta}(y \vert x_n))\nabla_{\theta} \log(p_{\theta}(y \vert x_n))^T] $ and the empirical Fisher $\bar{F}(\theta) = \sum_n [\nabla_{\theta} \log(p_{\theta}(y_n \vert x_n))\nabla_{\theta} \log(p_{\theta}(y_n \vert x_n))^T] $, which has recently seen a surge in interest. The definitions differ in that $y_n$ is a training label instead of a samp...
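To make the difference between the two definitions concrete, here is a minimal sketch for binary logistic regression, where the expectation over the model's predictive distribution can be computed exactly by summing over $y \in \{0, 1\}$. The model choice, the helper names and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_log_p(theta, x, y):
    """Gradient of log p_theta(y | x) for logistic regression, y in {0, 1}."""
    p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p1) * xi for xi in x]

def add_outer(matrix, g, weight=1.0):
    """In place: matrix += weight * g g^T."""
    for i, gi in enumerate(g):
        for j, gj in enumerate(g):
            matrix[i][j] += weight * gi * gj

def fisher(theta, xs):
    """Sum over inputs of the expected outer product under the model's own p(y | x_n)."""
    F = [[0.0] * len(theta) for _ in theta]
    for x in xs:
        p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for y, p_y in ((0, 1.0 - p1), (1, p1)):
            add_outer(F, grad_log_p(theta, x, y), weight=p_y)
    return F

def empirical_fisher(theta, xs, ys):
    """Same outer products, but evaluated at the observed training labels y_n."""
    F_bar = [[0.0] * len(theta) for _ in theta]
    for x, y in zip(xs, ys):
        add_outer(F_bar, grad_log_p(theta, x, y))
    return F_bar

# One confidently misclassified point: the model predicts y = 1, the label is 0.
theta = [2.0, 0.0]
xs, ys = [[1.0, 0.0]], [0]
print(fisher(theta, xs)[0][0], empirical_fisher(theta, xs, ys)[0][0])
```

For this model the Fisher entry reduces to $p(1-p)$ while the empirical Fisher entry is the squared residual $(y_n - p)^2$, so the two diverge most when the model is confidently wrong.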
https://shortscience.org/paper?bibtexKey=journals/corr/1905.12558#robertmueller
https://shortscience.org/paper?bibtexKey=journals/corr/1905.12558#robertmuellerThu, 19 Mar 2020 08:59:52 +0000conf/nips/BafnaMV183Thwarting Adversarial Examples: An L_0-Robust Sparse Fourier TransformDavid StutzBafna et al. show that iterative hard thresholding results in $L_0$ robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse, and $e$ is assumed to be sparse. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically...
https://shortscience.org/paper?bibtexKey=conf/nips/BafnaMV18#davidstutz
https://shortscience.org/paper?bibtexKey=conf/nips/BafnaMV18#davidstutzSat, 14 Mar 2020 23:31:48 +00001809.08758journals/corr/1809.087582Low Frequency Adversarial PerturbationDavid StutzGuo et al. propose to augment black-box adversarial attacks with low-frequency noise to obtain low-frequency adversarial examples as shown in Figure 1. To this end, the boundary attack as well as the NES attack are modified to sample from a low-frequency Gaussian distribution instead of from Gaussian noise directly. This is achieved through an inverse discrete cosine transform as detailed in the paper.
Figure 1: Example of a low-frequency adversarial example.
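Sampling low-frequency noise through an inverse discrete cosine transform can be sketched roughly as follows. This is an assumption-laden illustration in pure Python: Gaussian coefficients are drawn only for the lowest `keep` frequencies in each direction, all others are set to zero, and an orthonormal inverse 2D DCT maps them to pixel space. The function names and the square single-channel block are illustrative, and the paper's exact sampling scheme may differ.

```python
import math
import random

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that coefficients = C @ signal."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def idct2(coeffs):
    """Inverse 2D DCT of a square block: image = C^T @ coeffs @ C."""
    n = len(coeffs)
    c = dct_matrix(n)
    tmp = [[sum(c[k][i] * coeffs[k][l] for k in range(n)) for l in range(n)]
           for i in range(n)]
    return [[sum(tmp[i][l] * c[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def low_frequency_noise(n, keep, rng):
    """Gaussian noise restricted to the keep x keep lowest DCT frequencies."""
    coeffs = [[rng.gauss(0.0, 1.0) if (k < keep and l < keep) else 0.0
               for l in range(n)] for k in range(n)]
    return idct2(coeffs)

noise = low_frequency_noise(8, keep=2, rng=random.Random(0))
print(len(noise), len(noise[0]))  # an 8 x 8 block of smooth noise
```

Setting `keep` equal to `n` recovers ordinary white Gaussian noise, since the transform is orthonormal, while small values of `keep` give smooth, blob-like perturbations.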
Also find this summary at [davidstut...
https://shortscience.org/paper?bibtexKey=journals/corr/1809.08758#davidstutz
https://shortscience.org/paper?bibtexKey=journals/corr/1809.08758#davidstutzSat, 14 Mar 2020 23:27:21 +000010.1109/cvprw.2018.002123Semantic Adversarial ExamplesDavid StutzHosseini and Poovendran propose semantic adversarial examples by randomly manipulating hue and saturation of images. In particular, in an iterative algorithm, hue and saturation are randomly perturbed and projected back to their valid range. If this results in mis-classification the perturbed image is returned as the adversarial example and the algorithm is finished; if not, another iteration is run. The result is shown in Figure 1. As can be seen, the structure of the images is retained while h...
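The iterative procedure described above can be sketched as a simple random search. This is a minimal illustration assuming images are lists of RGB triples in [0, 1], that a single hue shift and a single saturation shift are applied to the whole image per trial, and that `classify` is a hypothetical stand-in for the attacked model. None of these names come from the paper.

```python
import colorsys
import random

def shift_colors(pixels, d_hue, d_sat):
    """Shift hue and saturation of every pixel, leaving the value channel untouched."""
    out = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + d_hue) % 1.0                 # hue is circular
        s = min(1.0, max(0.0, s + d_sat))     # project saturation back to [0, 1]
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out

def semantic_attack(pixels, true_label, classify, max_trials=1000, rng=None):
    """Try random hue/saturation shifts until the classifier's label changes."""
    rng = rng or random.Random()
    for _ in range(max_trials):
        candidate = shift_colors(pixels, rng.uniform(0.0, 1.0), rng.uniform(-1.0, 1.0))
        if classify(candidate) != true_label:
            return candidate
    return None  # no adversarial example found within the trial budget

# Toy stand-in classifier: "red" if the red channel dominates blue on average.
def toy_classify(pixels):
    red = sum(p[0] for p in pixels)
    blue = sum(p[2] for p in pixels)
    return "red" if red > blue else "blue"

image = [(0.9, 0.1, 0.1)] * 4
adversarial = semantic_attack(image, "red", toy_classify, rng=random.Random(0))
print(toy_classify(adversarial))
```

Because only hue and saturation change, the value (brightness) channel of every pixel is preserved, which is one way to see why the structure of the image survives the attack.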
https://shortscience.org/paper?bibtexKey=10.1109/cvprw.2018.00212#davidstutz
https://shortscience.org/paper?bibtexKey=10.1109/cvprw.2018.00212#davidstutzSat, 14 Mar 2020 23:17:20 +0000conf/icml/KarmonZG182LaVAN: Localized and Visible Adversarial NoiseDavid StutzKarmon et al. propose a gradient-descent-based method for obtaining localized adversarial examples that resemble adversarial patches. In particular, after selecting a region of the image to be modified, several iterations of gradient descent are run in order to maximize the probability of the target class and simultaneously minimize the probability of the true class. After each iteration, the perturbation is masked to the patch and projected onto the valid range of [0,1] for images. On ImageNet, the author...
https://shortscience.org/paper?bibtexKey=conf/icml/KarmonZG18#davidstutz
https://shortscience.org/paper?bibtexKey=conf/icml/KarmonZG18#davidstutzSat, 14 Mar 2020 23:13:00 +00001904.00759journals/corr/abs-1904-007592Adversarial camera stickers: A physical camera-based attack on deep learning systemsDavid StutzLi et al. propose camera stickers that, when computed adversarially and physically attached to the camera, lead to mis-classification. As illustrated in Figure 1, these stickers are realized using circular patches of uniform color. These individual circular stickers are computed in a gradient-descent fashion by optimizing their location, color and radius. The influence of the camera on these stickers is modeled realistically in order to guarantee success.
Figure 1: Illustration of adversarial s...
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1904-00759#davidstutz
https://shortscience.org/paper?bibtexKey=journals/corr/abs-1904-00759#davidstutzSat, 14 Mar 2020 22:54:51 +000010.1109/wacv.2019.001432Local Gradients Smoothing: Defense Against Localized Adversarial AttacksDavid StutzNaseer et al. propose to smooth local gradients as a defense against adversarial patches. In particular, as illustrated in Figure 1, the local image gradient is computed through convolution. Then, in local, overlapping windows, the gradients are set to zero if the total sum of absolute gradient values exceeds a specific threshold. The remaining gradient map is supposed to indicate regions where it is likely that adversarial patches can be found. Using this gradient map, the image is smoothed, i.e....
https://shortscience.org/paper?bibtexKey=10.1109/wacv.2019.00143#davidstutz
https://shortscience.org/paper?bibtexKey=10.1109/wacv.2019.00143#davidstutzSat, 14 Mar 2020 22:51:20 +0000