![]() |
Welcome to ShortScience.org! |
![]() ![]() ![]() |
[link]
# Object detection system overview. https://i.imgur.com/vd2YUy3.png 1. takes an input image, 2. extracts around 2000 bottom-up region proposals, 3. computes features for each proposal using a large convolutional neural network (CNN), and then 4. classifies each region using class-specific linear SVMs. * R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. * On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat , which had the previous best result at 24.3%. ## There is a 2 challenges faced in object detection 1. localization problem 2. labeling the data 1 localization problem : * One approach frames localization as a regression problem. they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method. * An alternative is to build a sliding-window detector. considered adopting a sliding-window approach increases the number of convolutional layers to 5, have very large receptive fields (195 x 195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm. 2 labeling the data: * The conventional solution to this problem is to use unsupervised pre-training, followed by supervise fine-tuning * supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL), * fine-tuning for detection improves mAP performance by 8 percentage points. * Stochastic gradient descent via back propagation was used to effective for training convolutional neural networks (CNNs) ## Object detection with R-CNN This system consists of three modules * The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. * The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. * The third module is a set of class specific linear SVMs. Module design 1 Region proposals * which detect mitotic cells by applying a CNN to regularly-spaced square crops. * use selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute). * the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) 2 Feature extraction. * extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN * Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers. * warp all pixels in a tight bounding box around it to the required size * The feature matrix is typically 2000x4096 3 Test time detection * At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). * warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. * Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over union (IoU) overlap with a higher scoring selected region larger than a learned threshold. ## Training 1 Supervised pre-training: * pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data) 2 Domain-specific fine-tuning. * use the stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals with learning rate of 0.001. 3 Object category classifiers. * use intersection-over union (IoU) overlap threshold method to label a region with The overlap threshold of 0.3. * Once features are extracted and training labels are applied, we optimize one linear SVM per class. * adopt the standard hard negative mining method to fit large training data in memory. ### Results on PASCAL VOC 201012 1 VOC 2010 * compared against four strong baselines including SegDPM, DPM, UVA, Regionlets. * Achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster https://i.imgur.com/0dGX9b7.png 2 ILSVRC2013 detection. * ran R-CNN on the 200-class ILSVRC2013 detection dataset * R-CNN achieves a mAP of 31.4% https://i.imgur.com/GFbULx3.png #### Performance layer-by-layer, without fine-tuning 1 pool5 layer * which is the max pooled output of the network’s fifth and final convolutional layer. *The pool5 feature map is 6 x6 x 256 = 9216 dimensional * each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input 2 Layer fc6 * fully connected to pool5 * it multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases 3 Layer fc7 * It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification #### Performance layer-by-layer, with fine-tuning * CNN’s parameters fine-tuned on PASCAL. * fine-tuning increases mAP by 8.0 % points to 54.2% ### Network architectures * 16-layer deep network, consisting of 13 layers of 3 _ 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet. * RCNN with O-Net substantially outperforms R-CNN with TNet, increasing mAP from 58.5% to 66.0% * drawback in terms of compute time, with in terms of compute time, with than T-Net. 1 The ILSVRC2013 detection dataset * dataset is split into three sets: train (395,918), val (20,121), and test (40,152) #### CNN features for segmentation. * full R-CNN: The first strategy (full) ignores the re region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap. * fg R-CNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. * full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features https://i.imgur.com/n1bhmKo.png ![]()
1 Comments
|
[link]
FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative). The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other persons image. ## LMNN Large Margin Nearest Neighbor (LMNN) is learning a pseudo-metric $$d(x, y) = (x -y) M (x -y)^T$$ where $M$ is a positive-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold. ## Curriculum Learning: Triplet selection Show simple examples first, then increase the difficulty. This is done by selecting the triplets. They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low. They want to have $$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$ where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$. ## Tasks * **Face verification**: Is this the same person? * **Face recognition**: Who is this person? ## Datasets * 99.63% accuracy on Labeled FAces in the Wild (LFW) * 95.12% accuracy on YouTube Faces DB ## Network Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14). ## See also * [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma) ![]() |
[link]
The last two years have seen a number of improvements in the field of language model pretraining, and BERT - Bidirectional Encoder Representations from Transformers - is the most recent entry into this canon. The general problem posed by language model pretraining is: can we leverage huge amounts of raw text, which aren’t labeled for any specific classification task, to help us train better models for supervised language tasks (like translation, question answering, logical entailment, etc)? Mechanically, this works by either 1) training word embeddings and then using those embeddings as input feature representations for supervised models, or 2) treating the problem as a transfer learning problem, and fine-tune to a supervised task - similar to how you’d fine-tune a model trained on ImageNet by carrying over parameters, and then training on your new task. Even though the text we’re learning on is strictly speaking unsupervised (lacking a supervised label), we need to design a task on which we calculate gradients in order to train our representations. For unsupervised language modeling, that task is typically structured as predicting a word in a sequence given prior words in that sequence. Intuitively, in order for a model to do a good job at predicting the word that comes next in a sentence, it needs to have learned patterns about language, both on grammatical and semantic levels. A notable change recently has been the shift from learning unconditional word vectors (where the word’s representation is the same globally) to contextualized ones, where the representation of the word is dependent on the sentence context it’s found in. All the baselines discussed here are of this second type. The two main baselines that the BERT model compares itself to are OpenAI’s GPT, and Peters et al’s ELMo. The GPT model uses a self-attention-based Transformer architecture, going through each word in the sequence, and predicting the next word by calculating an attention-weighted representation of all prior words. (For those who aren’t familiar, attention works by multiplying a “query” vector with every word in a variable-length sequence, and then putting the outputs of those multiplications into a softmax operator, which inherently gets you a weighting scheme that adds to one). ELMo uses models that gather context in both directions, but in a fairly simple way: it learns one deep LSTM that goes from left to right, predicting word k using words 0-k-1, and a second LSTM that goes from right to left, predicting word k using words k+1 onward. These two predictions are combined (literally: just summed together) to get a representation for the word at position k. https://i.imgur.com/2329e3L.png BERT differs from prior work in this area in several small ways, but one primary one: instead of representing a word using only information from words before it, or a simple sum of prior information and subsequent information, it uses the full context from before and after the word in each of its multiple layers. It also uses an attention-based Transformer structure, but instead of incorporating just prior context, it pulls in information from the full sentence. To allow for a model that actually uses both directions of context at a time in its unsupervised prediction task, the authors of BERT slightly changed the nature of that task: it replaces the word being predicted with the “mask” token, so that even with multiple layers of context aggregation on both sides, the model doesn’t have any way of knowing what the token is. By contrast, if it weren’t masked, after the first layer of context aggregation, the representations of other words in the sequence would incorporate information about the predicted word k, making it trivial, if another layer were applied on top of that first one, for the model to directly have access to the value it’s trying to predict. This problem can either be solved by using multiple layers, each of which can only see prior context (like GPT), by learning fully separate L-R and R-L models, and combining them at the final layer (like ELMo) or by masking tokens, and predicting the value of the masked tokens using the full remainder of the context. This task design crucially allows for a multi-layered bidirectional architecture, and consequently a much richer representation of context in each word’s pre-trained representation. BERT demonstrates dramatic improvements over prior work when fine tuned on a small amount of supervised data, suggesting that this change added substantial value. ![]() |
[link]
[Machine learning is a nuanced, complicated, intellectually serious topic...but sometimes it’s refreshing to let ourselves be a bit less serious, especially when it’s accompanied by clear, cogent writing on a topic. This particular is a particularly delightful example of good-natured silliness - from the dataset name HellaSwag to figures containing cartoons of BERT and ELMO representing language models - combined with interesting science.] https://i.imgur.com/CoSeh51.png This paper tackles the problem of natural language comprehension, which asks: okay, our models can generate plausible looking text, but do they actually exhibit what we would consider true understanding of language? One natural structure of task for this is to take questions or “contexts”, and, given a set of possible endings or completion, pick the correct one. Positive examples are relatively easy to come by: adjacent video captions and question/answer pairs from WikiHow are two datasets used in this paper. However, it’s more difficult to come up with *negative* examples. Even though our incorrect endings won’t be a meaningful continuation of the sentence, we want them to be “close enough” that we can feel comfortable attributing a model’s ability to pick the correct answer as evidence of some meaningful kind of comprehension. As an obvious failure mode, if the alternative multiple choice options were all the same word repeated ten times, that would be recognizable as being not real language, and it would be easy for a model to select the answer with the distributional statistics of real language, but it wouldn’t prove much. Typically failure modes aren’t this egregious, but the overall intuition still holds, and will inform the rest of the paper: your ability to test comprehension on a given dataset is a function of how contextually-relevant and realistic your negative examples are. Previous work (by many of the same authors as are on this paper), proposed a technique called Adversarial Filtering to try to solve this problem. In Adversarial Filtering, a generative language model is used to generate possible many endings conditioned on the input context, to be used as negative examples. Then, a discriminator is trained to predict the correct ending given the context. The generated samples that the discriminator had the highest confidence classifying as negative are deemed to be not challenging enough comparisons, and they’re thrown out and replaced with others from our pool of initially-generated samples. Eventually, once we’ve iterated through this process, we have a dataset with hopefully realistic negative examples. The negative examples are then given to humans to ensure none of them are conceptually meaningful actual endings to the sentence. The dataset released by the earlier paper, which used as it’s generator a relatively simple LSTM model, was called Swag. However, the authors came to notice that the performance of new language models (most centrally BERT) on this dataset might not be quite what it appears: its success rate of 86% only goes down to 76% if you don’t give the classifier access to the input context, which means it can get 76% (off of a random baseline of 25%, with 4 options) simply by evaluating which endings are coherent as standalone bits of natural language, without actually having to understand or even see the context. Also, shuffling the words in the words in the possible endings had a similarly small effect: the authors are able to get BERT to perform at 60% accuracy on the SWAG dataset with no context, and with shuffled words in the possible answers, meaning it’s purely selecting based on the distribution of words in the answer, rather than on the meaningfully-ordered sequence of words. https://i.imgur.com/f6vqJWT.png The authors overall conclusion with this is: this adversarial filtering method is only as good as the generator, and, more specifically, the training dynamic between the generator that produces candidate endings, and the discriminator that filters them. If the generator is too weak, the negative examples can be easily detected as fake by a stronger model, but if the generator is too strong, then the discriminator can’t get good enough to usefully contribute by weeding samples out. They demonstrate this by creating a new version of Swag, which they call HellaSwag (for the expected acronym-optimization reasons), with a GPT generator rather than the simpler one used before: on this new dataset, all existing models get relatively poor results (30-40% performance). However, the authors’ overall point wasn’t “we’ve solved it, this new dataset is the end of the line,” but rather a call in the future to be wary, and generally aware that with benchmarks like these, especially with generated negative examples, it’s going to be a constantly moving target as generation systems get better. ![]() |
[link]
This paper focuses on the application of deep learning to the docking problem within rational drug design. The overall objective of drug design or discovery is to build predictive models of how well a candidate compound (or "ligand") will bind with a target protein, to help inform the decision of what compounds are promising enough to be worth testing in a wet lab. Protein binding prediction is important because many small-molecule drugs, which are designed to be small enough to get through cell membranes, act by binding to a specific protein within a disease pathway, and thus blocking that protein's mechanism. The formulation of the docking problem, as best I understand it, is: 1. A "docking program," which is generally some model based on physical and chemical interactions, takes in a (ligand, target protein) pair, searches over a space of ways the ligand could orient itself within the binding pocket of the protein (which way is it facing, where is it twisted, where does it interact with the protein, etc), and ranks them according to plausibility 2. A scoring function takes in the binding poses (otherwise known as binding modes) ranked the highest, and tries to predict the affinity strength of the resulting bond, or the binary of whether a bond is "active". The goal of this paper was to interpose modern machine learning into the second step, as alternative scoring functions to be applied after the pose generation . Given the complex data structure that is a highly-ranked binding pose, the hope was that deep learning would facilitate learning from such a complicated raw data structure, rather than requiring hand-summarized features. They also tested a similar model structure on the problem of predicting whether a highly ranked binding pose was actually the empirically correct one, as determined by some epsilon ball around the spatial coordinates of the true binding pose. Both of these were binary tasks, which I understand to be 1. Does this ranked binding pose in this protein have sufficiently high binding affinity to be "active"? This is known as the "virtual screening" task, because it's the relevant task if you want to screen compounds in silico, or virtually, before doing wet lab testing. 2. Is this ranked binding pose the one that would actually be empirically observed? This is known as the "binding mode prediction" task The goal of this second task was to better understand biases the researchers suspected existed in the underlying dataset, which I'll explain later in this post. The researchers used a graph convolution architecture. At a (very) high level, graph convolution works in a way similar to normal convolution - in that it captures hierarchies of local patterns, in ways that gradually expand to have visibility over larger areas of the input data. The distinction is that normal convolution defines kernels over a fixed set of nearby spatial coordinates, in a context where direction (the pixel on top vs the pixel on bottom, etc) is meaningful, because photos have meaningful direction and orientation. By contrast, in a graph, there is no "up" or "down", and a given node doesn't have a fixed number of neighbors (whereas a fixed pixel in 2D space does), so neighbor-summarization kernels have to be defined in ways that allow you to aggregate information from 1) an arbitrary number of neighbors, in 2) a manner that is agnostic to orientation. Graph convolutions are useful in this problem because both the summary of the ligand itself, and the summary of the interaction of the posed ligand with the protein, can be summarized in terms of graphs of chemical bonds or interaction sites. Using this as an architectural foundation, the authors test both solo versions and ensembles of networks: https://i.imgur.com/Oc2LACW.png 1. "L" - A network that uses graph convolution to summarize the ligand itself, with no reference to the protein it's being tested for binding affinity with 2. "LP" - A network that uses graph convolution on the interaction points between the ligand and protein under the binding pose currently being scored or predicted 3. "R" - A simple network that takes into account the rank assigned to the binding pose by the original docking program (generally used in combination with one of the above). The authors came to a few interesting findings by trying different combinations of the above model modules. First, they found evidence supporting an earlier claim that, in the dataset being used for training, there was a bias in the positive and negative samples chosen such that you could predict activity of a ligand/protein binding using *ligand information alone.* This shouldn't be possible if we were sampling in an unbiased way over possible ligand/protein pairs, since even ligands that are quite effective with one protein will fail to bind with another, and it shouldn't be informationally possible to distinguish the two cases without protein information. Furthermore, a random forest on hand-designed features was able to perform comparably to deep learning, suggesting that only simple features are necessary to perform the task on this (bias and thus over-simplified) Specifically, they found that L+LP models did no better than models of L alone on the virtual screening task. However, the binding mode prediction task offered an interesting contrast, in that, on this task, it's impossible to predict the output from ligand information alone, because by construction each ligand will have some set of binding modes that are not the empirically correct one, and one that is, and you can't distinguish between these based on ligand information alone, without looking at the actual protein relationship under consideration. In this case, the LP network did quite well, suggesting that deep learning is able to learn from ligand-protein interactions when it's incentivized to do so. Interestingly, the authors were only able to improve on the baseline model by incorporating the rank output by the original docking program, which you can think of an ensemble of sorts between the docking program and the machine learning model. Overall, the authors' takeaways from this paper were that (1) we need to be more careful about constructing datasets, so as to not leak information through biases, and (2) that graph convolutional models are able to perform well, but (3) seem to be capturing different things than physics-based models, since ensembling the two together provides marginal value. ![]() |