[link]
**Introduction**
* A salient region is an area where a striking combination of features in an image is perceived at first observation.
* These features combine to make the region stand out significantly from the other areas in the image.
* This paper builds a saliency map from a linear combination of channels in a high-dimensional color representation space.

**Related work**
* Existing work on saliency detection falls into two groups: methods based on low-level features and statistical learning methods. Both approaches produce a variety of results, with no clear evidence that one performs better than the other.
* In the first, low-level group, there are approaches that detect salient regions from color contrast, with a Markovian formulation, or with multi-scale saliency based on superpixels; their drawbacks involve the presence of high-contrast edges, unpreserved boundaries, and sensitivity to segmentation parameters.
* Learning-based methods include salient region detection based on regional descriptors, graph-based formulations, and sampled patches.

**Approach**
* The paper provides a method for mapping the low-dimensional RGB, CIELab, and HSV color spaces into a high-dimensional color representation.
* The method extracts superpixel saliency features using the SLIC method with 500 superpixels.
* The feature vector starts from the superpixel locations, which are then combined with color before the color space representations are computed.
* Color histogram features use 8 bins per histogram, and histogram distances are computed with the Chi-squared distance.
* Global and local contrast are measured with the Euclidean distance, using a variance parameter of 0.25.
* A 31-dimensional histogram of gradients extracts shape and texture features for each superpixel. Because backgrounds tend to be more blurred, background separation uses a singular value feature, which follows the concept of eigenimages with weights obtained via singular value decomposition.
* A 75-dimensional feature vector is then obtained for saliency detection, combining the feature maps from all the superpixel operations above: location features, color features, color histogram features, color contrast features, and shape and texture features.
* A regression algorithm is then applied to the feature vectors. Since large databases of around 2000 images are used, the best-performing approach is a random forest, with 200 trees and an unlimited number of nodes.
* To construct the transformation into high-dimensional color, a trimap is built by dividing the initial saliency map into three grids of subregions: 2x2, 3x3, and 4x4. Seven levels of adaptive thresholding are then applied to each subregion, producing a 21-level locally thresholded map.
* A global threshold constructs the trimap from these local levels: where the level is greater than or equal to 18 the map is set to 1, where it is less than or equal to 6 the map is set to 0, and the remaining pixels form the unknown region. https://i.imgur.com/4eF1UZd.png
* High-dimensional color is used because it combines the benefits of several color representations: nonlinear RGB and its gradients, the CIELab representation, and the saturation and hue channels of HSV. This produces 11 color channels.
* Linear RGB values are not included because they counteract the YUV/YIQ color space.
* Gamma correction with exponents from 0.5 to 2 in steps of 0.5 is then applied, generating a 44-dimensional high-dimensional color vector.
* The background that was separated from the foreground in the trimap is then used to estimate the linear combination of color coefficients by solving a least squares problem between two terms: a binary vector in which 0 represents background and 1 represents foreground, and the color samples multiplied by a coefficient vector (a minimal sketch of this step follows this summary).
* The saliency map is then built by summing the color samples weighted by the estimated coefficient vector. The whole method is iterated three times to construct a stable saliency map.
* The map is then refined with spatial information by giving more weight to pixels in the foreground region. The weight is an exponential, with a parameter of 0.5, of the minimum Euclidean distance to both the foreground and the background. The pipeline is summarized in the image below: https://i.imgur.com/k9heDiw.png
* The algorithm is evaluated on three datasets: 5000 images in the MSRA dataset, 1000 images in the ECSSD dataset, and 643 multi-object images in the Interactive Co-segmentation dataset.
* The method is compared against existing saliency detection algorithms, LC, HC, LR, SF, HS, GMR, and DRFI, using precision and recall measurements.
* Performance is evaluated with the F-measure, $F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}$, where the quadratic parameter $\beta^2$ is set to 0.3.
* The algorithm's performance ranks second-best compared to all the other methods.

**Notes on The Paper**
* The paper provides an algorithm that performs well at detecting saliency from color, by generating high-dimensional features out of low-level features.
* In the high-dimensional color space, salient regions can be separated from the background (Rahman, I et al. 2016).
* If the algorithm is further developed with a classifier, it could integrate richer features, since a higher-dimensional representation is available (Borji, A et al. 2017).
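To make the coefficient-estimation step above concrete, here is a minimal sketch under the summary's description (my own illustration, not the paper's code; `colors` and `labels` are hypothetical names for the 44-dimensional color samples of the trimap-labeled pixels and their binary foreground/background labels):

```python
import numpy as np

def estimate_saliency_coefficients(colors, labels):
    """Least-squares fit of a linear combination of high-dimensional color
    channels to the binary trimap labels: minimize ||colors @ coeffs - labels||^2.

    colors: (num_pixels, 44) array of high-dimensional color features
    labels: (num_pixels,) array with 1 for foreground, 0 for background
    """
    coeffs, *_ = np.linalg.lstsq(colors, labels, rcond=None)
    return coeffs

def saliency_from_coefficients(colors, coeffs):
    """The saliency value of each pixel is the linear combination of its
    color sample with the estimated coefficients."""
    return colors @ coeffs
```

In the paper's pipeline this fit would be repeated over the three iterations mentioned above, re-deriving the trimap labels from the updated saliency map each time.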
[link]
**Introduction:** A corner, as a feature cue in an image, is defined by the intersection of two edges. This definition allows the cue to be located precisely, although it is only valid when locality is maintained and the result is close to the real corner location.

**Related work:**
* Existing detectors include SIFT, which uses a Difference of Gaussians, and SURF, which uses Haar wavelets to approximate the Hessian determinant. These methods have the drawback of high computational cost.
* FAST is a corner detector that performs better than conventional corner detectors such as Harris corner detection. Its drawback is that it depends on the environment, so the decision tree always needs to be constructed from scratch (with the greedy ID3 algorithm).

**Approach:**
* To detect corners, pixels on a discretized circle are compared with the center pixel. A circle with a diameter of 3.4 pixels is used as the test mask.
* Based on the accelerated segment test, a pixel is identified as a corner when a number of circle pixels are all darker or all lighter than the center pixel by a threshold (see the sketch at the end of this summary).
* The number of pixels used in this paper is the same as in the FAST-9 method, i.e. a segment size of nine, since it gives the best performance.
* This segment size has the highest repeatability, i.e. it detects corners correctly most consistently when different viewpoints are applied.
* The FAST algorithm builds a ternary tree with three possible states (darker, lighter, or similar) plus an unknown state, which leads to a $4^N$ configuration space; FAST-9 differs in the circle's thickness, which is increased to 3 pixels.
* The proposed corner detector tests one pixel at a time with one question posed per test. Given a pixel scenario, once the question is evaluated the response determines the next pixel and question in the query.
* The algorithm expands the configuration space by adding "not lighter" and "not darker" states, which produces a binary tree representation in which an evaluation can be made at each node; the configuration space therefore has size $6^N$.
* Memory access has three types of cost: a second comparison of the same pixel, a test of a pixel in the same row, and a test of any other pixel.
* To make the decision tree optimal, a method resembling backward induction is formulated. The configuration space is explored with a depth-first search, where each leaf is labeled according to whether it satisfies the corner criterion of the accelerated segment test.
* The cost of each node is calculated by summing the minimum costs over each pair of children (positive and negative response), weighted by the pixel-state probabilities of both parent and child.
* The algorithm estimates the probability that an image region is homogeneous or structured, and then builds a decision tree accordingly. The probability distribution contains three probabilities, including the mirrored-state probability and the similar-state probability. https://i.imgur.com/lIHh7gL.png
* The algorithm is made generic by jumping from one optimized tree to another, based on the configuration of the corresponding leaf, once the termination condition based on the corner criterion is reached.
* This switching has no cost because it happens in a leaf; however, it adds a delay of one test.
Thus, AGAST can only perform worse than FAST when it has to jump between homogeneous and heterogeneous pixels repeatedly.
* Corner detection is compared using three pixel-mask sizes: 4, 12, and 16. The comparison is done by applying Gaussian noise and blur to a database of checkerboard images taken from a variety of viewpoints.
* Since computation on conventional computers is limited, the four-state configuration space is used with three different arc lengths: 9, 10, and 11.
* The larger the mask and arc length, the more features are found; a small arc pins down the location of the real corner.
* A large pattern leads to slower computation, because more processing is needed to detect the corner or evaluate features, and it needs more memory.
* A smaller pattern can also suppress feature detection when features are located close to each other; therefore, for the smaller patterns the post-processing step is removed.
* When Gaussian blur and noise are added to the database, the combination of a pixel mask of 16 and an arc length of 9 is most robust to the disturbance; repeatability is therefore controlled by the arc length.
* The decision tree is also evaluated by computing the response time of corner detection, i.e. the number of tests over the possible pixel arrangements in a mask.
* Pixel arrangements that are closely similar are grouped. Looking at the standard deviation, the group with a large number of similar pixels has an unbalanced decision tree, because the possible pixel arrangements are limited. This is shown here: https://i.imgur.com/DLqCXhy.png
* However, when a single tree is used, the adaptive tree performs better than the conventional method.
* Comparing trees with different weights, the tree-switching algorithm (AGAST) is optimal when the weights are 0.1 and 1/3.
* The performance of AGAST is also compared with FAST-9, where uniform probability distributions are used to build the trees.
* Both algorithms are tested on five scenes: laboratory, outdoor, indoor, aerial, and medical. The optimized tree speeds corner detection up by 13$\%$, while AGAST gives speedups from 23$\%$ to over 30$\%$.

**Paper contributions**
* The paper provides an improved FAST corner detector that dynamically adapts to its environment while processing an image.
* The AGAST method improves on its predecessors in both the time spent processing an image and the memory used.
* It is also able to find more keypoints in corner detection.
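As a rough illustration of the accelerated segment test referenced above, here is a minimal sketch (my own simplification, not the paper's optimized decision tree; the 16-pixel circle, arc length of 9, and threshold are the usual FAST-9-style defaults, `image` is assumed to be a 2D grayscale array, and border handling is omitted):

```python
import numpy as np

# Offsets of the 16-pixel Bresenham circle of radius 3 around the candidate pixel.
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_corner(image, y, x, threshold=20, arc_length=9):
    """Accelerated segment test: the pixel is a corner if `arc_length`
    contiguous circle pixels are all brighter than center + threshold
    or all darker than center - threshold."""
    center = int(image[y, x])
    ring = np.array([int(image[y + dy, x + dx]) for dy, dx in CIRCLE])
    brighter = ring > center + threshold
    darker = ring < center - threshold
    for state in (brighter, darker):
        doubled = np.concatenate([state, state])  # handle arcs that wrap around
        run = 0
        for flag in doubled:
            run = run + 1 if flag else 0
            if run >= arc_length:
                return True
    return False
```

The paper's contribution is not this test itself but how the sequence of per-pixel comparisons is ordered: AGAST builds optimal decision trees over these questions and switches between trees specialized for homogeneous and structured image regions.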
[link]
Given the tasks that RL is typically used to perform, it can be easy to equate the problem of reinforcement learning with "learning dynamically, online, as you take actions in an environment". And while this does represent most RL problems in the literature, it is possible to learn a reinforcement learning system in an off-policy way (read: trained off of data that the policy itself didn't collect), and there can be compelling reasons to prefer this approach. In this paper, which seeks to train a chatbot to learn from implicit human feedback in text interactions, the authors note prior bad experiences with Microsoft's Tay bot, and highlight the value of being able to test and validate a learned model offline, rather than have it continue to learn in a deployment setting. This problem, of learning an RL model off of pre-collected data, is known as batch RL. In this setting, the batch is collected by simply using a pretrained language model to generate interactions with a human, and then extracting reward from these interactions to train a Q learning system once the data has been collected.

If naively applied, Q learning (a good approach for off-policy problems, since it directly estimates the value of states and actions rather than of a policy) can lead to some undesirable results in a batch setting. An interesting one, that hadn't occurred to me, was the fact that Q learning translates its (state, action) reward model into a policy by taking the action associated with the highest reward. This is a generally sensible thing to do if you've been able to gather data on all or most of a state space, but it can also bias the model to taking actions that it has less data for, because high-variance estimates will tend to make up a disproportionate amount of the maximum values of any estimated distribution. One approach to this is to learn two separate Q functions, take the minimum over them, and then take the max of that across actions (in this case: words in a sentence being generated). The idea here is that low-data, high-variance parts of state space might have one estimate be high, but might have the other be low, because of that high variance. However, it's costly to train and run two separate models. Instead, the authors here propose the simpler solution of training a single model with dropout, and using multiple "draws" from that model to simulate a distribution over Q value estimates. This will have a similar effect of penalizing actions whose estimate varies across different dropout masks (which can be hand-wavily thought of as different models).

The authors also add a term to their RL training that penalizes divergence from the initial language model that they used to collect the data, and that is also the initialization point for the parameters of the model. This is done via KL-divergence control: the model is penalized for outputting a distribution over words that is different, in distributional-metric terms, from what the language model would have output. This makes it costlier for the model to diverge from the pretrained model, and should lead to it only happening in cases of convincing high reward.

Out of these two approaches, it seems like the former is more convincing to me as a general-purpose method to use in batch RL settings.
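To make the first of those two ideas concrete, here is a minimal sketch of the dropout-as-ensemble Q estimate (my own illustration, not the paper's code; `q_net` is a hypothetical PyTorch module with dropout layers that maps a state encoding to a vector of per-token Q values):

```python
import torch

@torch.no_grad()
def conservative_q_action(q_net, state, num_samples=8):
    """Approximate a pessimistic Q estimate by sampling several dropout masks
    and taking the per-action minimum, then act greedily on that lower bound."""
    q_net.train()  # keep dropout active so each forward pass uses a different mask
    samples = torch.stack([q_net(state) for _ in range(num_samples)])  # (S, num_tokens)
    q_lower = samples.min(dim=0).values   # pessimistic value for each candidate token
    return int(q_lower.argmax())          # token index with the best lower bound
```

Taking the minimum across dropout draws plays roughly the same role as the minimum over two separately trained Q networks, at the cost of a few extra forward passes rather than a whole second model.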
The latter is definitely something I would have expected to work well (and, indeed, KL-controlled models performed much better in empirical tests in the paper!), but more simply because language modeling is hard, and I would expect it to be good to constrain a model to be close to realistic outputs, since the sentiment-based reward signal won't reward realism directly. This seems more like something generally useful for avoiding catastrophic forgetting when switching from an old task to a new one (language modeling to sentiment modeling), rather than a particularly batch-RL-centric innovation.

https://i.imgur.com/EmInxOJ.png

An interesting empirical observation of this paper is that models without language-model control end up drifting away from realism, and repeatedly exploit part of the reward function that, in addition to sentiment, gave points for asking questions. By contrast, the KL-controlled models appear to have avoided falling into this local minimum, and instead generated realistic language that was polite and empathetic. (Obviously this is still a simplified approximation of what makes a good chat bot, but it's at least a higher degree of complexity in its response to reward).

Overall, I quite enjoyed this paper, both for its thoughtfulness and its clever application of engineering to use RL for a problem well outside of its more typical domain.
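For completeness, here is a similarly minimal sketch of the KL-control term (again my own illustration, under the assumption that both the trained model and the frozen pretrained language model expose next-token logits and that the penalty is folded into the per-step reward; `kl_weight` is a hypothetical coefficient):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def kl_penalized_reward(task_reward, policy_logits, prior_logits, kl_weight=0.1):
    """Shape the per-step reward by subtracting the KL divergence between the
    policy's next-token distribution and the pretrained language model's, so
    drifting away from realistic language is only worth it for large rewards."""
    kl = kl_divergence(Categorical(logits=policy_logits),
                       Categorical(logits=prior_logits))
    return task_reward - kl_weight * kl
```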
[link]
At a high level, this paper is a massive (34 pgs!) and highly-resourced study of many nuanced variations of language pretraining tasks, to see which of those variants produce models that transfer the best to new tasks. As a result, it doesn't lend itself *that* well to being summarized into a central kernel of understanding. So, I'm going to do my best to pull out some high-level insights, and recommend you read the paper in more depth if you're working particularly in language pretraining and want to get the details.

The goals here are simple: create a standardized task structure and a big dataset, so that you can use the same architecture across a wide range of objectives and subsequent transfer tasks, and thus actually compare tasks on equal footing. To that end, the authors created a huge dataset by scraping internet text, and filtering it according to a few common sense criteria. This is an important and laudable task, but not one with a ton of conceptual nuance to it.

https://i.imgur.com/5z6bN8d.png

A more interesting structural choice was to adopt a unified text-to-text framework for all of the tasks they might want their pretrained model to transfer to. This means that the input to the model is always a sequence of tokens, and so is the output. If the task is translation, the input sequence might be "translate english to german: build a bed" and the desired output would be that sentence in German. This gets particularly interesting as a change when it comes to tasks where you're predicting relationships of words within sentences, and would typically have a categorical classification loss, which is changed here to predicting the word of the correct class. This restructuring doesn't seem to hurt performance, and has the nice side effect that you can directly use the same model as a transfer starting point for all tasks, without having to add additional layers. Some of the transfer tasks include: translation, sentiment analysis, summarization, grammatical checking of a sentence, and checking the logical relationship between claims. All tested models followed a transformer (i.e. fully attentional) architecture.

The authors tested performance along many different axes. A structural variation was the difference between an encoder-decoder architecture and a language model one.

https://i.imgur.com/x4AOkLz.png

In both cases, you take in text and predict text, but in an encoder-decoder, you have separate models that operate on the input and output, whereas in a language model, it's all seen as part of a single continuous sequence. They also tested variations in what pretraining objective is used. The most common is simple language modeling, where you predict words in a sentence given prior or surrounding ones, but, inspired by the success of BERT, they also tried a number of denoising objectives, where an original sentence was corrupted in some way (by dropping tokens and replacing them with either masks, nothing, or random tokens, dropping individual words vs contiguous spans of words) and then the model had to predict the actual original sentence (a toy sketch of span corruption appears at the end of this summary).

https://i.imgur.com/b5Eowl0.png

Finally, they performed testing as to the effect of dataset size and number of training steps. Some interesting takeaways:

- In almost all tests, the encoder-decoder architecture, where you separately build representations of your input and output text, performs better than a language model structure.
This is still generally (though not as consistently) true if you halve the number of parameters in the encoder-decoder, suggesting that there's some structural advantage there beyond just additional parameter count.
- A denoising, BERT-style objective works consistently better than a language modeling one. Within the set of different kinds of corruption, none work obviously and consistently better across tasks, though some have a particular advantage at a given task, and some are faster to train with due to different lengths of output text.
- Unsurprisingly, more data and bigger models both lead to better performance. Somewhat interestingly, training with less data but the same number of training iterations (such that you see the same data multiple times) seems to be fine up to a point. This potentially gestures at an ability to train over a dataset a higher number of times without being as worried about overfitting.
- Also somewhat unsurprisingly, training on a dataset that filters out HTML, random lorem-ipsum web text, and bad words performs meaningfully better than training on one that doesn't.
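As flagged above, here is a toy sketch of span corruption, the BERT-style denoising variant that drops contiguous spans (my own simplification, not the paper's preprocessing; the sentinel strings and span parameters are illustrative):

```python
import random

def corrupt_spans(tokens, num_spans=2, span_len=2, seed=0):
    """Drop `num_spans` non-overlapping spans of `span_len` tokens, replace each
    with a sentinel, and build a target listing each sentinel followed by the
    tokens it hid, so the model must reconstruct what was removed."""
    rng = random.Random(seed)
    candidates = list(range(len(tokens) - span_len + 1))
    rng.shuffle(candidates)
    starts = []
    for c in candidates:  # greedily pick non-overlapping span starts
        if all(abs(c - s) >= span_len for s in starts):
            starts.append(c)
        if len(starts) == num_spans:
            break
    source, target, cursor = [], [], 0
    for i, start in enumerate(sorted(starts)):
        sentinel = f"<extra_id_{i}>"
        source += list(tokens[cursor:start]) + [sentinel]
        target += [sentinel] + list(tokens[start:start + span_len])
        cursor = start + span_len
    source += list(tokens[cursor:])
    return " ".join(source), " ".join(target)

# For example, if the sampled spans were "for inviting" and "last week", then
# corrupt_spans("Thank you for inviting me to your party last week".split())
# would give source "Thank you <extra_id_0> me to your party <extra_id_1>"
# and target "<extra_id_0> for inviting <extra_id_1> last week".
```

The same text-in/text-out interface covers the downstream tasks too: a classification example becomes a prefixed input sentence whose target is the name of the correct class, written out as a word.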
[link]
Coming from the perspective of the rest of machine learning, a somewhat odd thing about reinforcement learning that often goes unnoticed is the fact that, in basically all reinforcement learning, performance of an algorithm is judged by its performance on the same environment it was trained on. In the parlance of ML writ large: training on the test set. In RL, most of the focus has historically been on whether automatic systems would be able to learn a policy from the state distribution of a single environment, already a fairly hard task. But, now that RL has had more success in the single-environment case, there comes the question: how can we train reinforcement learning algorithms that don't just perform well on a single environment, but over a range of environments?

One lens onto this question is that of meta-learning, but this paper takes a different approach, and looks at how straightforward regularization techniques pulled from the land of supervised learning can (or can't straightforwardly) be applied to reinforcement learning. In general, the regularization techniques discussed here are all ways of reducing the capacity of the model, and preventing it from overfitting. Some ways to reduce capacity are:

- Apply L2 weight penalization
- Apply dropout, which handicaps the model by randomly zeroing out neurons
- Use Batch Norm, which uses noisy batch statistics, and increases randomness in a way that, similar to above, deteriorates performance
- Use an information bottleneck: similar to a VAE, this approach works by learning some compressed representation of your input, p(z|x), and then predicting your output off of that z, in a way that incentivizes your z to be informative (because you want to be able to predict y well) but also penalizes too much information being put in it (because you penalize differences between your learned p(z|x) distribution and an unconditional prior p(z)). This pushes your model to use its conditional-on-x capacity wisely, and only learn features if they're quite valuable in predicting y

However, the paper points out that there are some complications in straightforwardly applying these techniques to RL. The central one is the fact that in (most) RL, the distribution of transitions you train on comes from prior iterations of your policy. This means that a noisier and less competent policy will also leave you with less data to train on. Additionally, using a noisy policy can increase variance, both by making your trained policy more different than your rollout policy (in an off-policy setting) and by making your estimate of the value function higher-variance, which is problematic because that's what you're using as a target training signal in a temporal difference framework.

The paper is a bit disconnected in its connection between justification and theory, and makes two broad, mostly distinct proposals:

1. The most successful (though also the one least directly justified by the earlier-discussed theoretical difficulties of applying regularization in RL) is an information bottleneck ported into an RL setting. It works almost the same as the classification-model one, except that you're trying to increase the value of your actions given a compressed-from-state representation z, rather than trying to increase your ability to correctly predict y. The justification given here is that it's good to incentivize RL algorithms in particular to learn simpler, more compressible features, because they often have such poor data and training signal earlier in training (a minimal sketch of the bottleneck term appears at the end of this summary).
2. SNI (Selective Noise Injection) works by only applying the stochastic aspects of regularization (sampling from z in an information bottleneck, applying different dropout masks, etc.) to certain parts of the training procedure. In particular, the rollout used to collect data is non-stochastic, removing the issue of noisiness impacting the data that's collected. They then do an interesting thing where they calculate a weighted mixture of the policy update with a deterministic model, and the update with a stochastic one. The best performing of these that they tested seems to have been a 50/50 split. This is essentially just a knob you can turn on stochasticity, to trade off between the regularizing effect of noise and its variance-increasing negative effect.

https://i.imgur.com/fi0dHgf.png
https://i.imgur.com/LLbDaRw.png

Based on my read of the experiments in the paper, the most impressive thing here is how well their information bottleneck mechanism works as a way to improve generalization, compared to both the baseline and other regularization approaches. It does look like there's some additional benefit to SNI, particularly in the CoinRun setting, but very little in the MultiRoom setting, and in general the difference is less dramatic than the difference from using the information bottleneck.
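To make the information bottleneck term concrete, here is a minimal sketch of the extra loss it adds (my own illustration, not the paper's implementation; `mu` and `log_var` are assumed outputs of an encoder head over the state, and `beta` is a hypothetical weighting coefficient):

```python
import torch

def information_bottleneck_terms(mu, log_var):
    """Sample a compressed state representation z ~ N(mu, sigma^2) with the
    reparameterization trick, and return it with the penalty
    KL( N(mu, sigma^2) || N(0, I) ), which discourages storing more information
    about the state than the policy/value heads actually need."""
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)  # reparameterized sample
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=-1)
    return z, kl

# Hypothetical use inside an actor-critic objective:
#   z, kl = information_bottleneck_terms(mu, log_var)
#   loss = policy_loss(z) + value_loss(z) + beta * kl.mean()
```

As I read the SNI part of the paper, the corresponding deterministic pass would simply use z = mu (no sampling, no dropout), and the final policy-gradient update would mix the stochastic and deterministic versions with a weighting like the 50/50 split mentioned above.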