Welcome to ShortScience.org!
[link]
The goal of this work is to edit a model's weights given new edit pairs ($x_e, y_e$) at test time. The authors achieve this by learning a "model editor network" that takes a fine-tuning gradient computed from ($x_e, y_e$) and transforms it into a weight update: $$ f(\nabla W_l) \rightarrow \tilde\nabla W_l$$ The editor network is parameterized by the layer it is predicting for, using a FiLM-style scale and shift. It is trained on a small set of examples ($D^{tr}_{edit}$). The paper states that this dataset contains edits similar to "the types of edits that will be made," which is interesting because it introduces generalization limitations on the potential edits.

An extra loss term is used to prevent unintended changes to the model's behavior on other inputs (called $x_{loc}$). This is achieved with the following KL loss, which keeps those predictions at their original values: $$L_{loc} = KL(p_{\theta_W}(\cdot | x_{loc}) \| p_{\theta_{\tilde{W}}}(\cdot | x_{loc}))$$

Some intuition for why this works: the editor network $f$ approximates the full-dataset gradient from just a single example, so it is more efficient. It can shrink the changes to elements of the weight matrix that were disruptive to the loss during training, information that would otherwise require many training examples to uncover.
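To make the editor idea concrete, here is a minimal numpy sketch of a network $f$ that maps a raw fine-tuning gradient for one layer to a transformed update, modulated by per-layer FiLM parameters. All names (`edit_gradient`, `W_hidden`, the specific transform) are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def edit_gradient(grad, gamma, beta, W_hidden):
    """Toy editor network f: maps a raw fine-tuning gradient for one layer
    to a transformed weight update. gamma/beta are the layer-specific
    FiLM-style scale and shift; W_hidden is a shared projection."""
    h = np.tanh(grad @ W_hidden)   # shared nonlinear transform of the gradient
    h = gamma * h + beta           # FiLM: per-layer scale and shift
    return h @ W_hidden.T          # project back to the layer's weight shape

d_out, d_in = 4, 6
grad = rng.normal(size=(d_out, d_in))        # fine-tuning gradient from (x_e, y_e)
W_hidden = rng.normal(size=(d_in, d_in)) * 0.1
gamma = np.ones(d_in)                        # layer-specific FiLM scale
beta = np.zeros(d_in)                        # layer-specific FiLM shift

update = edit_gradient(grad, gamma, beta, W_hidden)
assert update.shape == grad.shape            # edited update matches the weight shape
```

Because the editor only sees the gradient (not the full weight matrix), the same $f$ can be shared across layers, with only $\gamma, \beta$ differing per layer.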
[link]
The Slot Attention module maps from a set of N input feature vectors to a set of K output vectors referred to as slots. Each vector in this output set can, for example, describe an object or an entity in the input. https://i.imgur.com/81nh508.png
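The distinguishing detail is that the attention softmax is normalized over the slot axis, so slots compete for input features. A simplified single iteration (omitting the GRU and MLP updates of the full module; all weight names are illustrative):

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(inputs, slots, Wq, Wk, Wv):
    """One simplified Slot Attention iteration.
    inputs: (N, d) feature vectors; slots: (K, d) slot vectors."""
    q = slots @ Wq                            # (K, d) queries come from slots
    k = inputs @ Wk                           # (N, d)
    v = inputs @ Wv                           # (N, d)
    logits = k @ q.T / np.sqrt(q.shape[1])    # (N, K)
    attn = softmax(logits, axis=1)            # normalize over SLOTS: slots compete
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # weighted mean over inputs
    return attn.T @ v                         # (K, d) updated slots

rng = np.random.default_rng(0)
N, K, d = 5, 3, 4
inputs = rng.normal(size=(N, d))
slots = rng.normal(size=(K, d))               # slots are sampled, then refined iteratively
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
new_slots = slot_attention_step(inputs, slots, Wq, Wk, Wv)
assert new_slots.shape == (K, d)
```

In the full module this step is repeated a few times, with a GRU and residual MLP refining the slots between iterations.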
[link]
This paper tackles zero-shot action recognition using a cluster-based representation. Concretely, it uses the REINFORCE algorithm, a reinforcement learning method, to optimize the cluster centroids, with the classification scores serving as the reward signal. https://i.imgur.com/gWyJLX0.png
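The core mechanic, optimizing continuous parameters through a non-differentiable reward with REINFORCE, can be sketched in a toy setting. Here the "classification score" is replaced by a stand-in reward (negative distance to a target), and the centroid is perturbed by a Gaussian policy; everything here is a hedged illustration of the score-function gradient, not the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(centroids, reward_fn, lr=0.1, sigma=0.1, n_samples=16):
    """One REINFORCE update on centroids: sample perturbed centroids from a
    Gaussian policy, score them with the (black-box) reward, and move along
    the score-function gradient estimate with a mean baseline."""
    samples, rewards = [], []
    for _ in range(n_samples):
        eps = rng.normal(size=centroids.shape)
        samples.append(eps)
        rewards.append(reward_fn(centroids + sigma * eps))
    baseline = np.mean(rewards)                   # variance-reducing baseline
    grad = np.zeros_like(centroids)
    for eps, r in zip(samples, rewards):
        grad += (r - baseline) * eps / sigma      # grad of log N(c + sigma*eps; c, sigma^2)
    return centroids + lr * grad / n_samples

target = np.array([1.0, -1.0])
reward = lambda c: -np.sum((c - target) ** 2)     # stand-in for a classification score
c = np.zeros(2)
for _ in range(200):
    c = reinforce_step(c, reward)
```

The point of REINFORCE here is that the reward can be any black-box score (e.g., downstream classification accuracy), with no gradient flowing through it.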
[link]
This paper is about Multimodal Large Language Models (MLLMs). The authors propose an MLLM called KOSMOS-1 that can do instruction following, VQA, IQ testing, visual dialogue, etc. https://i.imgur.com/9P3Vuse.png https://i.imgur.com/HcYtbdD.png The model is trained on image-caption pairs and on interleaved data of images and text. https://i.imgur.com/LL4HiM3.png The input data is fed into an embedding module that encodes it into vectors, and the vectors are then fed into a Transformer decoder, which predicts the next token based on the previous context. I think this large model's downside is that it can only predict text, not images.
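The interleaved-input idea amounts to flattening text tokens and image features into one embedding sequence for a decoder-only Transformer. A minimal sketch, assuming a toy embedding table and a stand-in linear image encoder (the decoder itself is omitted; all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
vocab_embed = rng.normal(size=(100, d))   # toy text-token embedding table
img_proj = rng.normal(size=(16, d))       # stand-in image encoder: patch feature -> vector

def embed_interleaved(segments):
    """Flatten interleaved (kind, payload) segments into one embedding
    sequence. kind 'text': list of token ids; kind 'image': (n_patches, 16)
    patch features. The decoder then does ordinary next-token prediction."""
    parts = []
    for kind, payload in segments:
        if kind == "text":
            parts.append(vocab_embed[np.asarray(payload)])
        else:                              # "image"
            parts.append(np.asarray(payload) @ img_proj)
    return np.concatenate(parts, axis=0)   # (seq_len, d)

seq = embed_interleaved([
    ("text", [1, 5, 7]),                   # e.g. "<s> An image of"
    ("image", rng.normal(size=(4, 16))),   # 4 image patches
    ("text", [9, 2]),                      # e.g. "a cat </s>"
])
assert seq.shape == (9, d)
```

Since the output side is only a softmax over the text vocabulary, the model can condition on images but can only emit text, which is the downside noted above.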
[link]
This paper aims to reduce gender bias in captioning models. Traditional captioning models tend to rely on contextual cues, so they often predict incorrect gender words for images that contain people. To reduce this bias, the authors introduce a new Equalizer model with two losses:

(1) Appearance Confusion Loss: when it is hard to tell whether there is a man or a woman in the image, the model should assign similar probabilities to predicting a man or a woman. To define this loss, they first define a confusion function, which indicates how likely the next predicted word is to belong to the set of woman words or the set of man words. https://i.imgur.com/oI6xswy.png Here $\tilde{w}_{t}$ is the next predicted word, $G_{w}$ is the set of woman words, and $G_{m}$ is the set of man words. The loss is then the normal cross-entropy loss multiplied by the confusion function. https://i.imgur.com/kLpROse.png

(2) Confident Loss: when it is easy to recognize a man or a woman in an image, this loss encourages the model to predict gender words confidently and correctly. They also define two in-confidence functions of the same form, one for man words and one for woman words. https://i.imgur.com/4stFjac.png This function says that if the model is confident when predicting a gender (e.g., woman), the value of the in-confidence function for woman words should be low. The confident loss is then defined as follows: https://i.imgur.com/1pRgDir.png
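The shape of these two losses can be sketched with toy probabilities. The exact functional forms below are plausible stand-ins, not the paper's equations: `confusion` is high when predicted mass on woman vs. man words is balanced, and `in_confidence` for a gender is low when the model is confident in that gender:

```python
import numpy as np

# Toy next-word distribution over a 3-word vocab; gendered indices are illustrative.
G_W = [0]        # woman-word indices
G_M = [1]        # man-word indices
EPS = 1e-6

def confusion(p):
    """Stand-in confusion function (assumption, not the paper's exact form):
    close to 1 when woman-word and man-word mass is balanced, near 0 otherwise."""
    pw, pm = p[G_W].sum(), p[G_M].sum()
    return 1.0 - abs(pw - pm) / (pw + pm + EPS)

def in_confidence(p, G_target, G_other):
    """Stand-in in-confidence for the target gender's words: low when the
    model puts much more mass on the target gender than on the other."""
    return p[G_other].sum() / (p[G_target].sum() + EPS)

def appearance_confusion_loss(p, true_idx):
    # Cross-entropy on the gender word, weighted by the confusion function,
    # applied on images where gender evidence has been masked out.
    return confusion(p) * (-np.log(p[true_idx] + EPS))

p_balanced = np.array([0.45, 0.40, 0.15])   # nearly balanced woman/man prediction
assert confusion(p_balanced) > 0.8          # model is (rightly) confused
p_confident = np.array([0.90, 0.05, 0.05])  # confidently "woman"
assert in_confidence(p_confident, G_W, G_M) < 0.1
```

Together, the two terms push confusion up when visual gender evidence is removed, and push in-confidence down when the evidence is present.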