ShortScience.org - Making Science Accessible!

Fast Model Editing at Scale
Eric Mitchell and Charles Lin and Antoine Bosselut and Chelsea Finn and Christopher D. Manning
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.LG, cs.AI, cs.CL

[link] Summary by Joseph Paul Cohen 1 year ago

The goal of this work is to edit a model's weights at test time, given new edit pairs ($x_e, y_e$). The authors achieve this by learning a "model editor network" that takes the fine-tuning gradient computed from ($x_e, y_e$) and transforms it into a weight update.

$$ f(\nabla W_l) \rightarrow \tilde\nabla W_l$$ 

The editor network is conditioned on the layer whose update it predicts, using a FiLM-style scale and shift.
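
As a rough illustration of FiLM-style conditioning, the sketch below applies a per-layer elementwise scale and shift to activations that come from a shared network body. The layer names, the parameter values, and the `film` helper are all hypothetical; this is not the paper's implementation, only the general mechanism.

```python
from typing import Dict, List

def film(hidden: List[float], scale: List[float], shift: List[float]) -> List[float]:
    """Apply a FiLM-style elementwise affine modulation: scale * h + shift."""
    return [s * h + b for h, s, b in zip(hidden, scale, shift)]

# Hypothetical conditioning parameters: one (scale, shift) pair per edited layer.
layer_params: Dict[str, Dict[str, List[float]]] = {
    "layer_a": {"scale": [1.0, 0.5, 2.0], "shift": [0.0, 0.1, -0.1]},
    "layer_b": {"scale": [0.9, 1.1, 1.0], "shift": [0.2, 0.0, 0.0]},
}

# Made-up activations produced by the shared part of the editor network.
shared_hidden = [0.4, -1.0, 0.25]

for name, params in layer_params.items():
    # The same hidden vector is modulated differently for each layer.
    print(name, film(shared_hidden, params["scale"], params["shift"]))
```

The point of the design is that most parameters can be reused for every layer, while the cheap scale and shift vectors let the output differ per layer.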

The editor network is trained on a small set of examples ($D^{tr}_{edit}$). The paper states that this dataset contains edits similar to "the types of edits that will be made", which is interesting because it limits how well the editor can generalize to other kinds of edits.

An extra loss term prevents unintended changes to the model's predictions on other inputs (called $x_{loc}$). It penalizes the KL divergence between the pre-edit and post-edit predictive distributions, so that the predictions on $x_{loc}$ stay the same:
$$L_{loc} = KL(p_{\theta_W}(\cdot | x_{loc}) \| p_{\theta_{\tilde{W}}}(\cdot | x_{loc}))$$
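
To make the locality term concrete, here is a minimal sketch of a KL divergence between two discrete predictive distributions on a single $x_{loc}$. The probability values are made up, and the `kl_divergence` helper with its `eps` smoothing is an illustrative assumption, not code from the paper.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )

# Made-up predictions on x_loc before the edit and after two candidate edits.
p_before = [0.7, 0.2, 0.1]
p_after_unchanged = [0.7, 0.2, 0.1]
p_after_drifted = [0.4, 0.4, 0.2]

print(kl_divergence(p_before, p_after_unchanged))  # no penalty when predictions match
print(kl_divergence(p_before, p_after_drifted))    # positive penalty when they drift
```

An edit that leaves the predictions on $x_{loc}$ untouched incurs no penalty, while any drift is penalized in proportion to how far the distribution moved.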
 

Some intuition for why this works: the editor network $f$ approximates the full-dataset gradient from just a single example, so it is more efficient. It can shrink the updates to elements of the weight matrix whose changes were disruptive to the loss during training, information that would otherwise require many training examples to uncover.

Object-Centric Learning with Slot Attention
Francesco Locatello and Dirk Weissenborn and Thomas Unterthiner and Aravindh Mahendran and Georg Heigold and Jakob Uszkoreit and Alexey Dosovitskiy and Thomas Kipf
arXiv e-Print archive - 2020 via Local arXiv
Keywords: cs.LG, cs.CV, stat.ML

[link] Summary by ngthanhtinqn 2 years ago

The Slot Attention module maps a set of N input feature vectors to a set of K output vectors, referred to as slots. Each slot can, for example, describe an object or an entity in the input.
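
The N-to-K mapping can be sketched with a single, heavily simplified update step. This version keeps only the core idea (attention weights are normalized over slots, so slots compete for each input, and each slot then takes a weighted mean of the inputs) and leaves out the learned projections, normalization layers, and recurrent update of the full module. All function names and the toy data are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]

def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def slot_attention_step(inputs: List[Vector], slots: List[Vector]) -> List[Vector]:
    """One simplified update: slots compete for inputs, then average what they won."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[i][k]: share of input i claimed by slot k (softmax over the K slots).
    attn = [softmax([scale * dot(x, s) for s in slots]) for x in inputs]
    new_slots = []
    for k in range(len(slots)):
        weights = [attn[i][k] for i in range(len(inputs))]
        total = sum(weights)
        # Weighted mean of the inputs, using this slot's attention weights.
        new_slots.append(
            [sum(w * x[d] for w, x in zip(weights, inputs)) / total for d in range(dim)]
        )
    return new_slots

# N = 4 toy inputs forming two groups, K = 2 slots.
inputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
slots = [[1.0, 0.0], [0.0, 1.0]]
print(slot_attention_step(inputs, slots))
```

Because the softmax runs over slots rather than over inputs, each input's attention is split among the slots, which is what pushes different slots toward different parts of the input.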

https://i.imgur.com/81nh508.png

CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
Shreyank N Gowda and Laura Sevilla-Lara and Frank Keller and Marcus Rohrbach
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.CV

[link] Summary by ngthanhtinqn 2 years ago

This paper tackles zero-shot action recognition using a cluster-based representation.
Concretely, it uses REINFORCE, a reinforcement learning algorithm, to optimize the cluster centroids, with the classification scores serving as the reward signal.
https://i.imgur.com/gWyJLX0.png

Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang and Li Dong and Wenhui Wang and Yaru Hao and Saksham Singhal and Shuming Ma and Tengchao Lv and Lei Cui and Owais Khan Mohammed and Qiang Liu and Kriti Aggarwal and Zewen Chi and Johan Bjorck and Vishrav Chaudhary and Subhojit Som and Xia Song and Furu Wei
arXiv e-Print archive - 2023 via Local arXiv
Keywords: cs.CL, cs.CV

[link] Summary by ngthanhtinqn 2 years ago

This paper is about Multimodal Large Language Models (MLLMs). The authors propose an MLLM called KOSMOS-1 that can perform instruction following, VQA, IQ tests, visual dialog, and more.

https://i.imgur.com/9P3Vuse.png

https://i.imgur.com/HcYtbdD.png

The model's input consists of image-caption pairs and interleaved image-text data.
https://i.imgur.com/LL4HiM3.png

The input data is fed into an embedding module that encodes it into vectors, which are then fed into a Transformer decoder. The decoder predicts the next token based on the previous context.

I think this large model's downside is that it can only generate text, not images.

Women also Snowboard: Overcoming Bias in Captioning Models (Extended Abstract)
Lisa Anne Hendricks and Kaylee Burns and Kate Saenko and Trevor Darrell and Anna Rohrbach
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV

[link] Summary by ngthanhtinqn 2 years ago

This paper aims to reduce gender bias in captioning models. Concretely, traditional captioning models tend to rely on contextual cues rather than on the person's appearance, so they often predict incorrect gender words for images that contain people.

To reduce gender bias, they introduce a new $Equalizer$ model that adds two losses:

(1) Appearance Confusion Loss: When it is hard to tell whether there is a man or a woman in the image, the model should assign roughly equal probability to predicting a man or a woman.

To define this loss, they first define a confusion function, which indicates how likely the next predicted word is to belong to the set of woman words or the set of man words.

https://i.imgur.com/oI6xswy.png
where $\tilde{w}_{t}$ is the next predicted word, $G_{w}$ is the set of woman words, and $G_{m}$ is the set of man words.
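
The formula itself is only available as an image here, so the sketch below is an assumption about its form: it takes the confusion function to be the absolute gap between the probability mass on woman words and the mass on man words for the next predicted word. The word sets and probabilities are made up for illustration.

```python
from typing import Dict, Set

def confusion(
    next_word_probs: Dict[str, float],
    woman_words: Set[str],
    man_words: Set[str],
) -> float:
    """Absolute gap between the probability mass on woman words and on man words.

    Assumed form, not copied from the paper: 0 means the model is evenly split
    between the two sets, larger values mean it favors one of them.
    """
    p_woman = sum(next_word_probs.get(w, 0.0) for w in woman_words)
    p_man = sum(next_word_probs.get(w, 0.0) for w in man_words)
    return abs(p_woman - p_man)

# Illustrative (hypothetical) word sets G_w and G_m.
G_w = {"woman", "girl"}
G_m = {"man", "boy"}

balanced = {"woman": 0.3, "man": 0.3, "person": 0.4}
skewed = {"woman": 0.1, "man": 0.6, "person": 0.3}

print(confusion(balanced, G_w, G_m))  # evenly split between the two sets
print(confusion(skewed, G_w, G_m))    # strongly favors man words
```

Under this reading, minimizing the function on images where gender evidence is unclear pushes the two probability masses toward each other.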

The loss is defined as the normal cross-entropy loss multiplied by the confusion function.
https://i.imgur.com/kLpROse.png


(2) Confident Loss: When it is easy to recognize a man or a woman in an image, this loss encourages the model to predict gender words correctly.

This loss relies on two in-confidence functions, one for man words and one for woman words. The two functions have the same form.

https://i.imgur.com/4stFjac.png

This function says that if the model is confident when predicting a gender (e.g., woman), then the value of the in-confidence function for woman words should be low.

The confident loss is then defined as follows:
https://i.imgur.com/1pRgDir.png
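
Since both formulas above are only available as images, the following is a sketch under stated assumptions rather than the paper's exact definition. It assumes the in-confidence function is the ratio of the probability mass on the other gender's words to the mass on the target gender's words, and that the loss picks the function matching the ground-truth word. The word sets, `eps`, and helper names are hypothetical.

```python
from typing import Dict, Set

WOMAN_WORDS: Set[str] = {"woman", "girl"}  # illustrative G_w
MAN_WORDS: Set[str] = {"man", "boy"}       # illustrative G_m

def in_confidence(
    probs: Dict[str, float],
    target_words: Set[str],
    other_words: Set[str],
    eps: float = 1e-6,
) -> float:
    """Assumed form: mass on the other gender's words over mass on the target's."""
    p_target = sum(probs.get(w, 0.0) for w in target_words)
    p_other = sum(probs.get(w, 0.0) for w in other_words)
    return p_other / (p_target + eps)

def confident_term(probs: Dict[str, float], true_word: str) -> float:
    """Per-word loss term: only gendered ground-truth words contribute."""
    if true_word in WOMAN_WORDS:
        return in_confidence(probs, WOMAN_WORDS, MAN_WORDS)
    if true_word in MAN_WORDS:
        return in_confidence(probs, MAN_WORDS, WOMAN_WORDS)
    return 0.0

confident = {"woman": 0.90, "man": 0.05, "snowboard": 0.05}
unsure = {"woman": 0.45, "man": 0.45, "snowboard": 0.10}

print(confident_term(confident, "woman"))  # low: the model clearly prefers woman words
print(confident_term(unsure, "woman"))     # close to 1: the model cannot decide
```

With this form, a confident and correct gender prediction yields a small term, which matches the behavior described above for the in-confidence function.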