This paper designs a generalized multimodal architecture, UNITER, intended to solve a wide range of vision-and-language tasks.
Concretely, the authors pre-train the model on four main tasks, Masked Language Modeling (MLM), Image-Text Matching (ITM), Word-Region Alignment (WRA), and Masked Region Modeling (MRM), and evaluate it on various downstream tasks, including Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), and NLVR.
As shown in Fig 1, UNITER first encodes image regions (visual features and bounding-box features) and textual words (tokens and positions) into a common embedding space with an Image Embedder and a Text Embedder.
Then, a Transformer module is applied to learn generalizable contextualized embeddings for each region and each word.
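A minimal sketch of this two-embedder design (all dimensions, weight matrices, and the bounding-box feature layout here are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64            # shared embedding dimension (illustrative)
N_REGIONS = 4     # detected image regions
N_TOKENS = 6      # word-piece tokens

# Image Embedder: project visual features (e.g. region features from a
# detector) and bounding-box features into the common space, then sum.
visual_feats = rng.normal(size=(N_REGIONS, 2048))
bbox_feats = rng.normal(size=(N_REGIONS, 7))      # assumed 7-d box encoding
W_v = rng.normal(size=(2048, D)) * 0.02           # hypothetical projection
W_b = rng.normal(size=(7, D)) * 0.02
region_emb = visual_feats @ W_v + bbox_feats @ W_b

# Text Embedder: token embeddings plus learned position embeddings.
token_ids = rng.integers(0, 30000, size=N_TOKENS)
tok_table = rng.normal(size=(30000, D)) * 0.02
pos_table = rng.normal(size=(512, D)) * 0.02
word_emb = tok_table[token_ids] + pos_table[np.arange(N_TOKENS)]

# The two sequences live in the same D-dim space and are concatenated,
# so a Transformer can attend jointly over regions and words.
joint_input = np.concatenate([region_emb, word_emb], axis=0)
print(joint_input.shape)  # (10, 64)
```

The key point is that after the two embedders, regions and words are interchangeable sequence elements for the Transformer's self-attention.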
The contributions are two-fold:
(1) Masked language/region modeling is conditioned on the full observation of the other modality (the image for MLM, the text for MRM), rather than applying joint random masking to both modalities.
(2) A novel WRA pre-training task is introduced, using Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions.
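The conditional masking in (1) can be illustrated with a toy sketch (the helper below is hypothetical, not the paper's code): in each pre-training step, positions are masked in only one modality while the other stays fully observed.

```python
import random

random.seed(0)

def conditional_masks(n_tokens, n_regions, mask_text, p=0.15):
    """Illustrative helper: mask positions in ONE modality only.

    mask_text=True  -> MLM-style step (words masked, regions fully visible)
    mask_text=False -> MRM-style step (regions masked, words fully visible)
    """
    text_mask = [mask_text and random.random() < p for _ in range(n_tokens)]
    region_mask = [(not mask_text) and random.random() < p
                   for _ in range(n_regions)]
    return text_mask, region_mask

# MLM step: some words may be masked, no region ever is.
t_mask, r_mask = conditional_masks(8, 5, mask_text=True)
assert not any(r_mask)

# MRM step: some regions may be masked, no word ever is.
t_mask, r_mask = conditional_masks(8, 5, mask_text=False)
assert not any(t_mask)
```

Under joint random masking, by contrast, both lists could contain masked positions in the same step, so a masked word might have to be predicted without its corresponding (also masked) region.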
Intuitively, OT-based learning aims to optimize distribution matching by minimizing the cost of transporting one distribution to another. In this context, the goal is to minimize the cost of transporting the embeddings from image regions to words in a sentence (and vice versa), thus optimizing toward better cross-modal alignment.
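A sketch of this OT objective, using standard entropic-regularization (Sinkhorn) iterations on a cosine-distance cost matrix between word and region embeddings; this is an illustrative stand-in, not the paper's exact solver, and all sizes and hyperparameters below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sinkhorn_cost(words, regions, eps=0.1, n_iters=100):
    """Approximate OT cost between word and region embeddings
    via Sinkhorn iterations with entropic regularization eps."""
    # Cosine-distance cost matrix, entries in [0, 2].
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    C = 1.0 - w @ r.T                      # (n_words, n_regions)

    # Uniform marginals: each word / region carries equal mass.
    a = np.full(len(words), 1.0 / len(words))
    b = np.full(len(regions), 1.0 / len(regions))

    K = np.exp(-C / eps)                   # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):               # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]        # transport plan, sums to 1
    return float((T * C).sum())            # transport cost (WRA-style loss)

words = rng.normal(size=(6, 64))
regions = rng.normal(size=(4, 64))
print(sinkhorn_cost(words, regions))       # scalar transport cost
```

Minimizing this cost pushes each word's embedding toward the region(s) the plan couples it with, which is what "fine-grained alignment" means here: the transport plan itself acts as a soft word-region matching.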