Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when the data needed to train models exists on users' devices and would require a lot of bandwidth to transfer to a centralized server.

Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by:

- Sending a copy of the current model to each device/client
- Calculating a gradient update to be applied on top of that current model, given a batch of data sampled from the client's device
- Sending that gradient back to the central server
- Averaging those gradients and applying them all at once to a central model

The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the central server, and the server performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be averaged with other clients' models. Specifically, their algorithm, FedAvg, works by:

- Dividing the data on a given device into batches of size B
- Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server
- Repeating this for E epochs

(A minimal sketch of this client/server loop appears at the end of this summary.)

Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, where a given user and device might cover very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met.

The authors empirically test whether these sorts of pathological cases arise by simulating a federated learning procedure over MNIST and over a language model trained on Shakespeare, sweeping over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits).

https://i.imgur.com/xq9vi8S.png

They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communication than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device). The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases.
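Here is a minimal, self-contained sketch of the FedAvg round structure described above, using a toy linear-regression model and made-up data. The model, data, and hyperparameter values are illustrative assumptions, not the paper's setup; the point is the structure: local SGD for E epochs with batch size B, then parameter averaging weighted by each client's data size.

```python
# Sketch of FedAvg on a toy linear-regression problem (illustrative only).
import numpy as np

def local_update(w, X, y, B, E, lr):
    """Run E epochs of mini-batch SGD (batch size B) on one client's data, starting from w."""
    w = w.copy()
    n = len(X)
    for _ in range(E):
        order = np.random.permutation(n)
        for start in range(0, n, B):
            idx = order[start:start + B]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # squared-error gradient
            w -= lr * grad
    return w

def fedavg_round(w_global, clients, B=10, E=5, lr=0.005):
    """One communication round: broadcast w_global, train locally, average weighted by data size."""
    updates, sizes = [], []
    for X, y in clients:                      # each element is one device's local dataset
        updates.append(local_update(w_global, X, y, B, E, lr))
        sizes.append(len(X))
    weights = np.array(sizes) / sum(sizes)
    return sum(wk * uk for wk, uk in zip(weights, updates))

# Toy usage: three "devices" with differently-shifted input distributions (non-IID-ish).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for shift in (0.0, 1.0, 5.0):
    X = rng.normal(shift, 1.0, size=(100, 2))
    clients.append((X, X @ true_w + rng.normal(0, 0.1, size=100)))

w = np.zeros(2)
for _ in range(20):                           # 20 communication rounds
    w = fedavg_round(w, clients)
print(w)                                      # should move toward true_w
```

Note how the server never sees raw data, only locally-trained parameter vectors; FedSGD would instead ship one gradient per round, requiring many more rounds of communication for the same progress.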
In certain classes of multi-agent cooperation games, it's useful for agents to be able to coordinate on future actions, which is an obvious use case for having a communication channel between the two players. However, prior work in multi-agent RL has shown that it's surprisingly hard to train agents that (1) consistently learn to use a communication channel in a way that is informative rather than random, and (2) if they do use communication, can come to a common grounding on the meaning of symbols, to use them in an effective way.

This paper suggests the straightforward and clever approach of, instead of just having agents communicate using arbitrary vectors produced as part of a policy, having those communication vectors be directly linked to the content of an agent's observations. Specifically, this is done by taking the encoding of the image that is used for making policy decisions, and passing that encoding through an autoencoder, using the bottleneck at the middle of the autoencoder as the communication vector sent to other agents. This structure incentivizes the agent to generate communication vectors that are intrinsically grounded in the observation, enforcing a certain level of consistency that the authors hope makes it easier for the other agent to follow and interpret the communication.

https://i.imgur.com/u9OAZm8.png

Empirically, there seems to be fairly compelling evidence that this autoencoder-based form of grounding is more stable, and thus more mutually learnable, than learning from RL alone. The authors even found that adding RL training to the autoencoder-based training deteriorated performance.
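Roughly, the mechanism could look like the following PyTorch-style sketch. The layer sizes, the single-linear-layer encoder/decoder, and detaching the reconstruction target are my own assumptions for illustration, not the paper's exact architecture; the essential idea is that the message is the autoencoder bottleneck over the same observation encoding the policy uses.

```python
# Sketch of autoencoder-grounded communication (illustrative architecture, not the paper's).
import torch
import torch.nn as nn

class GroundedCommAgent(nn.Module):
    def __init__(self, obs_dim, enc_dim=128, msg_dim=16, n_actions=5):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, enc_dim), nn.ReLU())
        # Autoencoder over the observation encoding; the bottleneck is the outgoing message.
        self.ae_encoder = nn.Linear(enc_dim, msg_dim)
        self.ae_decoder = nn.Linear(msg_dim, enc_dim)
        # Policy conditions on the agent's own encoding plus the other agent's message.
        self.policy = nn.Linear(enc_dim + msg_dim, n_actions)

    def forward(self, obs, incoming_msg):
        h = self.obs_encoder(obs)
        msg = self.ae_encoder(h)                          # communication vector sent to the other agent
        recon = self.ae_decoder(msg)
        ae_loss = ((recon - h.detach()) ** 2).mean()      # reconstruction loss grounds the message
        logits = self.policy(torch.cat([h, incoming_msg], dim=-1))
        return logits, msg, ae_loss

# Hypothetical usage: each step, agents exchange bottleneck messages.
agent = GroundedCommAgent(obs_dim=64)
obs = torch.randn(8, 64)                                  # batch of (flattened) observations
logits, msg, ae_loss = agent(obs, incoming_msg=torch.zeros(8, 16))
```

Because the message is trained to reconstruct the observation encoding rather than to maximize reward, it stays informative and consistent even early in training, which is what makes it easier for the listener to learn to interpret it.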
This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model. The basic premise is:

- Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, use a similar strategy but with waveform patches. For text, the normal per-token embedding is done. Combine this tokenization with a modality-specific positional encoding.
- Run all of these embeddings through a Transformer with shared weights for all three modalities
- Take the final projected CLS representation for each of the video patches, and perform contrastive learning against both an aligned audio patch and an aligned text region. This contrastive loss is calculated by, for each pair, projecting into a shared space (video and audio each project into a shared audio-video space, video and text each project into a shared video-text space, with pair-specific projection weights), and then doing a normal contrastive setup where positive pairs come either from a direct alignment of audio and video, or from a soft "nearest neighbors" alignment of text with video, to account for not all video snippets containing text

One technique that was fun in its simplicity was the authors' DropToken strategy, which basically just says: "hey, we have a high-resolution input, what if we just randomly dropped tokens within our sequence to reduce the S^2 sequence-length cost?" This obviously leads to some performance cost, but they found it not very dramatic.

Experimental results were all-around impressive, achieving SOTA on a number of modality-specific tasks (action prediction in video, audio prediction) with their cross-modality model.
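The DropToken idea is simple enough that a short sketch may help. The tensor shapes, drop rate, and the choice to keep the surviving tokens in their original order are my own assumptions for illustration.

```python
# Sketch of DropToken: randomly drop a fraction of input tokens to cut the quadratic attention cost.
import torch

def drop_tokens(tokens, drop_rate=0.5):
    """tokens: (batch, seq_len, dim). Keep a random subset of token positions per example."""
    batch, seq_len, dim = tokens.shape
    n_keep = max(1, int(seq_len * (1.0 - drop_rate)))
    # Sample, per example, which positions to keep (without replacement), then restore order.
    keep_idx = torch.rand(batch, seq_len).argsort(dim=1)[:, :n_keep]
    keep_idx = keep_idx.sort(dim=1).values
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))

# Hypothetical usage on a batch of video patch embeddings.
tokens = torch.randn(2, 196, 768)
kept = drop_tokens(tokens, drop_rate=0.5)
print(kept.shape)                       # torch.Size([2, 98, 768])
```

Halving the sequence length roughly quarters the self-attention cost, which is why the accuracy hit the authors report can be a worthwhile trade at high input resolutions.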
This work expands on prior techniques for designing models that can both be stored using fewer parameters, and also execute using fewer operations and less memory, both of which are key desiderata for having trained machine learning models be usable on phones and other personal devices. The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of Depthwise and Pointwise convolutions, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct steps. In this paper, they continue to use this basic Depthwise infrastructure, but also add a new design element: the inverted-residual linear bottleneck.

The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically, and naively, one could just try to use lower-dimensional internal representations that match the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities kill information (because of the region where all inputs are mapped to zero), and so having layers contain only the number of dimensions needed for the manifold would mean that you end up with too few dimensions after the ReLU information loss. However, you need to have non-linearities somewhere in the network in order to be able to learn complex, non-linear functions. So, the authors suggest a method to mostly use smaller-dimensional representations internally, but still maintain ReLUs and the network's needed complexity.

https://i.imgur.com/pN4d9Wi.png

- A lower-dimensional output is "projected up" into a higher-dimensional space
- A ReLU is applied on this higher-dimensional layer
- That layer is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss
- A residual connection is added between the lower-dimensional representations at the beginning and end of the expansion

This way, we still maintain the network's non-linearity, but also replace some of the network's higher-dimensional layers with lower-dimensional linear ones.
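For concreteness, here is a rough PyTorch sketch of a stride-1 inverted-residual block along the lines described above. The expansion factor, ReLU6 activations, and the depthwise 3x3 convolution in the expanded space follow the general MobileNetV2 design, but treat the exact sizes here as illustrative assumptions rather than the paper's configuration.

```python
# Sketch of an inverted-residual linear-bottleneck block (stride-1, equal-channel case).
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1) Project the low-dimensional input up to a higher-dimensional space
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),                       # 2) non-linearity applied in the wide space
            # Depthwise 3x3 conv in the expanded space (the MobileNet "spatial" step)
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3) Project back down with a *linear* activation (no ReLU) to avoid information loss
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # 4) Residual connection between the narrow input and narrow output
        return x + self.block(x)

# Hypothetical usage.
block = InvertedResidual(channels=32)
x = torch.randn(1, 32, 56, 56)
print(block(x).shape)                                     # torch.Size([1, 32, 56, 56])
```

The "inverted" part is that the residual connects the narrow ends of the block rather than the wide middle, so the expensive high-dimensional tensors never need to be carried between blocks.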
I'm a little embarrassed that I'm only just now reading what seems like a fairly important paper from a year and a half ago, but, in my defense, March 2020 was not the best time for keeping up with the literature in a disciplined way.

Anyhow, musings aside: this paper proposes an alternative training procedure for large language models, which the authors claim results in models that reach strong performance more efficiently than previous BERT, XLNet, or RoBERTa baselines. As some background context, the previously-canonical Masked Language Model (MLM) task works by:

- Replacing some percentage of tokens with a [MASK] indicator
- Using the final-layer representation at the locations of those [MASK]s to predict the true input token
- Using as a training signal the maximum likelihood of that prediction, i.e. how high the model's predicted probability is on the true input

The ELECTRA authors argue that there are a few notable disadvantages to this structure, if your goal is to train useful representations for downstream tasks. Firstly, your loss only consists of information (i.e. the true token) from the tokens you randomly masked, so a good amount of the data goes in some sense unused (except as context). Secondly, learning a full generative model of language requires a lot of data and training time, and it may not be all that beneficial for performance on your downstream tasks of interest. As an alternative, they propose (sketched in code after this summary):

- Co-learning a (small) generator, trained in typical MLM fashion, alongside a discriminator. Randomly select tokens from the input to replace with fake tokens drawn from the distribution of the generator
- The goal of the discriminator is to distinguish the true tokens from the fake ones. (Minor note: if the generator happens to get lucky and generate the real token, that's counted as a "real" rather than "fake" token, even though it was generated by a generator.) This uses more of the training data in the loss, since you can ask "real or fake" for every token in the input data, not (obviously) just the ones that are actually fake
- An important note for those familiar with GANs is that the generator isn't trained to confuse the discriminator (as is GAN-standard), but is simply trained with its own maximum likelihood loss, independent of the discriminator's performance

They argue, and show fairly convincingly, that ELECTRA is able to reach a higher efficiency-to-performance trade-off curve compared to BERT - matching the performance of previous models with notably less training, and outperforming them with comparable amounts of training.

They go on to perform a few ablations, some of which felt more convincing than others. The most confusing ablation, which I'm not sure if I just misunderstood, was meant to ask how much of the value of ELECTRA came from calculating its loss over all the tokens in the training data, rather than just the masked ones. So, they tried just calculating the loss for the masked/replaced tokens. The resulting discriminator performs very poorly downstream. But I find this a little odd as a design choice, since couldn't the discriminator learn to almost always predict that a replaced token was fake, given that the only way it could be otherwise would be if the generator got lucky and produced the true word? They also did the (more sensible, to me) experiment of calculating the loss on a similarly-sized percentage of tokens, but not fully overlapping with the replacement mask, and that performed more similarly to base ELECTRA.
They also tested training a combined MLM/ELECTRA loss, where generated tokens were used in lieu of masking, and the full-sized MLM generator predicts the true token at every point in the sequence (which could be the token it gets as input, or could not be, in the case of a replacement). That model performed more similarly to ELECTRA than to BERT, which suggests that the efficiency gain of calculating a loss on every element in the training set was more important in practice than the gain from focusing a discriminator more directly on what was valuable for downstream tasks, rather than generating.
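As promised above, here is a condensed sketch of the replaced-token-detection setup. It assumes `generator(token_ids)` returns per-token vocabulary logits and `discriminator(token_ids)` returns per-token real/fake logits; the mask id, mask probability, and loss weighting are illustrative placeholders, not the paper's exact values.

```python
# Sketch of one ELECTRA-style training step (interfaces and hyperparameters are assumptions).
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, input_ids, mask_prob=0.15, mask_id=103):
    # 1) Mask a random subset of positions and train the generator with an ordinary MLM loss.
    masked_pos = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    gen_input = input_ids.masked_fill(masked_pos, mask_id)
    gen_logits = generator(gen_input)                     # (batch, seq, vocab)
    mlm_loss = F.cross_entropy(gen_logits[masked_pos], input_ids[masked_pos])

    # 2) Sample replacement tokens from the generator (no gradient flows through the samples,
    #    so the generator is never trained adversarially against the discriminator).
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked_pos, sampled, input_ids)

    # 3) The discriminator predicts, for *every* token, whether it matches the original input.
    #    A lucky sample that equals the original token is labeled "real".
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)                # (batch, seq) real/fake logits
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # Generator trains only on its MLM loss; the relative weighting is a hyperparameter.
    return mlm_loss + 50.0 * disc_loss
```

The key efficiency point from the summary is visible in step 3: the binary loss is computed over every position in the sequence, not just the ~15% that were masked.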