[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential savings in transport cost when the data needed to train models lives on users' devices and would require a lot of bandwidth to transfer to a centralized server.

Historically, the default way to do federated learning was with an algorithm called FedSGD, which worked by:

- Sending a copy of the current model to each device/client
- Calculating a gradient update to be applied on top of that current model, given a batch of data sampled from the client's device
- Sending that gradient back to the central server
- Averaging those gradients and applying them all at once to a central model

The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally and sends the resulting *model* back to the central server, which performs model averaging by averaging the parameter vectors. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to reach a given accuracy could be reduced by performing multiple steps of gradient calculation and update on a given device before sending the resulting model back to a central server to be averaged with other clients' models. Specifically, their algorithm, FedAvg, works by (see the sketch at the end of this summary):

- Dividing the data on a given device into batches of size B
- Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server
- Repeating this for E epochs

Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, where a given user and device might cover a very specialized part of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors empirically test whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, sweeping over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits).

https://i.imgur.com/xq9vi8S.png

They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communication than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device). The authors argue that this shows the practical usefulness of a federated learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases.
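To make the round structure concrete, here is a minimal sketch of one FedAvg communication round in Python/NumPy. Everything here (treating the model as a flat parameter vector, the pluggable `grad_fn`, the names `client_update` and `fedavg_round`) is an illustrative assumption rather than the paper's actual implementation.

```python
import numpy as np

def client_update(weights, client_data, grad_fn, B, E, lr):
    """Run E local epochs of minibatch SGD on one client, starting from the
    server's weights, and return the locally updated parameter vector."""
    w = weights.copy()
    for _ in range(E):
        np.random.shuffle(client_data)
        for start in range(0, len(client_data), B):
            batch = client_data[start:start + B]
            w -= lr * grad_fn(w, batch)  # one local SGD step per batch
    return w

def fedavg_round(server_weights, clients, grad_fn, B, E, lr):
    """One communication round: each client trains locally, then the server
    averages the returned parameter vectors, weighted by client data size."""
    updated, sizes = [], []
    for client_data in clients:
        updated.append(client_update(server_weights, client_data, grad_fn, B, E, lr))
        sizes.append(len(client_data))
    sizes = np.array(sizes, dtype=float)
    return np.average(np.stack(updated), axis=0, weights=sizes / sizes.sum())
```

Setting E = 1 and B equal to the full local dataset recovers the FedSGD baseline described above, which is what makes the paper's comparison of communication rounds an apples-to-apples one.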
[link]
This paper deals with the question of what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels, achieving a training error of 0 and a test error that is no better than randomly guessing labels, as expected. $\Rightarrow$ Those architectures can simply brute-force memorize the training data. (A small illustrative sketch of this randomization test is included at the end of this summary.)
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension, Rademacher complexity, and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can fit any labeling (finite-sample expressivity). See Section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c = 0$ is a perfect labeling and $p_c = 1$ is totally random) into account: if $p_c = 1$, then $m$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, and not just on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
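The randomization test behind the first contribution is easy to reproduce in spirit. Below is a minimal, hedged sketch in PyTorch, using a small over-parameterized MLP on Gaussian-noise inputs with uniformly random labels instead of the paper's Inception/AlexNet on real image datasets; all sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 512, 100, 10                       # samples, input dimension, classes
X = torch.randn(n, d)                        # "random images" (Gaussian noise)
y = torch.randint(0, k, (n,))                # completely random labels

# Over-parameterized 2-layer MLP: far more weights than the n training points.
model = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, k))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):                     # full-batch training until memorization
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

train_err = (model(X).argmax(dim=1) != y).float().mean().item()
print(f"training error on random labels: {train_err:.3f}")  # expected to approach 0
```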
[link]
Wu et al. provide a framework (behavior regularized actor critic (BRAC)) which they use to empirically study the impact of different design choices in batch reinforcement learning (RL). Specific instantiations of the framework include BCQ, KL-Control and BEAR.

Pure off-policy RL describes the problem of learning a policy purely from a batch $B$ of one-step transitions collected with a behavior policy $\pi_b$. The setting allows for no further interactions with the environment. This learning regime is desirable, for example, in high-stakes scenarios like education or health care.

The core principle of batch RL algorithms is to stay, in some sense, close to the behavior policy. The paper proposes to incorporate this firstly via a regularization term in the value function, which is denoted as **value penalty**. In this case the value function of BRAC takes the following form:

$ V_D^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \mathbb{E}_{s_t \sim P_t^{\pi}(s)}\left[R^{\pi}(s_t) - \alpha D\left(\pi(\cdot \vert s_t) \Vert \pi_b(\cdot \vert s_t)\right)\right], $

where $\pi_b$ is the maximum likelihood estimate of the behavior policy based upon $B$. This results in the Q-function objective

$ \min_{Q} \mathbb{E}_{\substack{(s,a,r,s') \sim D \\ a' \sim \pi_{\theta}(\cdot \vert s')}}\left[\left(r + \gamma \left(\bar{Q}(s',a') - \alpha D\left(\pi(\cdot \vert s') \Vert \pi_b(\cdot \vert s')\right)\right) - Q(s,a)\right)^2\right] $

and the corresponding policy update

$ \max_{\pi_{\theta}} \mathbb{E}_{(s,a,r,s') \sim D}\left[\mathbb{E}_{a'' \sim \pi_{\theta}(\cdot \vert s)}\left[Q(s,a'')\right] - \alpha D\left(\pi_{\theta}(\cdot \vert s) \Vert \pi_b(\cdot \vert s)\right)\right]. $

(A rough sketch of these two objectives is given at the end of this summary.)

The second approach is **policy regularization**. Here the regularization weight $\alpha$ is set to zero in the value objectives (V and Q) and is non-zero in the policy objective. It is possible to instantiate, for example, the following batch RL algorithms in this framework:

- BEAR: policy regularization with sample-based kernel MMD as $D$ and a min-max mixture of the two ensemble elements for $\bar{Q}$
- BCQ: no regularization, but policy optimization over a restricted space

Extensive experiments over the four MuJoCo tasks Ant, HalfCheetah, Hopper and Walker show:

1. for a BEAR-like instantiation there is a modest advantage to keeping $\alpha$ fixed
2. using a mixture of a two- or four-Q-network ensemble as the target value yields better returns than using one Q-network
3. taking the minimum of ensemble Q-functions is slightly better than taking a mixture (for Ant, HalfCheetah & Walker, but not for Hopper)
4. the use of the value penalty yields higher returns than the policy penalty
5. no choice for $D$ (MMD, KL (primal), KL (dual) or Wasserstein (dual)) significantly outperforms the others (note that this contradicts the BEAR paper, where MMD was better than KL)
6. the value-penalty version consistently outperforms BEAR, which in turn outperforms BCQ, which improves upon a partially trained baseline

This large-scale study of different design choices helps in developing new methods. It is however surprising to see that most design choices in current methods are shown empirically to be non-crucial. This points to the importance of agreeing upon common test scenarios within a community to prevent over-fitting new algorithms to a particular setting.
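For concreteness, here is a minimal, hedged PyTorch-style sketch of the value-penalty Q-loss and policy loss above. It assumes a single Q-network, that `policy(s)` and `behavior_policy(s)` return batched torch `Distribution` objects with a per-state KL (i.e. KL is used as the divergence $D$), and that all shapes broadcast; the function name and signature are hypothetical, not the paper's code.

```python
import torch
from torch.distributions import kl_divergence

def value_penalty_losses(batch, q_net, q_target_net, policy, behavior_policy,
                         alpha=0.1, gamma=0.99):
    """BRAC-style value-penalty losses for one batch of logged transitions (s, a, r, s')."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    with torch.no_grad():
        a_next = policy(s_next).sample()
        # Divergence D(pi(.|s') || pi_b(.|s')) subtracted inside the Bellman target.
        div_next = kl_divergence(policy(s_next), behavior_policy(s_next))
        target = r + gamma * (q_target_net(s_next, a_next) - alpha * div_next)

    # Q objective: squared error against the penalized Bellman target.
    q_loss = ((q_net(s, a) - target) ** 2).mean()

    # Policy objective: maximize E[Q(s, a'')] - alpha * D(pi(.|s) || pi_b(.|s)).
    a_new = policy(s).rsample()              # reparameterized sample so gradients flow
    div_now = kl_divergence(policy(s), behavior_policy(s))
    policy_loss = -(q_net(s, a_new) - alpha * div_now).mean()

    return q_loss, policy_loss
```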
[link]
Occasionally, I come across results in machine learning that I'm glad exist, even if I don't fully understand them, precisely because they remind me how little we know about the complicated information architectures we're building, and what kinds of signal they can productively use. This is one such result.

The paper tests a method called self-training, and compares it against the more common standard of pre-training. Pre-training works by first training your model on a different dataset, in a supervised way, with the labels attached to that dataset, and then transferring the learned weights of that model (except for the final prediction head) and using them as the initialization for training on your downstream task. Self-training also uses an external dataset, but doesn't use that external data's labels. It works by (see the sketch at the end of this summary):

1) Training a model on the labeled data from your downstream task, the one you ultimately care about final performance on
2) Using that model to make label predictions (for the label set of your downstream task) for the external dataset
3) Retraining a model from scratch on the combined set of human labels and predicted labels from step (2)

https://i.imgur.com/HaJTuyo.png

This intuitively feels like cheating; something that shouldn't quite work, and yet the authors find that it equals or outperforms pre-training and self-supervised learning in the setting they examined (transferring from ImageNet as an external dataset to COCO as a downstream task, and using data augmentations on COCO). They particularly find this to be the case when they're using stronger data augmentations, and when they have more labeled COCO data to train with from the pretrained starting point. They also find that self-training outperforms self-supervised (e.g. contrastive) learning in similar settings. They further demonstrate that self-training and pre-training can stack: you can get marginal value from one even if you're already using the other. They do acknowledge that - because it requires training a model on your dataset twice, rather than reusing an existing model directly - their approach is more computationally costly than the pretrained-ImageNet alternative.

This work is, I believe, rooted in the literature on model distillation and student/teacher learning regimes, which I believe has found that you can sometimes outperform a model by training on its outputs, though I can't fully remember the setups used in those works. The authors don't try too hard to give a rigorous theoretical account of why this approach works, which I actually appreciate. I think we need to have space in ML for people to publish what (at least to some) might be unintuitive empirical results, without necessarily feeling pressure to articulate a theory that may just be a half-baked after-the-fact justification.

One criticism or caveat I have about this paper is that I wish they'd evaluated what happened if they didn't use any augmentation. Does pre-training do better in that case? Does the training process they're using just break down? Only testing on settings with augmentations made me a little less confident in the generality of their result. Their best guess as to why self-training works is that it demonstrates the value of task-specificity in your training. I think there's a bit of that, but also feel like this ties in with other papers I've read recently on the surprising efficacy of training with purely random labels. I think there's, in general, a lot we don't know about what ostensibly supervised networks learn in the face of noisy or even completely permuted labels.
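Stripped of the paper's specific detection architectures, augmentation schedule, and loss weighting (all omitted here), the self-training loop described above is roughly the following; `build_model`, `train`, and `predict` are hypothetical placeholders for whatever training routine the downstream task uses.

```python
def self_training(labeled_ds, external_images, build_model, train, predict):
    """Hedged sketch of the three-step self-training recipe."""
    # 1) Train a teacher on the downstream task's human-labeled data only.
    teacher = train(build_model(), labeled_ds)

    # 2) Pseudo-label the external (unlabeled) dataset with the teacher,
    #    using the downstream task's label set.
    pseudo_labeled = [(image, predict(teacher, image)) for image in external_images]

    # 3) Retrain a student from scratch on human labels + pseudo-labels combined.
    student = train(build_model(), list(labeled_ds) + pseudo_labeled)
    return student
```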
[link]
[First off, full credit that this summary is essentially a distilled-for-my-own-understanding compression of Yannic Kilcher's excellent video on the topic]

I'm interested in learning more about Neural Radiance Fields (or NeRFs), a recent technique for learning a representation of a scene that lets you generate multiple views from it, and a paper referenced as a useful prerequisite for that technique was SIRENs, or Sinusoidal Representation Networks.

In my view, the most complex part of understanding this technique isn't the technique itself, but the particularities of the problem being solved, and the ways it differs from a more traditional ML setup. Typically, the goal of machine learning is to learn a model that extracts and represents properties of a data distribution, and that can generalize to new examples drawn from that distribution. Instead, in this framing, a single network is being used to capture information about a single image, essentially creating a compressed representation of that image that brings with it some nice additional properties. Concretely, the neural network represents a function that maps inputs of the form (x, y), representing coordinates within the image, to (r, g, b) values, representing the pixel values of the image at that coordinate. If you're able to train an optimal version of such a network, it would mean you have a continuous representation of the image. A good way to think about "continuous" here is that you could theoretically ask the model for the color value at pixel (3.5, 2.5), and, given that it's simply a numerical mapping, it could give you a prediction, even though that pixel never appears in your discrete "sampling" of pixels.

Given this problem setting, the central technique proposed by SIRENs is to use sinusoidal non-linearities between the layers. On the face of it, this may seem like a pretty weird choice: non-linearities are generally monotonic, and a sine wave is absolutely not that. The appealing property of sinusoidal activations in this context is: if you take a derivative of a sine curve, what you get is a cosine curve (which is essentially a shifted sine curve), and the same is true in reverse. This means that you can take multiple derivatives of the learned function (where, again, "learned function" is your neural network optimized for this particular image), and have them still be networks of the same underlying format, with shifted constants.

This allows SIRENs to use an enhanced version of what would be a typical training procedure for this setting. Simplistically, the way you'd go about training this kind of representation would be to simply give the inputs, and optimize against a loss function that reduces your prediction error on the output values, or, in other words, the error on the f(x, y) function itself. When you have a model structure that makes it easy to take first and second derivatives of the function calculated by the model, you can, as this paper does, decide to train against a loss function that matches not just the true f(x, y) function (again, the pixel values at coordinates), but also the first and second derivatives (gradients and Laplacian) of the image at those coordinates. This supervision lets you learn a better underlying representation, since it enforces not just what comes "above the surface" at your sampled pixels, but the dynamics of the true function between those points.
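A minimal, hedged sketch of such a coordinate network with sine activations might look like the following (PyTorch). The layer widths, the frequency factor `omega_0 = 30`, and the use of PyTorch's default linear initialization are rough assumptions loosely following the paper, not a faithful reproduction of its specialized initialization scheme.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by sin(omega_0 * (Wx + b)), the building block of a SIREN."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# Map 2D pixel coordinates (x, y) to RGB values (r, g, b) for a single image.
siren = nn.Sequential(
    SineLayer(2, 256),
    SineLayer(256, 256),
    SineLayer(256, 256),
    nn.Linear(256, 3),
)

# Stand-ins for the real pixel grid and pixel values of the image being fit.
coords = torch.rand(1024, 2, requires_grad=True)
rgb = torch.rand(1024, 3)

pred = siren(coords)
pixel_loss = ((pred - rgb) ** 2).mean()
# Because sin is smooth, derivatives of the output w.r.t. the input coordinates are
# well-defined and can themselves be supervised (matched to image gradients, Laplacian, etc.).
grads = torch.autograd.grad(pred.sum(), coords, create_graph=True)[0]
```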
One interesting benefit of this procedure of using a loss in first- or second-derivative space (as pointed out in the paper) is that, if you want to merge the interesting parts of multiple images, you can approximate that by training a SIREN on the sum of their gradients, since places where gradients are zero likely don't contain much contrast or interesting content (as an example: a constant color background). The Experiments section goes into a lot of specific applications in boundary-finding problems, which I understand at less depth, and thus won't try to explain. It also briefly mentions trying to learn a prior over the space of image functions (that is, a prior over the set of network weights that define the underlying function of an image); having such a prior is interesting in that it would theoretically let you sample both the implicit image function itself (from the prior), and then also points within that function.