Welcome to ShortScience.org! |
[link]
This paper presents a combination of the inception architecture with residual networks. This is done by adding a shortcut connection to each inception module. This can alternatively be seen as a resnet where the 2 conv layers are replaced by a (slightly modified) inception module. The paper (claims to) provide results against the hypothesis that adding residual connections improves training, rather increasing the model size is what makes the difference. |
[link]
Ilyas et al. present a follow-up work to their paper on the trade-off between accuracy and robustness. Specifically, given a feature $f(x)$ computed from input $x$, the feature is considered predictive if $\mathbb{E}_{(x,y) \sim \mathcal{D}}[y f(x)] \geq \rho$; similarly, a predictive feature is robust if $\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\inf_{\delta \in \Delta(x)} yf(x + \delta)\right] \geq \gamma$. This means, a feature is considered robust if the worst-case correlation with the label exceeds some threshold $\gamma$; here the worst-case is considered within a pre-defined set of allowed perturbations $\Delta(x)$ relative to the input $x$. Obviously, there also exist predictive features, which are however not robust according to the above definition. In the paper, Ilyas et al. present two simple algorithms for obtaining adapted datasets which contain only robust or only non-robust features. The main idea of these algorithms is that an adversarially trained model only utilizes robust features, while a standard model utilizes both robust and non-robust features. Based on these datasets, they show that non-robust, predictive features are sufficient to obtain high accuracy; similarly training a normal model on a robust dataset also leads to reasonable accuracy but also increases robustness. Experiments were done on Cifar10. These observations are supported by a theoretical toy dataset consisting of two overlapping Gaussians; I refer to the paper for details. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Contrastive learning works by performing augmentations on a batch of images, and training a network to match the representations of the two augmented parts of a pair together, and push the representations of images not in a pair farther apart. Historically, these algorithms have benefitted from using stronger augmentations, which has the effect of making the two positive elements in a pair more visually distinct from one another. This paper tries to build on that success, and, beyond just using a strong augmentation, tries to learn a way to perturb images that adversarially increases contrastive loss. As with adversarial training in normal supervised setting, the thinking here is that examples which push loss up the highest are the hardest and thus most informative for the network to learn from While the concept of this paper made some sense, I found the notation and the explanation of mechanics a bit confusing, particularly when it came to choice to frame a contrastive loss as a cross-entropy loss, with the "weights" of the dot product in the the cross-entropy loss being, in fact, the projection by the learned encoder of various of the examples in the batch. https://i.imgur.com/iQXPeXk.png This notion of the learned representations being "weights" is just odd and counter-intuitive, and the process of trying to wrap my mind around it isn't one I totally succeeded at. I think the point of using this frame is because it provides an easy analogue to the Fast Gradient Sign Method of normal supervised learning adversarial examples, even though it has the weird effect that, as the authors say "your weights vary by batch...rather than being consistent across training," Notational weirdness aside, my understanding is that the method of this paper: - Runs a forward pass of normal contrastive loss (framed as cross-entropy loss) which takes augmentations p and q and runs both forward through an encoder. - Calculates a delta to apply to each input image in the q that will increase the loss most, taken over all the images in the p set - I think the delta is per-image in q, and is just aggregated over all images in p, but I'm not fully confident of this, as a result of notational confusion. It could also be one delta applied for all all images in q. - Calculate the loss that results when you run forward the adversarially generated q against the normal p - Train a combined loss that is a weighted combination of the normal p/q contrastive part and the adversarial p/q contrastive part https://i.imgur.com/UWtJpVx.png The authors show a small but relatively consistent improvement to performance using their method. Notably, this improvement is much stronger when using larger encoders (presumably because they have more capacity to learn from harder examples). One frustration I have with the empirics of the paper is that, at least in the main paper, they don't discuss the increase in training time required to calculate these perturbations, which, a priori, I would imagine to be nontrivial. |
[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server. Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by: - Sending a copy of the current model to each device/client - Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device - Sending that gradient back to the central server - Averaging those gradients and applying them all at once to a central model The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be average with other clients models. Specifically, their algorithm, FedAvg, works by: - Dividing the data on a given device into batches of size B - Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server - Repeating this for E epochs Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors experiment empirically ask the question whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits). https://i.imgur.com/xq9vi8S.png They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communciation than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases. |
[link]
In this note, I'll implement the [Stochastically Unbiased Marginalization Objective (SUMO)](https://openreview.net/forum?id=SylkYeHtwr) to estimate the log-partition function of an energy funtion. Estimation of log-partition function has many important applications in machine learning. Take latent variable models or Bayeisian inference. The log-partition function of the posterior distribution $$p(z|x)=\frac{1}{Z}p(x|z)p(z)$$ is the log-marginal likelihood of the data $$\log Z = \log \int p(x|z)p(z)dz = \log p(x)$$. More generally, let $U(x)$ be some energy function which induces some density function $p(x)=\frac{e^{-U(x)}}{\int e^{-U(x)} dx}$. The common practice is to look at a variational form of the log-partition function, $$ \log Z = \log \int e^{-U(x)}dx = \max_{q(x)}\mathbb{E}[-U(x)-\log q(x)] \nonumber $$ Plugging in an arbitrary $q$ would normally yield a strict lower bound, which means $$ \frac{1}{n}\sum_{i=1}^n \left(-U(x_i) - \log q(x_i)\right) \nonumber $$ for $x_i$ sampled *i.i.d.* from $q$, would be a biased estimate for $\log Z$. In particular, it would be an underestimation. To see this, lets define the energy function $U$ as follows: $$ U(x_1,x_2)= - \log \left(\frac{1}{2}\cdot e^{-\frac{(x_1+2)^2 + x_2^2}{2}} + \frac{1}{2}\cdot\frac{1}{4}e^{-\frac{(x_1-2)^2 + x_2^2}{8}}\right) \nonumber $$ It is not hard to see that $U$ is the energy function of a mixture of Gaussian distribution $\frac{1}{2}\mathcal{N}([-2,0], I) + \frac{1}{2}\mathcal{N}([2,0], 4I)$ with a normalizing constant $Z=2\pi\approx6.28$ and $\log Z\approx1.8379$. ```python def U(x): x1 = x[:,0] x2 = x[:,1] d2 = x2 ** 2 return - np.log(np.exp(-((x1+2) ** 2 + d2)/2)/2 + np.exp(-((x1-2) ** 2 + d2)/8)/4/2) ``` To visualize the density corresponding to the energy $p(x)\propto e^{-U(x)}$ ```python xx = np.linspace(-5,5,200) yy = np.linspace(-5,5,200) X = np.meshgrid(xx,yy) X = np.concatenate([X[0][:,:,None], X[1][:,:,None]], 2).reshape(-1,2) unnormalized_density = np.exp(-U(X)).reshape(200,200) plt.imshow(unnormalized_density) plt.axis('off') ``` https://i.imgur.com/CZSyIQp.png As a sanity check, lets also visualize the density of the mixture of Gaussians. ```python N1, N2 = mvn([-2,0], 1), mvn([2,0], 4) density = (np.exp(N1.logpdf(X))/2 + np.exp(N2.logpdf(X))/2).reshape(200,200) plt.imshow(density) plt.axis('off') print(np.allclose(unnormalized_density / density - 2*np.pi, 0)) ``` `True` https://i.imgur.com/g4inQxB.png Now if we estimate the log-partition function by estimating the variational lower bound, we get ```python q = mvn([0,0],5) xs = q.rvs(10000*5) elbo = - U(xs) - q.logpdf(xs) plt.hist(elbo, range(-5,10)) print("Estimate: %.4f / Ground true: %.4f" % (elbo.mean(), np.log(2*np.pi))) print("Empirical variance: %.4f" % elbo.var()) ``` `Estimate: 1.4595 / Ground true: 1.8379` `Empirical variance: 0.9921` https://i.imgur.com/vFzutuY.png The lower bound can be tightened via [importance sampling): $$ \log \int e^{-U(x)} dx \geq \mathbb{E}_{q^K}\left[\log\left(\frac{1}{K}\sum_{j=1}^K \frac{e^{-U(x_j)}}{q(x_j)}\right)\right] \nonumber $$ > This bound is tighter for larger $K$ partly due to the [concentration of the average](https://arxiv.org/pdf/1906.03708.pdf) inside of the $\log$ function: when the random variable is more deterministic, using a local linear approximation near its mean is more accurate as there's less "mass" outside of some neighborhood of the mean. Now if we use this new estimator with $K=5$ ```python k = 5 xs = q.rvs(10000*k) elbo = - U(xs) - q.logpdf(xs) iwlb = elbo.reshape(10000,k) iwlb = np.log(np.exp(iwlb).mean(1)) plt.hist(iwlb, range(-5,10)) print("Estimate: %.4f / Ground true: %.4f" % (iwlb.mean(), np.log(2*np.pi))) print("Empirical variance: %.4f" % iwlb.var()) ``` `Estimate: 1.7616 / Ground true: 1.8379` `Empirical variance: 0.1544` https://i.imgur.com/sCcsQd4.png We see that both the bias and variance decrease. Finally, we use the [Stochastically Unbiased Marginalization Objective](https://openreview.net/pdf?id=SylkYeHtwr) (SUMO), which uses the *Russian Roulette* estimator to randomly truncate a telescoping series that converges in expectation to the log partition function. Let $\text{IWAE}_K = \log\left(\frac{1}{K}\sum_{j=1}^K \frac{e^{-U(x_j)}}{q(x_j)}\right)$ be the importance-weighted estimator, and $\Delta_K = \text{IWAE}_{K+1} - \text{IWAE}_K$ be the difference (which can be thought of as some form of correction). The SUMO estimator is defined as $$ \text{SUMO} = \text{IWAE}_1 + \sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \nonumber $$ where $K\sim p(K)=\mathbb{P}(\mathcal{K}=K)$. To see why this is an unbiased estimator, $$ \begin{align*} \mathbb{E}[\text{SUMO}] &= \mathbb{E}\left[\text{IWAE}_1 + \sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \right] \nonumber\\ &= \mathbb{E}_{x's}\left[\text{IWAE}_1 + \mathbb{E}_{K}\left[\sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \right]\right] \nonumber \end{align*} $$ The inner expectation can be further expanded $$ \begin{align*} \mathbb{E}_{K}\left[\sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \right] &= \sum_{K=1}^\infty P(K)\sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \\ &= \sum_{k=1}^\infty \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \sum_{K=k}^\infty P(K) \\ &= \sum_{k=1}^\infty \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \mathbb{P}(\mathcal{K}\geq k) \\ &= \sum_{k=1}^\infty\Delta_K \\ &= \text{IWAE}_{2} - \text{IWAE}_1 + \text{IWAE}_{3} - \text{IWAE}_2 + ... = \lim_{k\rightarrow\infty}\text{IWAE}_{k}-\text{IWAE}_1 \end{align*} $$ which shows $\mathbb{E}[\text{SUMO}] = \mathbb{E}[\text{IWAE}_\infty] = \log Z$. > (N.B.) Some care needs to be taken care of for taking the limit. See the paper for more formal derivation. A choice of $P(K)$ proposed in the paper satisfy $\mathbb{P}(\mathcal{K}\geq K)=\frac{1}{K}$. We can sample such a $K$ easily using the [inverse CDF](https://en.wikipedia.org/wiki/Inverse_transform_sampling), $K=\lfloor\frac{u}{1-u}\rfloor$ where $u$ is sampled uniformly from the interval $[0,1]$. Now putting things all together, we can estimate the log-partition using SUMO. ```python count = 0 bs = 10 iwlb = list() while count <= 1000000: u = np.random.rand(1) k = np.ceil(u/(1-u)).astype(int)[0] xs = q.rvs(bs*(k+1)) elbo = - U(xs) - q.logpdf(xs) iwlb_ = elbo.reshape(bs, k+1) iwlb_ = np.log(np.cumsum(np.exp(iwlb_), 1) / np.arange(1,k+2)) iwlb_ = iwlb_[:,0] + ((iwlb_[:,1:k+1] - iwlb_[:,0:k]) * np.arange(1,k+1)).sum(1) count += bs * (k+1) iwlb.append(iwlb_) iwlb = np.concatenate(iwlb) plt.hist(iwlb, range(-5,10)) print("Estimate: %.4f / Ground true: %.4f" % (iwlb.mean(), np.log(2*np.pi))) print("Empirical variance: %.4f" % iwlb.var()) ``` `Estimate: 1.8359 / Ground true: 1.8379` `Empirical variance: 4.1794` https://i.imgur.com/04kPKo5.png Indeed the empirical average is quite close to the true log-partition of the energy function. However we can also see that the distribution of the estimator is much more spread-out. In fact, it is very heavy-tailed. Note that I did not tune the proposal distribution $q$ based on the ELBO, or IWAE or SUMO. In the paper, the authors propose to tune $q$ to minimize the variance of the $\text{SUMO}$ estimator, which might be an interesting trick to look at next. (Reposted, see more details and code from https://www.chinweihuang.com/pages/sumo) |