[link]
The idea of the Switch Transformer is to have more parameters available for a network to use, but to only use a small subset of those parameters for each example that's run through the network. This is achieved through a routing scheme, whereby a weighting layer is applied to each token and produces a set of logits/softmax weights over the set of possible experts. The token is then sent to the expert that was given the highest weight. The network is implemented such that different experts can actually live on different devices.

![](https://i.imgur.com/HEB7cJw.png)

This architecture is inspired by previous Mixture of Experts work, which applied a similar scheme but sent each token through a set of k experts rather than just a single one. That was ostensibly done to increase stability and performance, but the authors of this paper argue that using a single expert per token is actually preferable on both of those fronts. There are a lot of experiments in this paper, and I'd recommend taking a look at them in detail if you're interested, but, at a high level, they found evidence that, compared to models with a comparable number of parameters, they were indeed able to get comparable or better performance with fewer FLOPs. It also meant they were able to build up to a trillion-parameter model without unreasonable computation requirements.

Some interesting considerations relevant to this approach (see the routing sketch below):

- To keep training speed up, you need to strike the right balance in the number of tokens sent to each expert; the authors added a loss term to incentivize the division of tokens between experts to be roughly uniform.
- There was some numerical instability in the expert training procedure with float16, so they switched to float32, but only within the experts themselves rather than in the rest of the network.
- To regularize a network this large, the authors applied dropout, but only within the experts.
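To make the top-1 routing concrete, here is a minimal PyTorch-style sketch of the routing step described above. Names like `router_weights`, `experts`, and the load-balancing coefficient `aux_coef` are my own illustrative choices, not the paper's exact implementation (which also handles expert capacity and device placement).

```python
import torch
import torch.nn.functional as F

def switch_route(tokens, router_weights, experts, aux_coef=0.01):
    """Top-1 (switch) routing: each token goes to its single highest-scoring expert.

    tokens:         (num_tokens, d_model)
    router_weights: (d_model, num_experts) routing layer
    experts:        list of per-expert feed-forward modules
    """
    logits = tokens @ router_weights                 # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    expert_idx = probs.argmax(dim=-1)                # top-1 expert per token

    out = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        mask = expert_idx == e
        if mask.any():
            # scale by the router probability so the routing layer receives gradients
            out[mask] = probs[mask, e].unsqueeze(-1) * expert(tokens[mask])

    # auxiliary load-balancing loss: fraction of tokens per expert times mean router prob,
    # encouraging a roughly uniform division of tokens between experts
    num_experts = len(experts)
    frac_tokens = torch.bincount(expert_idx, minlength=num_experts).float() / tokens.shape[0]
    mean_probs = probs.mean(dim=0)
    aux_loss = aux_coef * num_experts * (frac_tokens * mean_probs).sum()
    return out, aux_loss
```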
[link]
In the past year or so, contrastive learning has experienced widespread success, and has risen to be a dominant problem framing within self-supervised learning. The basic idea of contrastive learning is that, instead of needing human-generated labels to define a supervised task, you assume there exists some automated operation you can perform on a data element to generate another element that, while different, should still be considered fundamentally the same, or at least strongly related, and you can contrast these related pairs against pairs constructed from the rest of the dataset, which by default don't share this assumed relationship of sameness or quasi-similarity.

One fairly central way that "different but still effectively similar" has historically been defined - at least within the realm of image-based models - is through data augmentations: image transformations such as cropping, color jitter, or Gaussian blur, applied to an image to create the counterpart in its related pair. Fundamentally, what we're doing when we define these particular augmentations is saying: these transformations don't cause a meaningful change in what the image is, and so we want the representations we get with and without the transformations to be close to one another (or at least to contain enough information to predict one another). Another way to say this is that we're defining properties of the image that we want our representation to be invariant to (see the contrastive-loss sketch below).

The authors of this paper make the point that, when aggressive cropping is part of your toolkit of augmentations, crops of an image can contain meaningfully different content than the uncropped image. If you remove a few pixels around the edges of an image, it still fundamentally contains the same things. However, if you zoom in dramatically, you may get crops that contain different objects. From an image classification standpoint, you would expect that coding an invariance to cropping into our representations would, in some cases, also mean coding in an invariance to object type, which would presumably be detrimental to the task of classifying objects.

To explain the extent of the success that aggressive-cropping methods have had so far, they argue that ImageNet has the particular property that its images are curated to primarily and centrally contain a single object at a time, so that even if you zoom in, you're getting part of the central object rather than another object entirely. They argue that this dataset bias might explain why object-invariance hasn't shown up as much of a problem in earlier augmentation-based contrastive work. To test this, they train contrastive (MoCo v2) models on the MSCOCO dataset, whose images typically contain multiple objects in a scene and thus no longer have the property of being centrally about one object. They tried one setting where they performed the contrastive loss on the images as a whole, and another where the inputs to the augmentation pipeline were images from the same dataset, but pre-cropped to contain only one object at a time. This was meant, as far as I can tell, to isolate the effect of "object-centric vs not" while holding other dataset factors constant. They then test how well these different models do on an object-centric classification task (Pascal Cropped Boxes).
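To make the augmentation-pair setup concrete, here is a minimal sketch of an InfoNCE-style contrastive loss over two augmented views of a batch. This is a generic illustration rather than the exact MoCo v2 machinery used in the paper, and the temperature value is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) representations of two augmented views of the same images.
    Row i of z1 and row i of z2 form a positive pair; all other rows act as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(z1.shape[0], device=z1.device)
    # cross-entropy pulls the diagonal (positive pairs) up and pushes the rest down
    return F.cross_entropy(logits, targets)
```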
They find that the contrastive model trained on the cropped version of the dataset gets about 3.5 points higher mean accuracy (71.9 vs 75.3) than the contrastive loss applied to the multi-object versions of the images. They also explicitly try to measure different forms of invariance, through a scheme where they binarize the elements of the representation vector and calculate what proportion of them fire on average with and without a given set of transformations. They find that the main form of invariance contrastive learning does well at is invariance to occlusion (part of the image not being visible), where both contrastive methods give ~84 percent co-firing, while supervised pre-training only gets about 80.9. However, on other important measures of invariance - viewpoint, illumination direction, and instance (that is, specific instance within a class) - contrastive representations perform notably worse than supervised pretraining.

![](https://i.imgur.com/7Ghbv5A.png)

To try to address these problems, they propose a method that learns from video and uses temporally separated frames (which are then augmented) as pairs. They call this Frame Temporal Invariance, and argue, reasonably, that by pushing the representations of adjacent frames - which track a presumably consistent, or at least slowly-evolving, scene - closer together, you should expect better invariance to viewpoint change and image deformation, since those things naturally happen as an object moves through the world. They also suggest using an off-the-shelf object bounding box model to find particular objects, track them throughout the video, and apply contrastive learning specifically to the bounding boxes that the algorithm thinks track a consistent object (a sketch of the temporal pairing idea follows below).

![](https://i.imgur.com/2GfCTog.png)

Overall, my take on this paper is that the analysis they do - of the different kinds of invariances contrastive vs supervised losses do well on, and of the extent to which contrastive results might be biased by datasets - is quite interesting and a valuable contribution to our understanding of these currently very hyped algorithms. However, I'm a bit less impressed by the novelty of their proposed solution. Temporal forms of contrastive learning have been around before - in reinforcement learning, and even in the original Contrastive Predictive Coding paper, where the related pairs were related by dint of temporal closeness. So, while using it on video is certainly a good idea, it doesn't feel strongly novel to me. I'm also a little confused by their choice of using an off-the-shelf object detection model as a prerequisite for a self-supervised task, since my impression was that a central goal of self-supervision is building techniques that scale to situations where it's infeasible to get large amounts of labels, and any method that relies on a pre-existing trained bounding box model is pretty inherently limited in that regard.
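A minimal sketch of the temporal-pairing idea, reusing the hypothetical `info_nce_loss` from the previous sketch: positives are two frames drawn a few timesteps apart from the same clip, each independently augmented. The frame gap and the `augment`/`encoder` callables are placeholder assumptions of mine, not the paper's exact settings.

```python
import random
import torch

def temporal_pair_embeddings(video, encoder, augment, max_gap=8):
    """video: (num_frames, C, H, W) tensor for one clip.
    Returns representations of two augmented, temporally separated frames."""
    t = random.randrange(video.shape[0] - max_gap)
    gap = random.randint(1, max_gap)
    frame_a = augment(video[t])
    frame_b = augment(video[t + gap])
    z1 = encoder(frame_a.unsqueeze(0))   # (1, dim)
    z2 = encoder(frame_b.unsqueeze(0))   # (1, dim)
    return z1, z2

# In training, batches of such (z1, z2) pairs would be fed to info_nce_loss above,
# so that temporally close views of the same scene are pulled together.
```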
[link]
Cover's Universal Portfolio is an information-theoretic portfolio optimization algorithm that uses constant rebalanced portfolios (CRPs). A CRP is one in which the distribution of wealth among the stocks in the portfolio remains the same from period to period. Universal Portfolio performs its rebalancing based strictly on historical prices, making no assumptions about the underlying distribution of the prices. The wealth achieved by a CRP with allocation vector $b$ over $n$ periods of price relatives $x^n = (x_1, \dots, x_n)$ is: $$S_n(b, x^n) = \prod_{i=1}^{n} b \cdot x_i$$ The key takeaway: Cover takes the wealth-weighted integral of the allocations $b$ over the entire simplex of portfolios to give $b_{t+1}$. This is what makes it "universal". Most implementations in practice do this discretely, by creating a matrix $\mathrm{B}$ with each row containing one combination of percentage allocations, and calculating $\mathrm{S} = \mathrm{B} \cdot \mathrm{x}$ (see the sketch below). Cover mentions that trading costs will eat away most of the gains, especially if the algorithm is allowed to rebalance daily. Nowadays, there are commission-free brokers. See this summary for Universal Portfolios without transaction costs: \cite{conf/colt/BlumK97}
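A rough numpy sketch of the discrete approximation described above, assuming only two assets and a simple allocation grid; the grid resolution and the two-asset restriction are simplifications of mine, not part of Cover's formulation.

```python
import numpy as np

def universal_portfolio_two_assets(price_relatives, grid_size=101):
    """Discrete Universal Portfolio for two assets.

    price_relatives: (T, 2) array, price_relatives[t, j] = price_j(t) / price_j(t-1).
    Returns the sequence of allocation vectors b_t, one row per period.
    """
    # Each row of B is a candidate CRP allocation (w, 1 - w).
    w = np.linspace(0.0, 1.0, grid_size)
    B = np.stack([w, 1.0 - w], axis=1)                  # (grid_size, 2)

    wealth = np.ones(grid_size)                         # S_t(b) for every candidate CRP
    allocations = []
    for x in price_relatives:
        # Wealth-weighted average of the candidate allocations = next portfolio b_{t+1}.
        b_next = (wealth[:, None] * B).sum(axis=0) / wealth.sum()
        allocations.append(b_next)
        # Update each candidate CRP's wealth by its one-period return b . x.
        wealth *= B @ x
    return np.array(allocations)
```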
[link]
This paper is about a recommendation system approach using collaborative filtering (CF) on implicit feedback datasets. The core of it is the minimization problem

$$\min_{x_*, y_*} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

with

* $\lambda \in [0, \infty[$ a hyperparameter which defines how strongly the model is regularized
* $u$ denoting a user; $x_*$ are all user factors $x_u$ combined
* $i$ denoting an item; $y_*$ are all item factors $y_i$ combined
* $x_u \in \mathbb{R}^n$ the latent user factor (embedding); $n$ is another hyperparameter, and $n=50$ seems to be a reasonable choice
* $y_i \in \mathbb{R}^n$ the latent item factor (embedding)
* $r_{ui}$ the "intensity"; higher values mean user $u$ interacted more with item $i$
* $p_{ui} = \begin{cases}1 & \text{if } r_{ui} >0\\0 &\text{otherwise}\end{cases}$
* $c_{ui} := 1 + \alpha r_{ui}$, where $\alpha \in [0, \infty[$ is a hyperparameter; $\alpha =40$ seems to be reasonable

In contrast, the standard matrix factorization objective looks like this ([example](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf)):

$$\min_{x_*, y_*} \sum_{(u, i, r_{ui}) \in \mathcal{R}} {(r_{ui} - x_u^T y_i)}^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

where

* $\mathcal{R}$ is the set of all ratings $(u, i, r_{ui})$ - user $u$ has rated item $i$ with value $r_{ui} \in \mathbb{R}$

They use alternating least squares (ALS) to train this model (see the sketch below). The prediction is then the dot product between the user factor and all item factors ([source](https://github.com/benfred/implicit/blob/master/implicit/recommender_base.pyx#L157-L176)).
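To make the ALS step concrete, here is a small dense numpy sketch of the closed-form user update implied by the objective above. Real implementations (including the linked `implicit` library) work with sparse matrices and precompute $Y^T Y$ for speed, which I omit for clarity; the $\alpha$ and $\lambda$ defaults are placeholders.

```python
import numpy as np

def als_user_step(Y, R, alpha=40.0, lam=0.1):
    """One half of an ALS sweep: solve for all user factors with item factors Y fixed.

    Y:  (num_items, n) item factor matrix
    R:  (num_users, num_items) dense matrix of intensities r_ui (0 = no interaction)
    Returns X: (num_users, n), where each row is the closed-form least-squares solution
        x_u = (Y^T C^u Y + lam*I)^{-1} Y^T C^u p(u).
    """
    num_users, _ = R.shape
    n = Y.shape[1]
    P = (R > 0).astype(float)          # preferences p_ui
    C = 1.0 + alpha * R                # confidences c_ui
    X = np.zeros((num_users, n))
    for u in range(num_users):
        Cu = np.diag(C[u])             # (num_items, num_items); fine for a toy dense example
        A = Y.T @ Cu @ Y + lam * np.eye(n)
        b = Y.T @ Cu @ P[u]
        X[u] = np.linalg.solve(A, b)
    return X

# A full ALS iteration alternates: update X with Y fixed, then update Y with X fixed
# (by symmetry, swapping the roles of users and items). Predictions are X @ Y.T.
```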
[link]
In a binary classification task on an imbalanced dataset, we often report the *area under the curve* (AUC) of the *receiver operating characteristic* (ROC) as a measure of the classifier's ability to distinguish the two classes. If there are $k$ errors, accuracy will be the same irrespective of how those $k$ errors are made, i.e. misclassification of positive samples or misclassification of negative samples. AUC-ROC is a metric that treats these misclassifications asymmetrically, making it an appropriate statistic for classification tasks on imbalanced datasets. However, until this paper, AUC-ROC was hard to use as a training objective because it is not differentiable and so cannot be optimized directly by gradient descent. This paper approximates AUC-ROC by the Wilcoxon-Mann-Whitney statistic, which counts the "number of wins" over all pairwise comparisons between positive and negative samples: $$U = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}I(x_i, x_j)}{mn},$$ where $m$ is the number of positive samples, $n$ is the number of negative samples, and $I(x_i, x_j)$ is $1$ if the positive sample $x_i$ is ranked higher than the negative sample $x_j$, and $0$ otherwise. Figure 1 in the paper shows how this statistic varies with increasing imbalance in the dataset, justifying its close correspondence with AUC-ROC. Further, to make the metric smooth and differentiable, the step function in the pairwise comparison is replaced by a sigmoid or hinge function (see the sketch below). Further extensions apply this to multi-class classification tasks and focus on the top-K predictions, i.e. optimizing the lower-left part of the ROC curve.
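A minimal numpy sketch of a sigmoid-smoothed surrogate for the WMW statistic: the hard indicator over score differences is replaced by a sigmoid so the quantity becomes differentiable. The paper's own surrogate is a hinge-style polynomial with a margin, so treat the sigmoid form and the `temperature` parameter here as illustrative choices of mine.

```python
import numpy as np

def soft_wmw(pos_scores, neg_scores, temperature=1.0):
    """Differentiable approximation of the Wilcoxon-Mann-Whitney statistic.

    pos_scores: (m,) classifier scores for positive samples
    neg_scores: (n,) classifier scores for negative samples
    Returns a smooth estimate of P(score(positive) > score(negative)).
    """
    diff = pos_scores[:, None] - neg_scores[None, :]      # (m, n) pairwise score differences
    soft_wins = 1.0 / (1.0 + np.exp(-diff / temperature)) # sigmoid replaces the step function
    return soft_wins.mean()                               # maximize this, or minimize 1 - it
```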