![]() |
Welcome to ShortScience.org! |
![]() ![]() ![]() |
[link]
Transformers - powered by self-attention mechanisms - have been a paradigm shift in NLP, and are now the standard choice for training large language models. However, while transformers do have many benefits in terms of computational constraints - most saliently, that attention between tokens can be computed in parallel, rather than needing to be evaluated sequentially like in a RNN - a major downside is their memory (and, secondarily, computational) requirements. The baseline form of self-attention works by having every token attend to every other token, where "attend" here means that a query from each token A will take an inner product with each other token -A, and then be elementwise-multiplied with the values of every other token -A. This implies a O(N^2) memory and computation requirement, where N is your sequence length. So, the question this paper asks is: how do you get the benefits, or most of the benefits, of a full-attention network, while reducing the number of other tokens each token attends to. The authors' solution - Big Bird - has three components. First, they approach the problem of approximating the global graph as a graph theory problem, where each token attending to every other is "fully connected," and the goal is to try to sparsify the graph in a way that keeps shortest path between any two nodes low. They use the fact that in an Erdos-Renyi graph - where very edge is simply chosen to be on or off with some fixed probability - the shortest path is known to be logN. In the context of aggregating information about a sequence, a short path between nodes means that the number of iterations, or layers, that it will take for information about any given node A to be part of the "receptive field" (so to speak) of node B, will be correspondingly short. Based on this, they propose having the foundation of their sparsified attention mechanism be simply a random graph, where each node attends to each other with probability k/N, where k is a tunable hyperparameter representing how many nodes each other node attends to on average. To supplement, the authors further note that sequence tasks of interest - particularly language - are very local in their information structure, and, while it's important to understand the global context of the full sequence, tokens close to a given token are most likely to be useful in constructing a representation of it. Given this, they propose supplementing their random-graph attention with a block diagonal attention, where each token attends to w/2 tokens prior to and subsequent to itself. (Where, again, w is a tunable hyperparameter) However, the authors find that these components aren't enough, and so they add a final component: having some small set of tokens that attend to all tokens, and are attended to by all tokens. This allows them to theoretically prove that Big Bird can approximate full sequences, and is a universal Turing machine, both of which are true for full Transformers. I didn't follow the details of the proof, but, intuitively, my reading of this is that having a small number of these global tokens basically acts as a shortcut way for information to get between tokens in the sequence - if information is globally valuable, it can be "written" to one of these global aggregator nodes, and then all tokens will be able to "read" it from there. The authors do note that while their sparse model approximates the full transformer well in many settings, there are some problems - like needing to find the token in the sequence that a given token is farthest from in vector space - that a full attention mechanism could solve easily (since it directly calculates all pairwise comparisons) but that a sparse attention mechanism would require many layers to calculate. Empirically, Big Bird ETC (a version which adds on additional tokens for the global aggregators, rather than making existing tokens serve thhttps://i.imgur.com/ks86OgJ.pnge purpose) performs the best on a big language model training objective, has comparable performance to existing models on questionhttps://i.imgur.com/x0BdamC.png answering, and pretty dramatic performance improvements in document summarization. It makes sense for summarization to be a place where this model in particular shines, because it's explicitly designed to be able to integrate information from very large contexts (albeit in a randomly sampled way), where full-attention architectures must, for reasons of memory limitation, do some variant of a sliding window approach. ![]() |
[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $W$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So $$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$ Let's consider an example matrix where of three customers (as rows) are associated with three movies (the columns) by a rating value. $$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$ We can decompose this into two matrices with $r = 1$. First lets do this without any non-negative constraint using an SVD reshaping matrices based on removing eigenvalues: $$ W = \left[\begin{array}{c c c} -0.656 \\\ -0.652 \\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20\\\\ \end{array}\right] $$ We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$): $$ W = \left[\begin{array}{c c c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \\\\ \end{array}\right] $$ Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. $$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\ 4.33 & 4.08 & 2.09 \\\ 2.52 & 2.37 & 1.21 \\\ \end{array}\right] $$ If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better` #### Paper Contribution This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. The multiplicative approach is discussed in contrast to an additive gradient decent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$. ### Still a draft ![]() |
[link]
This is a nice little empirical paper that does some investigation into which features get learned during the course of neural network training. To look at this, it uses a notion of "decodability", defined as the accuracy to which you can train a linear model to predict a given conceptual feature on top of the activations/learned features at a particular layer. This idea captures the amount of information about a conceptual feature that can be extracted from a given set of activations. They work with two synthetic datasets. 1. Trifeature: Generated images with a color, shape, and texture, which can be engineered to be either entirely uncorrelated or correlated with each other to varying degrees. 2. Navon: Generated images that are letters on the level of shape, and are also composed of letters on the level of texture The first thing the authors investigate is: to what extent are the different properties of these images decodable from their representations, and how does that change during training? In general, decodability is highest in lower layers, and lowest in higher layers, which makes sense from the perspective of the Information Processing Inequality, since all the information is present in the pixels, and can only be lost in the course of training, not gained. They find that decodability of color is high, even in the later layers untrained networks, and that the decodability of texture and shape, while much less high, is still above chance. When the network is trained to predict one of the three features attached to an image, you see the decodability of that feature go up (as expected), but you also see the decodability of the other features go down, suggesting that training doesn't just involve amplifying predictive features, but also suppressing unpredictive ones. This effect is strongest in the Trifeature case when training for shape or color; when training for texture, the dampening effect on color is strong, but on shape is less pronounced. https://i.imgur.com/o45KHOM.png The authors also performed some experiments on cases where features are engineered to be correlated to various degrees, to see which of the predictive features the network will represent more strongly. In the case where two features are perfectly correlated (and thus both perfectly predict the label), the network will focus decoding power on whichever feature had highest decodability in the untrained network, and, interestingly, will reduce decodability of the other feature (not just have it be lower than the chosen feature, but decrease it in the course of training), even though it is equally as predictive. https://i.imgur.com/NFx0h8b.png Similarly, the network will choose the "easy" feature (the one more easily decodable at the beginning of training) even if there's another feature that is slightly *more* predictive available. This seems quite consistent with the results of another recent paper, Shah et al, on the Pitfalls of Simplicity Bias in neural networks. The overall message of both of these experiments is that networks generally 'put all their eggs in one basket,' so to speak, rather than splitting representational power across multiple features. There were a few other experiments in the paper, and I'd recommend reading it in full - it's quite well written - but I think those convey most of the key insights for me. ![]() |
[link]
The goal of this paper is to learn a model that embeds 2D keypoints(the locations of specific key body parts in 2D space) representing a particular pose into a vector embedding where nearby points in embedding space are also nearby in 3D space. This sort of model is useful because the same 3D pose can generate a wide variety of 2D pose projections, and it can be useful to learn which apparently-distinct representations actually map to the same 3D pose. To do this, the basic approach used by the authors (with has a few variants), is - Take a dataset of 3D poses, and corresponding 2D projections - Define a notion of "matching" 3D poses, based on a parameter kappa, which designates the maximum average per-joint distance at which two 3D poses can be considered the same - Construct triplets composed of an anchor pose, a "positive" pose (a different 2D pose with a matching 3D pose), and a "negative" pose (some other 2D pose sampled from the dataset using a strategy that explicitly seeks out hard negative examples) - Calculate a triplet loss, that pushes positive examples closer together, and pulls negative examples farther apart. This is specifically done by defining a probabilistic representation of p(match | z1, z2), or, the probability of a match in 3D space given the embeddings of the two 2D poses. This is parametrized using a sigmoid with trainable parameters, as shown below https://i.imgur.com/yFCCVuA.png - They they calculate a distance kernel as the negative log of that probability, and calculate the basic triplet loss, which tries to maximize the diff between the the distance between negative examples, and the distance between positive examples. - They also add an additional loss further incentivizing the match probability to be higher on the positive pair (in addition to just pushing the positive and negative pair further apart) - The final loss is a Gaussian prior loss, incentivizing the learned embeddings z to be in the shape of a Gaussian https://i.imgur.com/SxvcvJG.png This represents the central shape of the method. Some additional ablations include: - Camera Augmentation: Creational additional triplets by taking existing 3D poses and generating artificial pairs of 2D poses at different camera views - Temporal Pose Embedding - Embedding multiple temporally connected pose, rather than just a single one - Keypoint Dropout - To simulate situations where some keypoints are occluded, the authors tried training with some keypoints dropped out, either keypoints selected at random, or selected jointly and non-independently based on a model of which keypoints are likely to be occluded together The authors found that their method was generally quite a bit stronger that prior approaches for the task of querying similar 3D poses given a 2D pose input, including some alternate methods that do direct 3D estimation. ![]() |
[link]
Cover's Universal Portfolio is an information-theoretic portfolio optimization algorithm that utilizes constantly rebalanced porfolios (CRP). A CRP is one in which the distribution of wealth among stocks in the portfolio remains the same from period to period. Universal Portfolio strictly performs rebalancing based on historical pricing, making no assumptions about the underlying distribution of the prices. The wealth achieved by a CRP over n periods is: $S_n(b,x^n) = \displaystyle \prod_{n}^{i=1} b \cdot x$ The key takeaway: Where $\mathrm{b}$ is the allocation vector. Cover takes the integral of the wealth over the entire portfolio to give $b_{t+1}$. This is what makes it "universal". Most implementations in practice do this discretely, by creating a matrix $\mathrm{B}$ with each row containing a combination of the percentage allocatio, and calculating $\mathrm{S} = \mathrm{B\dot x}$. Cover mentions trading costs will eat away most of the gains, especially if this algorithm is allowed to rebalance daily. Nowadays, there are commission-free brokers. See this summary for Universal Portfolios without transaction costs: \cite{conf/colt/BlumK97} ![]()
2 Comments
|