Welcome to ShortScience.org! |
[link]
A very simple (but impractical) discrete model of subclonal evolution would include the following events: * Division of a cell to create two cells: * **Mutation** at a location in the genome of the new cells * Cell death at a new timestep * Cell survival at a new timestep Because measurements of mutations are usually taken at one time point, this is taken to be at the end of a time series of these events, where a tiny of subset of cells are observed and a **genotype matrix** $A$ is produced, in which mutations and cells are arbitrarily indexed such that $A_{i,j} = 1$ if mutation $j$ exists in cell $i$. What this matrix allows us to see is the proportion of cells which *both have mutation $j$*. Unfortunately, I don't get to observe $A$, in practice $A$ has been corrupted by IID binary noise to produce $A'$. This paper focuses on difference inference problems given $A'$, including *inferring $A$*, which is referred to as **`noise_elimination`**. The other problems involve inferring only properties of the matrix $A$, which are referred to as: * **`noise_inference`**: predict whether matrix $A$ would satisfy the *three gametes rule*, which asks if a given genotype matrix *does not describe a branching phylogeny* because a cell has inherited mutations from two different cells (which is usually assumed to be impossible under the infinite sites assumption). This can be computed exactly from $A$. * **Branching Inference**: it's possible that all mutations are inherited between the cells observed; in which case there are *no branching events*. The paper states that this can be computed by searching over permutations of the rows and columns of $A$. The problem is to predict from $A'$ if this is the case. In both problems inferring properties of $A$, the authors use fully connected networks with two hidden layers on simulated datasets of matrices. For **`noise_elimination`**, computing $A$ given $A'$, the authors use a network developed for neural machine translation called a [pointer network][pointer]. They also find it necessary to map $A'$ to a matrix $A''$, turning every element in $A'$ to a fixed length row containing the location, mutation status and false positive/false negative rate. Unfortunately, reported results on real datasets are reported only for branching inference and are limited by the restriction on input dimension. The inferred branching probability reportedly matches that reported in the literature. [pointer]: https://arxiv.org/abs/1409.0473 |
[link]
This preprint is a bit rambling, and I don't know that I fully followed what it was doing, but here's my best guess: https://i.imgur.com/xC2ryzp.png - We think it's probably the case that SARS-COV2 (COVID19) uses a protease (enzyme involved in its reproduction) that isn't available and co-optable in the human body, and is also quite similar to the comparable protease protein in the original SARS virus. Therefore, it is hoped that we might be able to take inhibitors that bind to SARS, and modify them in small ways to make them bond to SARS-COV2 - The paper notes that it's specifically interested in targeted covalent inhibitors. These are drugs that inhibit the function of a protein by actually covalently binding with the relevant binding pocket, as opposed to most drugs, which by default just fit neatly inside the pocket and occupy it much of the time in equilibrium, but don't actually form permanent, stable covalent bonds with the protein. This class of drugs can be more effective, because its binding is stronger and more permanent, but it can also be more dangerous, because its potential side effects if it binds with a non-intended protein pocket can be more severe. - In order to get a covalently-binding drug that fits with the pocket of SARS-COV2, the authors start with a known inhibitor from SARS, and then use reinforcement learning to make modifications to it. The allowed modification actions consist of adding or removing "fragments" rather than atoms, where "fragments" here refers to coherent subcomponents of other drugs from similar families that were broken up according to hand-coded chemical rules. This approach is more stable than just adding on molecules, because at every stage in generation, the generated molecule will be coherent and chemically sound. - The part I don't fully follow is what they use for the reward function for compounds that are in the process of being made. They specify that they do reward intermediate compounds, rather than just ones at the end of generation, but don't specify what goes into the reward. If I had to guess, I'd imagine it consists of (1) a molecular docking simulation that can't be differentiated through, and thus can't be used directly as a loss function, and/or (2) hand-coded heuristics from chemists for what makes a stable binding |
[link]
Most of the interesting mechanics within living things are mediated by interactions between proteins, making it important and useful to have good predictive models of whether proteins will interact with one another, for validating possible interaction graph structures. Prior methods for this problem - which takes as its input sequence representations of two proteins, and outputs a probability of interaction - have pursued different ideas for how to combine information from the two proteins. On a basic level, you need your method to be independent to the ordering of the proteins, since the property we care about is a property of a pair of proteins, not a property of a particular ordering of proteins. Some examples of those have included: - A kernel function between some representation of proteins - Representing a protein pair according to whether and how often given k-mer sequences co-occur in both proteins This paper's DPPI method is built on a Siamese network, which applies a single shared set of convolutional layers to each of the two proteins, and then calculates a "binding score," that structurally acts a bit like a similarity score, but with allowances for proteins to be complementary, rather than just similar. In more detail: https://i.imgur.com/8ruY9es.png 1. Crop each protein into multiple overlapping subsequences of length 512 amino acids. Perform all following steps for every combination of cropped subsequences between the two proteins. (If A is divided into A1 and A2, you'd do the following steps for A1 x B and A2 x B and take the max score out of the two) 2. Each cropped protein is represented as a probabilistic sequence. Since we can't get fully certain sequences of what amino acid is at each point in the chain, we instead pass in a 20x512 representation, where at each of the 512 locations we have a distribution over 20 possible amino acids. This tensor is passed into multiple layers of convolutional network, with the same network weights applied to each of the two proteins. 3. A random projection is applied to the outputs of the convolutional network. The features that come out of the projection are conceptually similar to feature maps that might come out of a neural network layer, except that the weights aren't learned. This random projection has a specialized structure, in that it's composed of two (randomly-weighted) networks, A and B, each of which result in feature maps A1...K and B1...K. For protein 1, the outputs of the network are ordered A1...AK B1...BK, whereas for protein 2, the weights are swapped, and so the outputs are ordered B1...BK A1...AK. 4. A Hadamard product between the two random projection outputs. This is basically a dot product, but for matrices (you multiply each element in the matrix by the corresponding element in the other matrix). This is basically like calculating a similarity score between the feature values of the randomly projected features. One benefit of doing the odd reordering in the prior step is that it breaks symmetry: if we took a dot product between features calculated by a fully shared-weight network, then we'd be looking explicitly for similarity between sequence features, which might not be sufficient to know if proteins interact in a complementary way. Another benefit is that it makes the final fully connected layer (which predicts interaction) agnostic to the order of inputs. (Caveat: I only about 70% follow the logic of this) In the example above, the 1st element will end up being A1(Prot1) x B1(ProtB), and the K+1th element will end up being B1(Prot1) x A1(ProtB). Since multiplication is order-independent, both values 1 and K+1 represent the similarity between the proteins according to features A/B1. 5. Once you have this final representation, feed it into a fully connected layer https://i.imgur.com/3LsgZNn.png The authors show superior performance to past methods, and even show that they can get 96% accuracy on protein interactions within humans from training on a non-human species, showing that a lot of the biomechanical logic transfers. https://i.imgur.com/REoU3Ab.png They did an ablation test and showed that the random projection layer added value, but also that it was better to have the weights be random than learned, which was surprising to me, and suggests the model as a whole is prone to overfit. |
[link]
My objective in reading this paper was to gain another perspective on, and thus a more well-grounded view of, machine learning scoring functions for docking-based prediction of ligand/protein binding affinity. As quick background context, these models are useful because many therapeutic compounds act by binding to a target protein, and it can be valuable to prioritize doing wet lab testing on compounds that are predicted to have a stronger binding affinity. Docking systems work by predicting the pose in which a compound (or ligand) would bind to a protein, and then scoring prospective poses based on how likely such a pose would be to have high binding affinity. It's important to note that there are two predictive components in such a pipeline, and thus two sources of potential error: the searching over possible binding poses, done by physics-based systems, and scoring of the affinity of a given pose, assuming that were actually the correct one. Therefore, in the second kind of modeling, which this paper focuses on, you take in features *of a particular binding pose*, which includes information like which atoms of the compound are nearby to which atoms of the protein. The actual neural network structure used here was admittedly a bit underwhelming (though, to be fair, many of the ideas it seems to be gesturing at wouldn't be properly formalized until Graph Convolutional Networks came around). I'll describe the network mechanically first, and then offer some commentary on the design choices. https://i.imgur.com/w9wKS10.png 1. For each atom (a) in the compound, a set of neighborhood features are defined. The neighborhood is based on two hyperparameters, one for "how many atoms from the protein should be included," and one for "how many atoms from the compound should be included". In both cases, you start by adding the closest atom from either the compound or protein, and as hyperparameter values of each increase, you add in farther-away atoms. The neighborhood features here are (i) What are the types of the atoms? (ii) What are the partial charges of the atoms? (iii) How far are the atoms from the reference atom? (iiii) What amino acid within the protein do the protein atoms come? 2. All of these features are turned into embeddings. Yes, all of them, even the ones (distance and charge) that are continuous values. Coming from a machine learning perspective, this is... pretty weird as a design choice. The authors straight-up discretize the distance values, and then use those as discrete values for the purpose of looking up embeddings. (So, you'd have one embedding vector for distance (0.25-0.5, and a different one for 0.0-0.25, say). 3. The embeddings are concatenated together into a single "atom neighborhood vector" based on a predetermined ordering of the neighbor atoms and their property vectors. We now have one atom neighborhood vector for each atom in the compound. 4. The authors then do what they call a convolution over the atom neighborhood vectors. But it doesn't act like a normal convolution in the sense of mixing information from nearby regions of atom space. It just is basically a fully connected layer that's applied to atom neighborhood vector separately, but with shared weights, so the same layer is applied to each neighborhood vector. They then do a feature-wise max pool across the layer-transformed version of neighborhood vectors, getting you one vector for the full compound 5. This single vector is then put into a softmax, which predicts whether this ligand (in in this particular pose) will have strong binding with the protein Some thoughts on what's going on here. First, I really don't have a good explanation for why they'd have needed to embed a discretized version of the continuous variables, and since they don't do an ablation test of that design choice, it's hard to know if it mattered. Second, it's interesting to see, in their "convolution" (which I think is more accurately described as a Siamese Network, since it's only convolution-like insofar as there are shared weights), the beginning intuitions of what would become Graph Convolutions. The authors knew that they needed methods to aggregate information from arbitrary numbers of atoms, and also that they need should learn representations that have visibility onto neighborhoods of atoms, rather than single ones, but they do so in an entirely hand-engineered way: manually specifying a fixed neighborhood and pulling in information from all those neighbors equally, in a big concatenated vector. By contrast, when Graph Convolutions come along, they act by defining a "message-passing" function for features to aggregate across graph edges (here: molecular bonds or binaries on being "near enough" to another atom), which similarly allows information to be combined across neighborhoods. And, then, the 'convolution' is basically just a simple aggregation: necessary because there's no canonical ordering of elements within a graph, so you need an order-agnostic aggregation like a sum or max pool. The authors find that their method is able to improve on the hand-designed scoring functions within the docking programs. However, they also find (similar to another paper I read recently) that their model is able to do quite well without even considering structural relationships of the binding pose with the protein, which suggests the dataset (DUD - a dataset of 40 proteins with ~4K correctly binding ligands, and ~35K ligands paired with proteins they don't bind to) and problem given to the model is too easy. It's also hard to tell how I should consider AUCs within this problem - it's one thing to be better than an existing method, but how much value do you get from a given unit of AUC improvement, when it comes to actually meaningfully reducing wetlab time used on testing compounds? I don't know that there's much to take away from this paper in terms of useful techniques, but it is interesting to see the evolution of ideas that would later be more cleanly formalized in other works. |
[link]
This paper focuses on the application of deep learning to the docking problem within rational drug design. The overall objective of drug design or discovery is to build predictive models of how well a candidate compound (or "ligand") will bind with a target protein, to help inform the decision of what compounds are promising enough to be worth testing in a wet lab. Protein binding prediction is important because many small-molecule drugs, which are designed to be small enough to get through cell membranes, act by binding to a specific protein within a disease pathway, and thus blocking that protein's mechanism. The formulation of the docking problem, as best I understand it, is: 1. A "docking program," which is generally some model based on physical and chemical interactions, takes in a (ligand, target protein) pair, searches over a space of ways the ligand could orient itself within the binding pocket of the protein (which way is it facing, where is it twisted, where does it interact with the protein, etc), and ranks them according to plausibility 2. A scoring function takes in the binding poses (otherwise known as binding modes) ranked the highest, and tries to predict the affinity strength of the resulting bond, or the binary of whether a bond is "active". The goal of this paper was to interpose modern machine learning into the second step, as alternative scoring functions to be applied after the pose generation . Given the complex data structure that is a highly-ranked binding pose, the hope was that deep learning would facilitate learning from such a complicated raw data structure, rather than requiring hand-summarized features. They also tested a similar model structure on the problem of predicting whether a highly ranked binding pose was actually the empirically correct one, as determined by some epsilon ball around the spatial coordinates of the true binding pose. Both of these were binary tasks, which I understand to be 1. Does this ranked binding pose in this protein have sufficiently high binding affinity to be "active"? This is known as the "virtual screening" task, because it's the relevant task if you want to screen compounds in silico, or virtually, before doing wet lab testing. 2. Is this ranked binding pose the one that would actually be empirically observed? This is known as the "binding mode prediction" task The goal of this second task was to better understand biases the researchers suspected existed in the underlying dataset, which I'll explain later in this post. The researchers used a graph convolution architecture. At a (very) high level, graph convolution works in a way similar to normal convolution - in that it captures hierarchies of local patterns, in ways that gradually expand to have visibility over larger areas of the input data. The distinction is that normal convolution defines kernels over a fixed set of nearby spatial coordinates, in a context where direction (the pixel on top vs the pixel on bottom, etc) is meaningful, because photos have meaningful direction and orientation. By contrast, in a graph, there is no "up" or "down", and a given node doesn't have a fixed number of neighbors (whereas a fixed pixel in 2D space does), so neighbor-summarization kernels have to be defined in ways that allow you to aggregate information from 1) an arbitrary number of neighbors, in 2) a manner that is agnostic to orientation. Graph convolutions are useful in this problem because both the summary of the ligand itself, and the summary of the interaction of the posed ligand with the protein, can be summarized in terms of graphs of chemical bonds or interaction sites. Using this as an architectural foundation, the authors test both solo versions and ensembles of networks: https://i.imgur.com/Oc2LACW.png 1. "L" - A network that uses graph convolution to summarize the ligand itself, with no reference to the protein it's being tested for binding affinity with 2. "LP" - A network that uses graph convolution on the interaction points between the ligand and protein under the binding pose currently being scored or predicted 3. "R" - A simple network that takes into account the rank assigned to the binding pose by the original docking program (generally used in combination with one of the above). The authors came to a few interesting findings by trying different combinations of the above model modules. First, they found evidence supporting an earlier claim that, in the dataset being used for training, there was a bias in the positive and negative samples chosen such that you could predict activity of a ligand/protein binding using *ligand information alone.* This shouldn't be possible if we were sampling in an unbiased way over possible ligand/protein pairs, since even ligands that are quite effective with one protein will fail to bind with another, and it shouldn't be informationally possible to distinguish the two cases without protein information. Furthermore, a random forest on hand-designed features was able to perform comparably to deep learning, suggesting that only simple features are necessary to perform the task on this (bias and thus over-simplified) Specifically, they found that L+LP models did no better than models of L alone on the virtual screening task. However, the binding mode prediction task offered an interesting contrast, in that, on this task, it's impossible to predict the output from ligand information alone, because by construction each ligand will have some set of binding modes that are not the empirically correct one, and one that is, and you can't distinguish between these based on ligand information alone, without looking at the actual protein relationship under consideration. In this case, the LP network did quite well, suggesting that deep learning is able to learn from ligand-protein interactions when it's incentivized to do so. Interestingly, the authors were only able to improve on the baseline model by incorporating the rank output by the original docking program, which you can think of an ensemble of sorts between the docking program and the machine learning model. Overall, the authors' takeaways from this paper were that (1) we need to be more careful about constructing datasets, so as to not leak information through biases, and (2) that graph convolutional models are able to perform well, but (3) seem to be capturing different things than physics-based models, since ensembling the two together provides marginal value. |