My objective in reading this paper was to gain another perspective on, and thus a better-grounded view of, machine learning scoring functions for docking-based prediction of ligand/protein binding affinity. As quick background context: these models are useful because many therapeutic compounds act by binding to a target protein, so it can be valuable to prioritize wet lab testing on compounds predicted to have stronger binding affinity. Docking systems work by predicting the pose in which a compound (or ligand) would bind to a protein, and then scoring prospective poses based on how likely such a pose is to have high binding affinity. It's important to note that there are two predictive components in such a pipeline, and thus two sources of potential error: the search over possible binding poses, done by physics-based systems, and the scoring of the affinity of a given pose, assuming that pose were actually the correct one. In the second kind of modeling, which this paper focuses on, the model takes in features *of a particular binding pose*, including information like which atoms of the compound are near which atoms of the protein.

The actual neural network structure used here was admittedly a bit underwhelming (though, to be fair, many of the ideas it seems to be gesturing at wouldn't be properly formalized until Graph Convolutional Networks came around). I'll describe the network mechanically first, and then offer some commentary on the design choices.

https://i.imgur.com/w9wKS10.png

1. For each atom in the compound, a set of neighborhood features is defined. The neighborhood is based on two hyperparameters: one for "how many atoms from the protein should be included," and one for "how many atoms from the compound should be included." In both cases you start by adding the closest atom, and as the hyperparameter values increase, you add in farther-away atoms. The neighborhood features are: (i) the types of the atoms, (ii) the partial charges of the atoms, (iii) how far each atom is from the reference atom, and (iv) which amino acid within the protein each protein atom comes from.
2. All of these features are turned into embeddings. Yes, all of them, even the ones (distance and charge) that are continuous values. Coming from a machine learning perspective, this is... a pretty weird design choice. The authors straight-up discretize the distance values and then use the resulting bins as discrete keys for looking up embeddings. (So you'd have one embedding vector for distances in the 0.25-0.5 bin, say, and a different one for 0.0-0.25.)
3. The embeddings are concatenated into a single "atom neighborhood vector," based on a predetermined ordering of the neighbor atoms and their property vectors. We now have one atom neighborhood vector for each atom in the compound.
4. The authors then do what they call a convolution over the atom neighborhood vectors. But it doesn't act like a normal convolution in the sense of mixing information from nearby regions of atom space; it's basically a fully connected layer applied to each atom neighborhood vector separately, with shared weights, so the same layer is applied to every neighborhood vector. They then do a feature-wise max pool across the layer-transformed neighborhood vectors, getting one vector for the full compound.
5. This single vector is then put through a softmax, which predicts whether this ligand (in this particular pose) will bind strongly to the protein.
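To make the mechanics concrete, here's a minimal PyTorch sketch of that pipeline as I understand it. All of the names, dimensions, and bin counts are illustrative choices on my part (for instance, I lump protein and compound neighbors into one neighbor list and pretend amino-acid identity is defined for every neighbor), so treat it as a sketch of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AtomNeighborhoodScorer(nn.Module):
    """Sketch: per-atom neighborhood embeddings -> shared layer -> max pool -> 2-way logits."""

    def __init__(self, n_atom_types=30, n_amino_acids=21, n_dist_bins=16,
                 n_charge_bins=16, n_neighbors=12, embed_dim=8, hidden_dim=128):
        super().__init__()
        # Every feature, even discretized distance and charge, gets its own embedding table.
        self.atom_type_emb = nn.Embedding(n_atom_types, embed_dim)
        self.amino_emb = nn.Embedding(n_amino_acids, embed_dim)
        self.dist_emb = nn.Embedding(n_dist_bins, embed_dim)
        self.charge_emb = nn.Embedding(n_charge_bins, embed_dim)
        # One "atom neighborhood vector" = concatenation of all neighbor embeddings.
        neighborhood_dim = n_neighbors * 4 * embed_dim
        # The "convolution": one fully connected layer with shared weights,
        # applied to each atom's neighborhood vector independently.
        self.shared = nn.Sequential(nn.Linear(neighborhood_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, 2)  # binder vs. non-binder

    def forward(self, atom_type, amino_acid, dist_bin, charge_bin):
        # Each input: (batch, n_compound_atoms, n_neighbors) integer indices.
        feats = torch.cat([
            self.atom_type_emb(atom_type),
            self.amino_emb(amino_acid),
            self.dist_emb(dist_bin),
            self.charge_emb(charge_bin),
        ], dim=-1)                                  # (B, A, n_neighbors, 4 * embed_dim)
        neighborhood = feats.flatten(start_dim=2)   # (B, A, neighborhood_dim)
        per_atom = self.shared(neighborhood)        # same weights applied to every atom
        pooled, _ = per_atom.max(dim=1)             # feature-wise max pool over compound atoms
        return self.classifier(pooled)              # logits; softmax applied at the loss/inference step
```

Note that the only things tying the atoms together are the shared weights and the final max pool; nothing in the model learns which neighbors matter more than others.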
Some thoughts on what's going on here. First, I really don't have a good explanation for why they needed to embed a discretized version of the continuous variables, and since they don't do an ablation of that design choice, it's hard to know whether it mattered. Second, it's interesting to see, in their "convolution" (which I think is more accurately described as a Siamese network, since it's only convolution-like insofar as there are shared weights), the beginning intuitions of what would become Graph Convolutions. The authors knew that they needed a way to aggregate information from arbitrary numbers of atoms, and that they should learn representations with visibility onto neighborhoods of atoms rather than single ones, but they do so in an entirely hand-engineered way: manually specifying a fixed neighborhood and pulling in information from all those neighbors equally, in a big concatenated vector. By contrast, when Graph Convolutions come along, they act by defining a "message-passing" function for features to aggregate across graph edges (here: molecular bonds, or binary indicators of being "near enough" to another atom), which similarly allows information to be combined across neighborhoods. The 'convolution' is then basically just a simple aggregation, necessary because there's no canonical ordering of elements within a graph, so you need an order-agnostic aggregation like a sum or max pool. (A minimal sketch of what I mean is at the end of this note.)

The authors find that their method is able to improve on the hand-designed scoring functions within the docking programs. However, they also find (similar to another paper I read recently) that their model does quite well without even considering structural relationships between the binding pose and the protein, which suggests that the dataset (DUD - a dataset of 40 proteins with ~4K correctly binding ligands and ~35K ligands paired with proteins they don't bind to) and the problem given to the model are too easy. It's also hard to know how to interpret AUCs on this problem - it's one thing to be better than an existing method, but how much value does a given unit of AUC improvement buy you, in terms of actually, meaningfully reducing the wet lab time spent testing compounds? I don't know that there's much to take away from this paper in terms of useful techniques, but it is interesting to see the evolution of ideas that would later be more cleanly formalized in other works.
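For contrast, here's an equally minimal sketch of the message-passing aggregation I'm describing above. This is a generic graph-convolution-style layer, not anything from this paper, and the names are my own.

```python
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """One round of neighbor aggregation over a graph of atoms."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adjacency):
        # node_feats: (n_atoms, in_dim) continuous features; no discretization required.
        # adjacency: (n_atoms, n_atoms) 0/1 matrix built from bonds or a "near enough" distance cutoff.
        messages = adjacency @ node_feats           # sum each atom's neighbor features
        return torch.relu(self.linear(messages))    # order-agnostic: permuting neighbors changes nothing

# Stacking a few of these layers and then doing an order-agnostic readout over atoms
# (a sum or max pool) plays the same role as the paper's feature-wise max pool.
```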