Summary by CodyWild
In January 2020, DeepMind released a model called AlphaFold, which uses convolutional networks on top of sequence-based and evolutionary features to predict protein folding structure. In particular, their model was designed to predict a distribution over how far apart each pair of amino acids will be in the final folded structure. Given such a trained model, you can score a candidate structure according to how likely its distances are under the model, and - if your process for generating candidates is differentiable, as it is in this case - you can directly optimize the structure to increase that likelihood.
https://i.imgur.com/9ZBhqRo.png
The distance-prediction model takes as input two main categories of feature:
1. Per-residue features characterizing which amino acid occupies each position, derived from techniques that produce either a one-hot amino acid type or a distribution over amino acid types (a minimal encoding sketch follows this list).
2. Residue-pair features based on the parameters of Multiple Sequence Alignment (MSA) models. I don't deeply understand the details of how these specific models work, but at a high level: MSA features build on the evolutionary intuition that residues which make contact within a protein will likely evolve in a correlated way, and that you can estimate these correlations by comparing highly similar proteins (which were likely close in evolutionary time).
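As a concrete illustration of feature (1), here is a minimal sketch of one-hot per-residue encoding. The amino acid alphabet constant, the function name, and the toy sequence are my own assumptions, not the paper's code.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_residues(sequence: str) -> np.ndarray:
    """Encode a protein sequence as an (L, 20) one-hot feature matrix."""
    features = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        features[pos, AA_INDEX[aa]] = 1.0
    return features

per_residue = one_hot_residues("MKTAYIAKQR")  # toy sequence -> shape (10, 20)
```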
https://i.imgur.com/h16lPwU.png
These features are stacked in an LxL grid, with the per-residue-pair features differing at each point in the grid, and the per-residue features staying constant along a full row or column (since they correspond to a given residue i for all j). One relevant note here is that proteins can be hundreds or thousands of residues long, so you can't actually construct a full LxL matrix, either on the input or output end. Instead, the notional full LxL grid is subdivided into a coarser grid of 64-residue square regions, and a single one of these 64x64 regions (whose residues could be either adjacent or far apart in the protein) is passed into the model at a time.
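To make the row/column tiling and the 64x64 cropping concrete, here is a hedged NumPy sketch; the shapes, function names, and crop indexing are my own illustration, not the paper's implementation.

```python
import numpy as np

def build_pair_grid(per_residue: np.ndarray, pair: np.ndarray) -> np.ndarray:
    """Stack per-residue (L, D) and pairwise (L, L, P) features into (L, L, 2D + P)."""
    L, D = per_residue.shape
    row = np.broadcast_to(per_residue[:, None, :], (L, L, D))  # residue i: constant along each row
    col = np.broadcast_to(per_residue[None, :, :], (L, L, D))  # residue j: constant along each column
    return np.concatenate([row, col, pair], axis=-1)

def crop_region(grid: np.ndarray, i0: int, j0: int, size: int = 64) -> np.ndarray:
    """Cut one size x size region out of the notional full L x L grid."""
    return grid[i0:i0 + size, j0:j0 + size]
```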
Given this 64x64x<features> input, the model applies several layers of dilated convolutions - which allow features at a given point in the grid to be informed by information farther away - still in a 2D arrangement. The model then outputs a 64x64 grid (one element for each [i, j] amino acid pair), where each element is a 64-bin discretized probability distribution over the distance between those two residues. When I say "discretized probability distribution," what I actually mean is "histogram". This discretization of the output distribution, where you predict how much probability mass falls in each possible distance bin, allows for more flexible and finer-grained predicted distributions than you could get with, for example, a continuous Gaussian centered around a single point. Amusingly, because the model predicts distance histograms for each residue pair, the authors term the output a "distogram". During training, the next-to-last layer of the model is also used to predict per-residue auxiliary features: the accessible surface area of the residue in the folded structure, and the secondary structure type (helix, strand, etc.) that the residue will be part of. However, these are only used to provide more signal during training, and aren't used for either protein structure optimization or calculation of test scores.
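Below is a minimal PyTorch sketch of what a dilated-convolution distogram network of this shape might look like. The channel counts, dilation schedule, and depth are placeholder assumptions - the real network is a far deeper residual tower - and I've omitted the auxiliary heads.

```python
import torch
import torch.nn as nn

class DistogramNet(nn.Module):
    def __init__(self, in_channels: int, hidden: int = 64, n_bins: int = 64):
        super().__init__()
        layers = [nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)]
        # Growing dilation rates let each grid cell see increasingly
        # distant residue pairs while staying fully 2D.
        for dilation in (1, 2, 4, 8):
            layers += [
                nn.ELU(),
                nn.Conv2d(hidden, hidden, kernel_size=3,
                          padding=dilation, dilation=dilation),
            ]
        self.trunk = nn.Sequential(*layers)
        # Distogram head: per (i, j) pair, logits over n_bins distance bins.
        self.to_bins = nn.Conv2d(hidden, n_bins, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, 64, 64) -> (batch, n_bins, 64, 64)
        return self.to_bins(self.trunk(x))

logits = DistogramNet(in_channels=100)(torch.randn(1, 100, 64, 64))
distogram = torch.softmax(logits, dim=1)  # histogram over distance bins per pair
```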
To actually generate predicted fold structures, the authors construct a generative model of fold structure in which each amino acid is assigned two torsion angles that govern its connection to its neighbors. By setting these torsion angles to different values, you can twist and reshape the protein as a whole. Given this generative model, things proceed as you might suspect: you generate a candidate, calculate the resulting inter-residue distances, calculate the likelihood of those distances under the model you've learned, and send back a gradient that changes your torsion angles to make that likelihood higher.
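The following toy sketch shows the shape of that optimization loop. Everything here is a stand-in: a 2D chain with one angle per residue replaces the real two-torsion-angle backbone geometry, random logits replace the trained distogram, and a soft bin assignment keeps the likelihood differentiable.

```python
import torch

L, n_bins = 10, 64
bin_edges = torch.linspace(2.0, 22.0, n_bins + 1)  # assumed distance bins (angstroms)
bin_log_probs = torch.log_softmax(torch.randn(L, L, n_bins), dim=-1)  # fake "distogram"
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])   # midpoint of each distance bin

angles = (0.1 * torch.randn(L)).requires_grad_()   # toy stand-in for torsion angles
optimizer = torch.optim.Adam([angles], lr=0.05)

for step in range(200):
    # Differentiable geometry: unit-length bonds whose heading is the
    # cumulative sum of the chain's angles.
    headings = torch.cumsum(angles, dim=0)
    coords = torch.cumsum(
        torch.stack([torch.cos(headings), torch.sin(headings)], dim=-1), dim=0)
    dists = torch.cdist(coords, coords)            # (L, L) inter-residue distances

    # Soft-assign each distance to the bins (a hard bucketize isn't
    # differentiable) and score it under the distogram.
    weights = torch.softmax(-(dists[..., None] - centers) ** 2, dim=-1)
    log_likelihood = (weights * bin_log_probs).sum()

    optimizer.zero_grad()
    (-log_likelihood).backward()                   # ascend the likelihood
    optimizer.step()
```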
Empirically, the DeepMind authors evaluated on a competition dataset, and specifically compared themselves against other approaches that (like theirs) didn't make predictions for a new protein by comparing against similar known templates (Template Modeling, or TM), but instead modeled from raw features (Free Modeling, or FM). AlphaFold achieved high accuracy on 24 of the 43 test domains (where a domain is a segment of a protein that folds semi-independently), compared to the next best method, which got only 14 of the 43. Definitely still not perfect, since almost half of the test domains were out of its reach, but fairly compelling evidence that there's value in DeepMind's approach.