[link]
Summary by Gavin Gray 4 years ago
A common problem in phylogenetics is:
1. I have $p(\text{DNA sequences} | \text{tree})$ and $p(\text{tree})$.
2. I've used these to run an MCMC algorithm and generate many (approximate) samples from $p(\text{tree} | \text{DNA sequences})$.
3. I want to evaluate $p(\text{tree} | \text{DNA sequences})$.
The first solution you might think of is to add up how many times you saw each *tree topology* and divide by the total number of MCMC samples; referred to in this paper as *simple sample relative frequencies* (SRF). An obvious problem with this method is that if you didn't happen to sample a tree topology you will assign it zero probability.
This paper focuses on producing a better solution to this problem, by defining a distribution over trees that's easy to fit and provides support over the entire space of tree topologies.
What is a Subsplit Bayesian Network?
==============================
Phylogenies are leaf-labeled bifurcating trees, binary trees with labeled leaves. **Clades** are nonempty subsets of the leaf labels; labeled in the following figure as $C_1 \to C_7$:
https://i.imgur.com/khS3uSo.png
To build a phylogeny, I can just take the full set of leaves and recursively split them into clades as described in the above diagram. This means that the distribution over phylogenies is equivalent to the distribution over clades. To make this distribution tractable, it is typically assumed that clades are conditionally independent given parent clades; called the *Conditional Clade Distribution* (CCD) assumption, attributed here to [Larget][] and [Hohna][].
So, a phylogeny may be described by its clades, this paper proposes that these clades may be described by their **subsplits**; the splitting process placing leaf nodes into one of two clades. Then, the authors note this process is a directed Bayesian network and that this Bayesian network must include all possible clade splits. It is therefore a complete binary tree with depth $N-1$ (where $N$ is the number of leaf nodes).
Fitting This Parameterisation to MCMC Samples
=====================================
Under this parameterisation the likelihood of observing a given tree is:
$$
p(\text{tree}) = p(S_1 = s_{1,k}) \prod_{i>1} p(S_i = s_{i,k} | S_{\pi_i} = s_{\pi_i, k}),
$$
assuming a collection of subsplits $T_k = \{ S_i = s_{i,k}, i \geq 1 \}$. In this definition $S_{\pi_i}$ is index set of parent nodes of $S_i$; ie a subsplit can depend on any of the parent nodes.
The maximum likelihood probabilities for possible subsplits can be solved in closed form; algorithmically involving counting the occurrences of subsplits and dividing by the number of trees observed. If this seems exactly the same as SRF, that's because it is; I [checked the published code to verify this][sbn_ds1].
The authors then consider the case of unrooted trees, where the log likelihood of observed trees can't be easily factorised. They then present a [simple averaging][sa] (not sure where this variational method is discussed in that paper, appears to be under a different name?) and an EM algorithm to fit SBN models over such distributions. They also discuss conditional probability sharing (parameterising conditional on parent-child relationships) immediately before this and it's not clear if this is used in the distribution fit by SA or EM.
Experiments show the distributions fit using the EM algorithm perform well according to KL divergence from a "true" distribution defined by fitting a distribution using SRF to a larger dataset. They also show less dispersion estimating the probability of sampled trees versus that estimated by this "ground truth".
[sa]: https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf
[sbn_ds1]: https://github.com/morrislab/sbn/blob/master/experiments/DS1.ipynb
[larget]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676676/
[hohna]: https://academic.oup.com/sysbio/article/61/1/1/1676649

more
less