19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology
Mathieu Fourment
and
Andrew F. Magee
and
Chris Whidden
and
Arman Bilge
and
Frederick A. Matsen IV
and
Vladimir N. Minin
arXiv e-Print archive - 2018 via Local arXiv
Keywords:
q-bio.PE, stat.CO
First published: 2018/11/28 (6 years ago) Abstract: The marginal likelihood of a model is a key quantity for assessing the
evidence provided by the data in support of a model. The marginal likelihood is
the normalizing constant for the posterior density, obtained by integrating the
product of the likelihood and the prior with respect to model parameters. Thus,
the computational burden of computing the marginal likelihood scales with the
dimension of the parameter space. In phylogenetics, where we work with tree
topologies that are high-dimensional models, standard approaches to computing
marginal likelihoods are very slow. Here we study methods to quickly compute
the marginal likelihood of a single fixed tree topology. We benchmark the speed
and accuracy of 19 different methods to compute the marginal likelihood of
phylogenetic topologies on a suite of real datasets. These methods include
several new ones that we develop explicitly to solve this problem, as well as
existing algorithms that we apply to phylogenetic models for the first time.
Altogether, our results show that the accuracy of these methods varies widely,
and that accuracy does not necessarily correlate with computational burden. Our
newly developed methods are orders of magnitude faster than standard
approaches, and in some cases, their accuracy rivals the best established
estimators.
This paper compares methods to calculate the marginal likelihood, $p(D | \tau)$, when you have a tree topology $\tau$ and some data $D$ and you need to marginalise over the possible branch lengths $\mathbf{\theta}$ in the process of Bayesian inference. In other words, solving the following integral:
$$
\int_{ [ 0, \infty ]^{2S - 3} } p(D | \mathbf{\theta}, \tau ) p( \mathbf{\theta} | \tau) d \mathbf{\theta}
$$
There are some details about this problem that are common to phylogenetic problems, such as an exponential prior on the branch lengths, but otherwise this is the common problem of approximate Bayesian inference. This paper compares the following methods:
* ELBO (appears to be [BBVI][])
* Gamma Laplus Importance Sampling
* Variational Bayes Importance Sampling
* Beta' Laplus
* Gamma Laplus
* Maximum un-normalized posterior probability
* Maximum likelihood
* Naive Monte Carlo
* Bridge Sampling
* Conditional Predictive Ordinates
* Harmonic Mean
* Stabilized Harmonic Mean
* Nested Sampling
* Pointwise Predictive Density
* Path Sampling
* Modified Path Sampling
* Stepping Stone
* Generalized Stepping Stone
I leave the in depth description of each algorithm to the paper and appendices, although it's worth mentioning that Laplus is a Laplace approximation where the approximating distribution is constrained to be positive.
Some takeaways from the empirical results:
* If runtime is not a concern power posterior methods are preferred:
> The power posterior methods remain the best general-purpose tools for phylogenetic modelcomparisons, though they are certainly too slow to explore the tree space produced by PT.
* Bridge sampling is the next choice, if you need something faster.
* Harmonic Mean is a bad estimator for phylogenetic tree problems.
* Gamma Laplus is a good fast option.
* Naive Monte Carlo is a poor estimator, which is probably to be expected.
* Gamma Laplus is the best option for very fast algorithms:
> Empirical posterior distributions on branch lengths are clearly not point-masses, and yet simply normalizing the unnormalized posterior at the maximum outperforms 6 of the 19 tested methods.
All methods were compared on metrics important to phylogenetic inference, such as *average standard deviation of split frequencies" (ASDSF), which is typically used to confirm whether parallel MCMC chains are sampling from the same distribution over tree topologies. Methods were also compared on KL divergence to the true posterior and RMSD (appears to be the mean squared error between CDFs?).
[bbvi]: https://arxiv.org/abs/1401.0118