Minh Duc Cao,
Trevor I. Dix and
J. Molecular Biology and Evolution, 33(5), pp.1349-1357, Jan. 2016,
Methods for measuring genetic distances in phylogenetics are
known to be sensitive to the evolutionary model assumed.
However, there is a lack of established methodology to accommodate
the trade-off between incorporating sufficient biological reality and
avoiding model overfitting.
In addition, as traditional methods measure distances based on
the observed number of substitutions, they tend to underestimate distances
between diverged sequences due to backward and parallel substitutions.
Various techniques have been proposed to correct this,
but they are not robust for sequences that are
distantly related and have unequal base frequencies.
In this article, we present a novel genetic distance estimate based
on information theory that overcomes the above two hurdles.
Instead of examining the observed number of substitutions, this method
estimates genetic distances using Shannon's mutual information.
This naturally provides an effective framework for balancing model
complexity and goodness of fit.
Our distance estimate is shown to be approximately linear in
elapsed time and hence is less sensitive to sequence divergence
and to compositionally biased sequences.
Using extensive simulation data, we show that our method
1) consistently reconstructs more accurate phylogeny topologies than
2) is robust in extreme conditions such as diverged phylogenies,
unequal base frequencies, and heterogeneous mutation patterns, and
3) scales well with large phylogenies.
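
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares a substitution-count distance (the observed proportion of differing sites, with the standard Jukes-Cantor correction) against a naive plug-in estimate of Shannon's mutual information between two aligned sequences. The function names are hypothetical, the sequences are assumed to be pre-aligned and of equal length, and the plug-in estimator is an assumption chosen for brevity; the abstract does not say how the mutual information is estimated or how it is mapped to a distance.

```python
import math
from collections import Counter


def p_distance(a, b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc69_distance(p):
    """Jukes-Cantor correction of an observed proportion p.

    The correction assumes equal base frequencies and diverges as p
    approaches 0.75, which illustrates why substitution-count methods
    struggle on distantly related sequences.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def mutual_information(a, b):
    """Plug-in estimate of mutual information in bits per aligned site.

    Hypothetical helper: uses empirical column-pair frequencies only,
    with no model selection or bias correction.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
    return mi
```

In this sketch mutual information behaves as a similarity: it is highest for identical sequences and falls to zero when the two sequences are statistically independent, so a distance would have to be derived from it in some decreasing fashion. Unlike the Jukes-Cantor correction, it makes no assumption of equal base frequencies.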