Phylogenetic analysis, the estimation of evolutionary trees,
has become the cornerstone of evolutionary biology. In addition
to their more traditional applications in evolutionary biology,
molecular phylogenies (i.e., phylogenies that have been estimated
from molecular data such as DNA sequences) are being applied
to an ever-widening array of disciplines. These include biomedicine
(e.g., tracing infection pathways for HIV and other pathogens),
bioinformatics (e.g., genome evolution), and forensics (phylogenies
estimated from HIV sequences have recently been allowed as evidence
in a murder trial). Because of this, the development and testing
of phylogenetic methods assume a position of critical importance
and extremely broad relevance. Furthermore, the influx of molecular
sequence data and the adoption of an explicitly statistical approach
to data analysis have led to the requirement to refine methods
of phylogenetic inference.
Recent work in phylogenetic methods has centered on heuristic
approaches to model-based phylogeny estimation.
Our understanding of the mechanisms of nucleotide substitution
(DNA sequence evolution) has been expanding greatly over the
last 15 years. Furthermore, it has become apparent that ignoring
such processes as heterogeneity of base composition, substitution
pattern, and rate variation among nucleotide sites can compromise
attempts to estimate phylogeny from DNA sequence data. Therefore,
model-based analyses of DNA sequence data have become increasingly
widespread because this approach affords the investigator the
opportunity to account for such processes explicitly in phylogenetic
estimation. This is especially true when a maximum-likelihood
framework is adopted. Two problems related to model-based analyses
that I am addressing are computational intensity and the influence
of violations of model assumptions on the accuracy of phylogenetic
inference.
First, model-based methods require vast computational power.
Thus, given the moderate computational capacities most researchers
face, it is critical to develop approximate methods of model-based
analyses that may be applied to relatively large data sets. During
my postdoctoral research (and as an extension of my doctoral
research), Dr. David Swofford and I developed an approximate
method based on a successive-approximations strategy. While this
method appears to be very useful and has therefore become widely
adopted, it has only been tested recently. We have devoted enormous
CPU time (using the Bioinformatics Core Facility) to demonstrating
that, as long as heuristic searches are reasonably rigorous, the
iterative search strategy works well (Sullivan et al., 2005).
A Perl script is available that automates the successive-approximations
approach.
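
The iteration can be sketched in a few lines of Python. This is only an illustrative outline, not the Perl script mentioned above: the function name and the two callbacks (optimize_params and search_tree) are hypothetical stand-ins for whatever likelihood software performs the parameter optimization and the heuristic tree search. It assumes the loop alternates between optimizing model parameters on the current tree and searching for a tree with those parameters held fixed, stopping when the search returns the tree it started from.

```python
def successive_approximations(start_tree, optimize_params, search_tree, max_iter=20):
    """Alternate parameter optimization and tree search until the tree stabilizes.

    optimize_params(tree) -> params       (hypothetical callback)
    search_tree(params, tree) -> tree     (hypothetical callback)
    Returns (tree, params, iterations_used).
    """
    tree = start_tree
    # Optimize model parameters on the starting tree (e.g., a quick initial estimate).
    params = optimize_params(tree)
    for iteration in range(1, max_iter + 1):
        # Heuristic search with the model parameters held fixed.
        new_tree = search_tree(params, tree)
        if new_tree == tree:
            # The search found the same tree again: the procedure has converged.
            return tree, params, iteration
        # Otherwise re-optimize the parameters on the new tree and repeat.
        tree = new_tree
        params = optimize_params(tree)
    # Iteration cap reached without the tree stabilizing.
    return tree, params, max_iter
```

The max_iter cap is a safeguard added for the sketch; how rigorous each heuristic search needs to be is the question the testing described above addresses.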
Second, although increasingly realistic models of sequence evolution
have recently been developed and are easily implementable (e.g.,
various heterogeneous rates models), even the currently most
general and complex models are certainly incorrect. For example,
most models of sequence evolution used for phylogenetic estimation
ignore intermolecular interactions. Although increasingly realistic
and complex models could be developed to fit data better, such
models might require so many parameters to be estimated from data
as to render them non-implementable. Thus, I have examined the
relationship between model fit (in absolute rather than relative
terms) and accuracy of phylogenetic estimation (Sullivan
and Swofford, 2001). Many aspects of the model used for
analysis may be incorrect without compromising the accuracy of
phylogenetic estimation. For example, rate heterogeneity does
not have to be modeled accurately as long as something is done
to account for among-site rate variation. Thus, use of less
computationally intensive approximate models that are statistically
rejectable may provide estimates of phylogenies that are as accurate
as those from better-fitting continuous-rates models.
This realization has led us to develop a novel approach for
selection of models for ML and Bayesian estimation of phylogenies
from DNA sequence data. This approach (Minin et al., 2003)
incorporates decision theory to permit an evaluation of the
expected performance of each candidate model into model choice,
and is implemented in the program DT-ModSel.
We have demonstrated that the method selects models that are simpler
on average than other automated model selection approaches, yet
these simpler models (that are rejectable based on fit alone)
provide phylogenies that are at least as accurate as those provided
by the more complicated models (Abdo et al., 2005). We will
continue with this project.
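
The general idea of performance-based model choice can be illustrated with a short Python sketch. This is not the DT-ModSel code. It assumes, for illustration only, that each candidate model is weighted by a BIC-based approximation to its posterior probability, that the loss for choosing one model when another is true is the Euclidean distance between their branch-length estimates, and that the model with the smallest expected loss (risk) is selected. All function names and the input layout are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion (smaller is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)

def bic_weights(bics):
    """Approximate posterior model probabilities from BIC scores."""
    best = min(bics)
    # Subtract the minimum first so the exponentials stay numerically stable.
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

def branch_length_distance(a, b):
    """Euclidean distance between two branch-length vectors (assumed loss)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_model(models, n_sites):
    """Pick the model with the smallest expected loss.

    models maps a model name to (log_likelihood, n_params, branch_lengths),
    with branch lengths estimated on a common tree (hypothetical layout).
    Returns (best_model_name, risks_by_model).
    """
    names = list(models)
    weights = bic_weights([bic(models[n][0], models[n][1], n_sites) for n in names])
    risks = {}
    for i in names:
        # Expected loss of choosing model i, averaged over which model is "true".
        risks[i] = sum(
            w * branch_length_distance(models[i][2], models[j][2])
            for j, w in zip(names, weights)
        )
    return min(risks, key=risks.get), risks
```

Under a loss of this kind, two models that yield the same branch lengths carry the same risk regardless of how well each fits, which is how a simpler model can be preferred even when it is rejectable on fit alone.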