Sullivan Lab: Current Projects : Phylogenetic Methods

Phylogenetic Methods

Phylogenetic analysis, the estimation of evolutionary trees, has become the cornerstone of evolutionary biology. In addition to their more traditional applications in evolutionary biology, molecular phylogenies (i.e., phylogenies that have been estimated from molecular data such as DNA sequences) are being applied to an ever-widening array of disciplines. These include biomedicine (e.g., tracing infection pathways for HIV and other pathogens), bioinformatics (e.g., genome evolution), and forensics (phylogenies estimated from HIV sequences have recently been allowed as evidence in a murder trial). Because of this, the development and testing of phylogenetic methods is critically important and broadly relevant. Furthermore, the influx of molecular sequence data and the adoption of an explicitly statistical approach to data analysis have made it necessary to refine methods of phylogenetic inference.

Recent work in phylogenetic methods has centered on heuristic approaches to model-based phylogeny estimation.

Our understanding of the mechanisms of nucleotide substitution (DNA sequence evolution) has been expanding greatly over the last 15 years. Furthermore, it has become apparent that ignoring such processes as heterogeneity of base composition, substitution pattern, and rate variation among nucleotide sites can compromise attempts to estimate phylogeny from DNA sequence data. Therefore, model-based analyses of DNA sequence data have become increasingly widespread because this approach affords the investigator the opportunity to account for such processes explicitly in phylogenetic estimation. This is especially true when a maximum-likelihood framework is adopted. Two problems related to model-based analyses that I am addressing are computational intensity and the influence of violations of model assumptions on the accuracy of phylogenetic inference.

First, model-based methods require vast computational power. Thus, given the moderate computational capacities available to most researchers, it is critical to develop approximate methods of model-based analysis that may be applied to relatively large data sets. During my postdoctoral research (and as an extension of my doctoral research), Dr. David Swofford and I developed an approximate method based on a successive-approximations strategy. While this method appears to be very useful and has therefore become widely adopted, it was tested only recently. We have devoted enormous CPU time (using the Bioinformatics Core Facility) to demonstrating that, as long as heuristic searches are reasonably rigorous, the iterative search strategy works well (Sullivan et al., 2005). A Perl script is available that automates the successive-approximations approach.
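The iterative logic of a successive-approximations search can be sketched in a few lines of Python. This is only an illustrative outline, not the Perl script mentioned above: it assumes the usual form of the strategy (estimate model parameters on the current tree, hold them fixed while searching for a better tree, and repeat until the same tree is recovered twice in a row). The function and argument names (`successive_approximations`, `estimate_params`, `search_tree`) are hypothetical, and the two callables stand in for whatever likelihood software actually performs the optimization and the heuristic search.

```python
from typing import Callable, Dict, Hashable, Tuple

Params = Dict[str, float]


def successive_approximations(
    start_tree: Hashable,
    estimate_params: Callable[[Hashable], Params],
    search_tree: Callable[[Params, Hashable], Hashable],
    max_iterations: int = 10,
) -> Tuple[Hashable, Params, int]:
    """Alternate parameter estimation and tree search until the tree is stable.

    estimate_params(tree) -> model parameters optimized on a fixed tree
    search_tree(params, tree) -> best tree found with parameters held fixed
    Both callables are hypothetical placeholders for external software.
    """
    tree = start_tree
    for iteration in range(1, max_iterations + 1):
        # Step 1: optimize substitution-model parameters on the current tree.
        params = estimate_params(tree)
        # Step 2: heuristic tree search with those parameters held fixed.
        new_tree = search_tree(params, tree)
        # Step 3: stop when the search returns the tree we started from.
        if new_tree == tree:
            return tree, params, iteration
        tree = new_tree
    # Iteration cap reached without a stable tree; report the last state.
    return tree, estimate_params(tree), max_iterations
```

Holding the parameters fixed during each search is what makes the approach cheaper than optimizing parameters and topology jointly; the stopping rule shown here (an unchanged tree) is an assumption of this sketch.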

Second, although increasingly realistic models of sequence evolution have recently been developed and are easily implementable (e.g., various heterogeneous-rates models), even the most general and complex models currently available are certainly incorrect. For example, most models of sequence evolution used for phylogenetic estimation ignore intermolecular interactions. Although increasingly realistic and complex models could be developed to fit data better, such models might require so many parameters to be estimated from the data as to render them non-implementable. Thus, we have examined the relationship between model fit (in absolute rather than relative terms) and accuracy of phylogenetic estimation (Sullivan and Swofford, 2001). Many aspects of the model used for analysis may be incorrect without compromising the accuracy of phylogenetic estimation. For example, rate heterogeneity does not have to be modeled accurately as long as something is done to account for among-site rate variation. Thus, less computationally intensive approximate models that are statistically rejectable may provide estimates of phylogeny that are just as accurate as those from better-fitting continuous-rates models.

This realization has led us to develop a novel approach for selecting models for ML and Bayesian estimation of phylogenies from DNA sequence data. This approach (Minin et al., 2003) uses decision theory to incorporate an evaluation of the expected performance of each candidate model into model choice, and is implemented in the program DT-ModSel. We have demonstrated that the method selects models that are simpler on average than those chosen by other automated model-selection approaches, yet these simpler models (which are rejectable based on fit alone) provide phylogenies that are at least as accurate as those provided by the more complicated models (Abdo et al., 2005). We will continue with this project.
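To make the idea of performance-based model choice concrete, the following Python sketch ranks candidate models by an expected loss rather than by fit alone. It is a simplified illustration and not the DT-ModSel code. It assumes that each model's plausibility is approximated by a weight derived from its BIC score, and that the loss for using one model when another is correct is the Euclidean distance between the branch-length vectors estimated under the two models on a common tree. All function names and the shape of the inputs are hypothetical.

```python
import math
from typing import Dict, List


def bic_weights(bic: Dict[str, float]) -> Dict[str, float]:
    """Turn BIC scores into normalized weights (assumed proxy for model plausibility)."""
    best = min(bic.values())
    # Subtract the best score first so the exponentials do not underflow.
    raw = {model: math.exp(-0.5 * (score - best)) for model, score in bic.items()}
    total = sum(raw.values())
    return {model: w / total for model, w in raw.items()}


def expected_loss(
    branch_lengths: Dict[str, List[float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Expected loss of each model: weighted distance to every other model's estimate.

    Assumes all branch-length vectors refer to the same tree, in the same order.
    """
    risk = {}
    for model_i, lengths_i in branch_lengths.items():
        risk[model_i] = sum(
            weights[model_j] * math.dist(lengths_i, lengths_j)
            for model_j, lengths_j in branch_lengths.items()
        )
    return risk


def select_model(
    bic: Dict[str, float],
    branch_lengths: Dict[str, List[float]],
) -> str:
    """Return the candidate model with the smallest expected loss."""
    risk = expected_loss(branch_lengths, bic_weights(bic))
    return min(risk, key=risk.get)
```

Under a criterion of this kind, a simple model whose branch-length estimates are close to those of the well-supported models incurs little expected loss, which is one way to see why simpler models can be preferred even when they fit measurably worse.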