Molecular Phylogeny Is a Mess of Uncertainty
Genomes galore – a great opportunity to study evolution, right? Think again. A paper in Science by Wong et al. [1] revealed systematic uncertainty in the way genomes are compared, leading to bias that makes genetic comparisons essentially useless. Antonis Rokas, in the same issue [2], began his commentary on this problem thus:
Darwin relied on fossils, morphology, and geographical distribution to glean important clues about the history of life. Today, natural historians can study organisms’ history of change and adaptation by probing the DNA record. Whether to elucidate evolutionary relationships of genes and species or spot the amino acid changes driven by selection, we need to be able to generate accurate alignments of DNA sequences. On page 473 of this issue, Wong et al. [1] provide some important caveats on how this can go awry and how to avoid alignment bias.
Rokas continued with a folksy explanation of the basic problem:
For years, the standard protocol has been to pick a favorite algorithm to optimize the alignment it generates. This approach is fast and easy, but it is like being forced to always settle on vanilla ice cream for dessert; doing so can taint one’s opinion about ice cream. Similarly, sticking to the use of a single alignment from a single algorithm can bias the estimation of phylogenies or of other evolutionary parameters pivotal to our understanding of the DNA record. Until now, the extent and potential significance of this bias introduced by alignment was unknown. Wong and colleagues quantify the contribution of alignment uncertainty to genome-wide evolutionary analyses and report that we sweep this uncertainty under the proverbial rug at our peril.
Wong and team used seven popular programs to compare seven genomes. “The term ‘popular’ is not used lightly here,” Rokas notes; “these programs have been employed, judging by citation counts, in at least 25,000 analyses.” The potential for revision, therefore, is enormous. What did the researchers find?
They report that a staggering 46.2% of the genes examined exhibit variation in the phylogeny produced dependent on the choice of alignment method, whereas the prediction of the amino acid changes driven by selection was likewise method dependent for another 28.4% of the genes.
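To make the first statistic concrete, here is a minimal sketch of how a "fraction of genes whose tree depends on the aligner" could be tallied. It is not the authors' pipeline. The gene names, aligner names, species labels, and the helper functions `topology` and `fraction_alignment_sensitive` are all hypothetical, and each tree is reduced to a set of clades purely for illustration.

```python
def topology(*clades):
    """Represent a tree topology as a hashable set of clades (illustrative only)."""
    return frozenset(frozenset(clade) for clade in clades)


def fraction_alignment_sensitive(trees_by_gene):
    """Return (fraction, genes) for genes whose topology differs across aligners.

    trees_by_gene maps gene name -> {aligner name -> topology}.
    """
    if not trees_by_gene:
        raise ValueError("no genes supplied")
    sensitive = [
        gene
        for gene, trees in trees_by_gene.items()
        if len(set(trees.values())) > 1  # more than one distinct tree
    ]
    return len(sensitive) / len(trees_by_gene), sensitive


# Three possible groupings of four hypothetical species.
t1 = topology({"sp1", "sp2"}, {"sp3", "sp4"})
t2 = topology({"sp1", "sp3"}, {"sp2", "sp4"})
t3 = topology({"sp1", "sp4"}, {"sp2", "sp3"})

# Toy data (made up): which tree each hypothetical aligner yields per gene.
toy_trees = {
    "gene_A": {"aligner_1": t1, "aligner_2": t1, "aligner_3": t1},
    "gene_B": {"aligner_1": t1, "aligner_2": t1, "aligner_3": t2},
    "gene_C": {"aligner_1": t2, "aligner_2": t2, "aligner_3": t2},
    "gene_D": {"aligner_1": t1, "aligner_2": t2, "aligner_3": t3},
}

fraction, sensitive = fraction_alignment_sensitive(toy_trees)
print(f"{fraction:.1%} of genes give aligner-dependent trees: {sensitive}")
```

In this toy data set two of the four genes change topology depending on the aligner. The 46.2% figure reported in the paper comes from real genomes and seven real programs, not from anything like this simplified tally.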
The significance of this “whoops” admission cannot be overstated. For years, evolutionary biologists have depended on the “popular” algorithms to generate phylogenetic trees, expecting their results to be reliable. Rokas explains that high “bootstrap” values for some trees (a popular index that is supposed to measure robustness in inference) can be misleading, because “bootstrap values do not always equate with phylogenetic accuracy.” But if the bootstrap value is strong, what is in error – the signal or the phylogenetic inference? Rokas did not explore the latter possibility.
Wong et al. explain how researchers can fall into the trap by trusting algorithms that cannot bear the weight of inference placed on them:
A common theme in comparative genomics studies is a flow diagram, or chart, tracing the various steps and algorithms used during the analysis of a large number of genes. Flow charts can be quite sophisticated, with steps such as identifying orthologous gene sets, aligning the genes, and performing different statistical analyses on the resulting alignments. The key point, and a great practical difficulty in comparative genomics studies, is that the analyses must be repeated many times. The procedure, then, is largely automated, with scripting languages such as Perl or Python cobbling together individual programs that perform each step. In addition, many of the individual steps involve procedures originally developed in the evolutionary biology literature, to perform phylogeny estimation or to identify individual amino acid residues under the influence of positive selection. Statistical methods that until recently would have been applied to a single alignment, carefully constructed, are now applied to a large number of alignments, many of which may be of uncertain quality and cause the underlying assumptions of the methods to fail.
This seems to indicate another problem: the very algorithms trusted were written on the assumption of evolution. Is there a circularity here? Will the algorithm select the data that will produce the expected evolutionary result? They did not elaborate.
The authors state that the uncertainty is not just a matter of sloppy analysis. A biologist may run the program with great care and precision. It’s trusting the algorithms themselves, and being unaware of the uncertainties, that leads to huge errors and false conclusions. They explain how this can happen:
Many comparative genomics studies are carefully performed and reasonable in design. However, even carefully designed and carried out analyses can suffer from these types of problems because the methods used in the analysis of the genomic data do not properly accommodate alignment uncertainty in the first place. Moreover, the genes that are of greatest interest to the evolutionary biologist probably suffer disproportionately. For example, in several studies, the genes of greatest interest were the ones that had diverged most in their nonsynonymous rate of substitution. But, these are the very genes that should be the most difficult to align in the first place. We also do not believe that the alignment uncertainty problem is one that can be resolved by simply throwing away genes, or portions of genes, for which alignment differs.
In fact, throwing out portions that have ambiguous alignments can lead to other problems, such as removing a large portion of the primary data. It also does not guarantee the remainder will line up well.
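A rough sketch of why discarding ambiguous regions is costly: if only the columns on which two alignments agree are kept, a sizable share of the data can vanish. The sequences, the function names `column_signatures` and `fraction_columns_retained`, and the choice to measure retention against the first alignment are assumptions made for illustration, not the method used in the paper.

```python
def column_signatures(alignment):
    """Describe each column by the residue index it holds for every sequence.

    A gap ('-') is recorded as None. Two alignments of the same sequences agree
    on a column only if they pair up exactly the same residues.
    Assumes all rows have equal length and no column is entirely gaps.
    """
    next_index = [0] * len(alignment)
    signatures = set()
    for column in zip(*alignment):
        signature = []
        for row, char in enumerate(column):
            if char == "-":
                signature.append(None)
            else:
                signature.append(next_index[row])
                next_index[row] += 1
        signatures.add(tuple(signature))
    return signatures


def fraction_columns_retained(alignments):
    """Fraction of the first alignment's columns found in every alignment."""
    signature_sets = [column_signatures(a) for a in alignments]
    shared = set.intersection(*signature_sets)
    return len(shared) / len(signature_sets[0])


# Two made-up alignments of the same pair of toy sequences (ACGTA and ACTGA).
aln_a = ["ACGT-A",
         "AC-TGA"]
aln_b = ["ACGTA",
         "ACTGA"]

retained = fraction_columns_retained([aln_a, aln_b])
print(f"columns kept after discarding disagreements: {retained:.0%}")
```

Here only three of the six columns in the first alignment survive, and nothing guarantees that the surviving columns are themselves correctly aligned, which is the point made above.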
Rokas has a good-news-bad-news story. On the hopeful side, “several novel statistical methods that simultaneously estimate alignment and evolutionary parameters of interest such as phylogeny have shown exceptional promise,” he said. The bad news is there’s a catch: “The computational demands of these programs are prohibitive.”
Wong et al. suggested some ways to mitigate alignment bias. No matter the quality control used, though, carefulness is not going to solve all the problems. “The goal is to analyze all of the genes in the genome,” they said. “As we have shown here, many of these genes will be difficult to align and result in highly variable evolutionary parameter estimates.” They did not seem to explore the possibility of circular reasoning in the algorithms.
Wow. This is going to be a shattering revelation to many a biologist. Rokas put the best possible spin on a bad situation:
As in any scientific field, molecular evolution has a long tradition of dramatic transformation. The development of a powerful computational and statistical arsenal to account for the uncertainty stemming from sequence alignments is heralding the first paradigm shift in the era of genome-scale analysis.
Now, the question is what to do about the 25,000 or more analyses that relied on these programs, and how long it will take to overcome the inertia of thousands of scientists who continue to use the popular algorithms, oblivious to their inherent uncertainties.
1. Wong, Suchard and Huelsenbeck, “Alignment Uncertainty and Genomic Analysis,” Science, 25 January 2008: Vol. 319, no. 5862, pp. 473-476, DOI: 10.1126/science.1151532.
2. Antonis Rokas, “Lining Up to Avoid Bias,” Science, 25 January 2008: Vol. 319, no. 5862, pp. 416-417, DOI: 10.1126/science.1153156.
This shouldn’t be news. A team of scientists reported six years ago that building phylogenetic trees with any realistic measure of reliability was mathematically impossible (07/25/2002). Evolutionary biologists have to make assumptions and take shortcuts to get results. Because the algorithms are built on evolutionary assumptions (e.g., what constitutes positive selection, or what constitutes maximum likelihood or a parsimonious solution), the whole exercise is circular. Don’t think for a minute that a computer program built by evolutionists for evolutionists is going to generate bias-free, objective, neutral “facts of science.” They is a-huntin’ for Darwin’s trees, and Darwin’s trees is what they gonna get.
This paper is not likely to make much of a dent. Life will go on, because “tree-thinking” is inscribed with an iron stylus on the evolutionary biologist’s brain (11/14/2005). It influences everything he thinks and does. Besides, bashing down the creationists with mountains and mountains of scientific evidence for evolution is too important a mission for a little bit of error, say 75% or more, to hinder. With Darwin Day coming, the show must go on!