Tree of Life in the Genes? Not Yet
Now that we have hundreds of animal genomes in the bank (GenBank), is Darwin’s tree of life becoming visible? If the image is present, it is extremely weak, said Michael J. Sanderson of the Department of Ecology and Evolutionary Biology at the University of Arizona. Writing for Science,1 he showed that only a small fraction of genomes shows even minimal support for a phylogenetic (evolutionary) tree.
His report was accompanied by a circle diagram with 876 taxonomic orders represented by small rectangles along the rim. He shaded blue those that contained a minimal phylogenetic signal, and yellow those that did not. The circle was almost entirely yellow; one has to look hard for blue rectangles. This comes after “improvements in algorithms and high-performance computing technology have dramatically increased the scale of feasible phylogenetic inference; and unconventional sources of data, including whole genomes, expressed sequence tag libraries, and barcode sequences, have altered the landscape of large-scale phylogenetics with an infusion of new evidence.” The distribution of species in GenBank (the database of gene sequences) is remarkably broad, he said. If there were ever a time to see Darwin’s tree of life come to light in the genes, it should be now.
In light of the flood of evidence, how can the phylogenetic signal be so weak? “Construction of a high-resolution phylogenetic tree containing all eukaryotic species in the database is a grand challenge that is substantially more tractable than inferring the entire tree of life, but to succeed, strategies will have to overcome serious sampling impediments,” he said. “Quantifying the distribution and strength of phylogenetic evidence currently in the database is a prerequisite for this effort.” So that’s what he set out to do. And that’s what turned out to look pretty weak.
Sanderson looked at 1127 higher taxa for evidence of a phylogenetic signal. He had to set his standards pretty low. He figured if there were at least four operational taxonomic units [OTUs] that were similar between two taxa, for instance, then an evolutionary relationship could be inferred. His choice of tree-building software also was rigged to produce a “fast but conservative” result. “Any clade in the resulting tree will have had at least 50% bootstrap support in maximum parsimony ‘fast’ bootstrap analyses with two different sequence alignment algorithms,” he explained.2 “Although this protocol biases the confidence assessment slightly downward, the bias is small.” Is that a matter of human opinion?
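For readers unfamiliar with the term, the 50% bootstrap rule in the quote can be illustrated with a toy sketch. This is not Sanderson's pipeline: the function names, the taxon labels, and the replicate trees below are all hypothetical, and each tree is reduced to a bare set of clades. It only shows the general idea of keeping a clade when it turns up in at least half of the resampled trees under both alignments.

```python
from collections import Counter

def clade_support(replicates):
    """Fraction of bootstrap replicates in which each clade appears."""
    counts = Counter(clade for tree in replicates for clade in tree)
    total = len(replicates)
    return {clade: n / total for clade, n in counts.items()}

def supported_clades(replicates, threshold=0.5):
    """Clades whose bootstrap support meets the threshold (50% by default)."""
    return {c for c, s in clade_support(replicates).items() if s >= threshold}

# Hypothetical toy data: each replicate tree is a set of clades,
# and each clade is a frozenset of taxon labels.
AB = frozenset({"A", "B"})
CD = frozenset({"C", "D"})
AC = frozenset({"A", "C"})

alignment_1 = [{AB, CD}, {AB, CD}, {AB}, {AC}]  # replicates from one alignment
alignment_2 = [{AB, CD}, {AB}, {AB}, {AC}]      # replicates from a second alignment

# Keep only clades that pass the 50% bar under both alignments.
kept = supported_clades(alignment_1) & supported_clades(alignment_2)
print(kept)
```

In this toy case the clade C+D reaches exactly 50% under the first alignment but only 25% under the second, so it is dropped; only A+B survives both.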
There were more hints the standards were loose. “For comparative purposes and to aid in the visualization of results, an arbitrary cutoff value of 1.5 was selected as minimal phylogenetic support,” he continued. “This is equivalent, for example, to the information content of two independent loci, each resolving three-quarters of clades to at least a bootstrap value of 51%.” This sounds close to the tipping point for inferring no relationship at all.
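The arithmetic behind that cutoff is simple if one assumes the score adds up across loci, with each locus contributing the fraction of clades it resolves. That additive reading is an assumption drawn from the quoted example, not the paper's stated formula, and the names below are hypothetical.

```python
MIN_SUPPORT = 1.5  # the "arbitrary cutoff value" quoted above

def support_score(resolved_fractions):
    """Assumed additive score: sum of per-locus fractions of resolved clades."""
    return sum(resolved_fractions)

# Two independent loci, each resolving three-quarters of clades.
two_loci = support_score([0.75, 0.75])
print(two_loci, two_loci >= MIN_SUPPORT)

# A single locus, even one that resolves every clade, falls short under this reading.
one_locus = support_score([1.0])
print(one_locus, one_locus >= MIN_SUPPORT)
```

Under this reading, a taxon needs the equivalent of at least two reasonably informative loci to count as minimally supported.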
After manipulating his protocols, summing, and averaging, the evolutionary signal came out surprisingly low, even with the loose standards. Here is the upshot:
Among individual OTUs [operational taxonomic units], Homo sapiens had the maximum support value of 293.9, but the distribution of scores had a long tail leading to 6402 OTUs with no support at all (most of which, 6079, simply were not found in any phylogenetically informative clusters). The top 10 were all mammals; the top 25 were mammals, angiosperms (tomato, potato, tobacco, rice, and wheat), Drosophila melanogaster, and Drosophila simulans, all with support scores above 60 units. Of the 171,703 OTUs for which scores were calculated, only 12% achieved minimal phylogenetic support. The mean support was 0.84, less than the equivalent of each taxon being found in at least one well-resolved and -supported phylogenetic tree.
So only 12% reached the already-low bar for evolutionary signal – that means 88% did not. At the level of orders, the scores were skewed even lower. The maximum score was 10, in primates; 75 other orders scored 0.0. He tried to draw an inference between orders that were species-rich and species-poor, but many of the orders outside of primates and arthropods did not even reach minimal phylogenetic support regardless of species richness.
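The OTU-level figures quoted above can be turned into rough head counts. The sketch below uses only the numbers in the quote; since the 12% is itself rounded, the derived counts are approximate, and the variable names are just for illustration.

```python
total_otus = 171_703      # OTUs for which scores were calculated
share_supported = 0.12    # fraction reaching minimal phylogenetic support (rounded in the source)
zero_support = 6_402      # OTUs with no support at all
not_in_clusters = 6_079   # of those, OTUs absent from any informative cluster

# Approximate head counts implied by the rounded 12% figure.
approx_supported = round(total_otus * share_supported)
approx_unsupported = total_otus - approx_supported

# Shares derived directly from the quoted counts.
pct_zero = 100 * zero_support / total_otus
pct_zero_unclustered = 100 * not_in_clusters / zero_support

print(f"about {approx_supported:,} OTUs reach minimal support")
print(f"about {approx_unsupported:,} OTUs do not")
print(f"{pct_zero:.1f}% of scored OTUs have no support at all")
print(f"{pct_zero_unclustered:.1f}% of those were in no informative cluster")
```

Note that the zero-support OTUs are a small slice of the total; most of the 88% have some support, just less than the 1.5 cutoff.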
So what did Sanderson conclude from his investigation of the strength of the signal of Darwin’s tree of life in the genes? Basically, he said more work is needed. “An accurate high-resolution phylogeny will require substantial increases in sequence data to bring that score to a level comparable to that of the best-supported higher taxa.” He thinks more data targeted at the right clusters of genes might help. Better algorithms in the tree-building software might help, too. Maybe the signal will become clearer when genes from undiscovered species in poorly resolved branches become available. “In the meantime, sampling protocols guided by quantitative assessments of the phylogenetic distribution of data will improve the efficiency of emerging phylogenomic strategies for building the tree of life of known organisms.” Translated, this almost sounds like he is claiming that better data-massaging methods might just begin to help develop strategies for beginning to find ways to begin to visualize Darwin’s tree. In colloquial terms, it’s going to take a lot of work to fix this picture.
1. Michael J. Sanderson, “Phylogenetic Signal in the Eukaryotic Tree of Life,” Science, 4 July 2008: Vol. 321. no. 5885, pp. 121-123, DOI: 10.1126/science.1154449.
2. For more on the meanings of bootstrap, maximum parsimony and other phylogenetic tree-building terms, see the entries from 04/26/2008, 01/26/2008, 03/30/2004, 10/15/2003, and 11/06/2002.
Charlie’s hanging from his own tree. Why give him more rope? It will only make the carcass horizontal instead of vertical.