Background
The traditional view of life on Earth has been that all living organisms can be organised into the so-called "Tree of Life" (or universal phylogenetic tree), in which species are related to each other by their relative positions on the tree. This view assumes that species evolve through vertical inheritance: organisms pass their genes only to their offspring. Organisms with similar genes are therefore likely to have descended from a more recent common ancestor, and are positioned closer together on the tree. Organisms that are further apart on the tree are likely to have descended from a much earlier common ancestor, and so may have fewer genes in common.
Figure 1: An example of the universal phylogenetic tree, based on 16S rDNA sequences.
Lateral Genetic Transfer (LGT) is the process in which an organism transmits genetic material (i.e. DNA) to another organism which is not its offspring. It is most common among bacteria, and may occur in a number of ways, such as through bacterial viruses or through direct contact. For instance, two organisms in different environments will have different genes, which allow each of them to survive in its own environment. Through LGT one organism may pass its genes to the other, allowing the recipient to survive in the alternative environment. In the phylogenetic tree for the gene involved in the transfer, these two organisms will therefore appear closely related, even though they may share very few other genes. This means that the phylogenetic trees for single genes or proteins of a particular group of species may differ quite significantly from other trees for the same group of species.
It is largely through LGT mechanisms that bacteria have been able to develop resistance to antibiotics. LGT also has implications for phenomena such as bioremediation and climate change. A team of researchers led by Professor Mark Ragan of the Institute for Molecular Biosciences at the University of Queensland is seeking to understand on a broad scale how often, and between what types of organisms, LGT may occur.
The Project
For their research, Professor Ragan and Dr Robert Beiko started with almost 423,000 predicted proteins from a set of 144 prokaryotic organisms. Using protein clustering methods based on tools such as BLAST, they identified families of proteins that appeared to be evolutionarily related. This resulted in a set of 22,437 MRCs containing 220,240 sequences. An MRC (maximally representative cluster) is a family of proteins thought to be related by evolution, in which no genome is represented more than once.
The group then performed multiple sequence alignment on these sets. Each protein is a chain of residues drawn from 20 different amino acids, and can be of any length. Over the course of evolution, proteins can be subject to insertion and deletion events, in which sections of the sequence are inserted or removed. Alignment methods are required to identify and model these insertion/deletion events. The resulting multiple sequence alignment allows the evolutionary distances between pairs of proteins to be determined.
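As a rough illustration of how an alignment yields a distance between two proteins, the sketch below computes the simplest such measure, the proportion of differing residues (the "p-distance"), over the columns where neither sequence has a gap. The function name and the two short aligned sequences are made up for this example, and the study itself is likely to have used more sophisticated, model-based distances.

```python
def p_distance(aligned_a, aligned_b, gap="-"):
    """Proportion of differing residues over columns where neither sequence has a gap.

    Hypothetical helper for illustration only; not the distance measure used in the study.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        # Columns containing a gap reflect an insertion/deletion event and are skipped.
        if residue_a == gap or residue_b == gap:
            continue
        compared += 1
        if residue_a != residue_b:
            differing += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return differing / compared


# Two invented aligned fragments: one substitution (K/R) and two gapped columns.
seq_a = "MKT-LLVA"
seq_b = "MRTALL-A"
print(p_distance(seq_a, seq_b))
```

Here six columns are gap-free and one of them differs, so the two fragments are separated by a distance of one sixth.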
The multiple sequence alignment problem is NP-hard, and exact alignment methods become impractical for more than about 10 sequences. Given the huge number of sequences involved in this study, exact methods were out of the question, so approximations had to be used. These approximate methods are still quite computationally intensive, and required the use of APAC and QCIF HPC facilities at ANU and UQ. Rather than using just one method of multiple sequence alignment, the group attacked the problem in several different ways, to ensure the best possible result. Removing ambiguously aligned regions yielded 22,432 alignment sets, ranging in size from 4 to 144 sequences.
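To give a feel for why exact alignment breaks down so quickly, the following sketch counts the cells in the table that a standard exact dynamic-programming alignment would have to fill: one cell for every combination of positions across all sequences. The sequence length of 300 residues is an assumption chosen only for illustration, and the function name is hypothetical.

```python
def dp_cells(num_sequences, length):
    """Cells in the dynamic-programming table for exactly aligning
    `num_sequences` sequences that each contain `length` residues."""
    return (length + 1) ** num_sequences


ASSUMED_LENGTH = 300  # illustrative protein length, not a figure from the study

for n in (2, 3, 5, 10):
    print(f"{n:>2} sequences: {dp_cells(n, ASSUMED_LENGTH):.2e} cells")
```

The table grows by a factor of about 300 with every sequence added, which is why heuristic alignment methods are used for sets of any real size.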
Phylogenetic Inference
Following this alignment, the group started building individual phylogenetic trees for each protein family. This step, known as phylogenetic inference, involved searching the space of possible phylogenetic trees for each set to find trees which have a high likelihood given the input model. Each possible tree represents a different way of pairing off the sequences in sister relationships, so as the number of sequences in the set grows, the tree space quickly becomes enormous. For a set of 4 protein sequences there are 3 possible trees. For a set of 50 sequences there are on the order of 10^74 possible trees, so an exhaustive search of every possible tree was not possible.
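The growth of tree space can be checked with a short calculation. Assuming the trees in question are unrooted and strictly bifurcating (which matches the figure of 3 trees for 4 sequences), the number of distinct trees for n sequences is the product of the odd numbers from 3 up to 2n - 5. The function name below is hypothetical.

```python
def unrooted_tree_count(num_sequences):
    """Number of distinct unrooted bifurcating trees on `num_sequences` labelled leaves.

    Equals 3 * 5 * 7 * ... * (2n - 5); there is exactly one tree for n = 3.
    """
    if num_sequences < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    for odd in range(3, 2 * num_sequences - 4, 2):
        count *= odd
    return count


for n in (4, 10, 20, 50):
    print(f"{n:>2} sequences: {unrooted_tree_count(n):.3e} trees")
```

Each added sequence multiplies the count by the next odd number, so the total outgrows any feasible exhaustive search long before the family sizes seen in this study are reached.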
This meant that a number of different methods had to be used. One such method makes use of the Bayesian sampling software MrBayes, which finds regions of tree space which have a particularly high likelihood. In 2004, the group ran benchmarking tests using representative data sets on QCIF supercomputers at UQ, in order to assess the stability of a number of different methods and to determine the amount of time that would be required for most data sets. This was an extremely important part of their research, and would not have been possible without QCIF resources. Results of this were published in early 2006 [2].
From this benchmarking test the group determined that a replication strategy (usually running the computations 10 to 15 times each) was most suitable for building the phylogenetic trees for the larger families of proteins. This phylogenetic inference step was accomplished using a combination of APAC and QCIF resources, and was estimated to have taken around 400,000 CPU hours - to date it is still the largest biologically inspired computation undertaken in Australia.
Tree Reconciliation
Once these individual protein trees had been built, they were compared with a reference supertree built from over 22,000 orthologous protein families. This supertree essentially describes how the organisms would have evolved if genes had been passed on strictly through speciation events, i.e. by vertical descent. Where there were significant differences between a protein tree and the reference tree, the group looked at LGT as a way of reconciling these differences. Finding the LGT events involves breaking the reference tree and moving the pieces around until the result is consistent with the protein tree, a process known as subtree prune-and-regraft (SPR). Searching through the possible sequences of prune-and-regraft operations (known as edit paths) has been proven to be NP-hard, so Dr Beiko developed his own SPR algorithm to recover the edit paths efficiently. In some cases this still required up to 20 hours and 10 GB of RAM, and in other cases the problem proved too difficult and the trees could not be reconciled. However, minimal edit paths were able to be computed exactly for 19,351 protein families, with edit distances ranging from 1 to 22. These edit paths correspond to the probable pathways of gene sharing.
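A single prune-and-regraft move can be sketched in a few lines. The example below is not Dr Beiko's algorithm: it applies one hand-picked move rather than searching for a minimal edit path, and it assumes rooted trees written as nested pairs with uniquely named leaves. All function names and the four-organism trees are invented for illustration.

```python
def prune(tree, target):
    """Remove the subtree `target`; its former parent collapses onto the sibling."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    if left == target:
        return right
    if right == target:
        return left
    return (prune(left, target), prune(right, target))


def regraft(tree, subtree, at):
    """Attach `subtree` on the branch leading to the node `at`."""
    if tree == at:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    left, right = tree
    return (regraft(left, subtree, at), regraft(right, subtree, at))


def spr(tree, subtree, at):
    """One subtree prune-and-regraft move (assumes both nodes exist in `tree`)."""
    return regraft(prune(tree, subtree), subtree, at)


def clades(tree):
    """Set of leaf groups under each internal node, for order-independent comparison."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = leaves(node[0]) | leaves(node[1])
        found.add(members)
        return members

    leaves(tree)
    return found


# Invented example: the reference tree groups A with B, but in the protein
# tree B sits next to D, as it might after a transfer between B and D.
reference_tree = ((("A", "B"), "C"), "D")
protein_tree = (("A", "C"), ("B", "D"))

moved = spr(reference_tree, "B", "D")
print(moved)
print(clades(moved) == clades(protein_tree))
```

In this toy case one move is enough to make the reference tree agree with the protein tree, giving an edit distance of 1 and pointing to a transfer involving the lineages of B and D. The hard part of the real problem is finding the shortest such sequence of moves among the very many possible ones.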
Results from this study were published in late 2005 [1], in a paper rated a "Must Read" in Faculty of 1000.
For information on similar research in this field, see Orthology and Paralogy in Eukaryotes.
Contacts
Dr Robert Beiko, Professor Mark Ragan
Institute for Molecular Biosciences, University of Queensland
Publications
1. Beiko RG, Harlow TJ, and Ragan MA, 'Highways of gene sharing in prokaryotes', Proc Natl Acad Sci USA. 2005 Oct 4, 102(40):14332-7.
2. Beiko RG, Keith JM, Harlow TJ, and Ragan MA, 'Searching for convergence in Phylogenetic Markov Chain Monte Carlo', Syst Biol. 2006 Aug, 55(4):553-65.
3. Beiko RG and Hamilton N, 'Phylogenetic identification of lateral genetic transfer events', BMC Evol Biol. 2006 Feb 11, 6:15.
Written by T. Curtis, August 2006
