Main

It is well documented that some genes evolve more quickly than others; for instance, in the human species, certain histone genes are highly conserved, whereas immunoglobulin loci are extremely polymorphic1. A lack of genetic variation might indicate the occurrence of purifying selection — a force that preserves the adapted condition and that is therefore typically observed in functionally important genes. By contrast, extensive variation in genes indicates that the encoded protein might benefit from undergoing amino-acid replacements. Such positive selection has been recently observed in genes that have an adaptive function. Until now, it has been difficult to link the patterns of molecular variation to the selective pressures responsible for them. However, in some systems, notably in viral species, sufficient sequence data now exist to test adaptive hypotheses directly using phylogenetic analysis.

Phylogenetic trees are a graphic means of reconstructing evolution on the basis of similarity between the characters of the individuals under study; the length of a horizontal branch on the tree reflects the amount of change between an individual and its nearest ancestor (Box 1). Evolutionary pressure on a gene or codon can be detected by comparing the rates of synonymous (silent) and non-synonymous (amino-acid changing, or non-silent) nucleotide substitutions across the branches of a tree. In the absence of selection, the synonymous and non-synonymous substitution rates should be equal (Fig. 1a). Most coding genes show an excess of synonymous substitutions, which indicates that purifying (stabilizing) selection is operating to preserve the current structure and function of the protein (Fig. 1b). Neutral or conserved substitution patterns provide limited insight into the evolutionary process, because the phylogenetic tree provides no additional information as to why the gene evolved in this manner.

Figure 1: Effects of selection on substitution rates.

Non-synonymous (NS) and synonymous (S) nucleotide substitutions that typify three selective regimes: a | selective neutrality (NS = S), b | purifying (stabilizing) selection, which conserves the present sequence (NS < S), and c | positive selection, which favours amino-acid replacement (NS > S). Non-synonymous substitutions are represented by coloured dots and synonymous substitutions are indicated by black dots.
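The comparison that the figure summarizes can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the studies cited here: it assumes the standard genetic code, two aligned in-frame sequences without gaps (for example, an inferred ancestor and its descendant on one branch), and it compares raw counts of changed codons. Formal tests normalize by the number of synonymous and non-synonymous sites, which this sketch omits. The function names are hypothetical.

```python
# Standard genetic code with bases ordered T, C, A, G (assumption: no
# alternative code, e.g. mitochondrial, is needed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_substitutions(ancestor, descendant):
    """Return (non_synonymous, synonymous) counts of changed codons."""
    if len(ancestor) != len(descendant) or len(ancestor) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    non_synonymous = synonymous = 0
    for pos in range(0, len(ancestor), 3):
        old, new = ancestor[pos:pos + 3], descendant[pos:pos + 3]
        if old == new:
            continue  # codon unchanged on this branch
        if CODON_TABLE[old] == CODON_TABLE[new]:
            synonymous += 1      # silent change: same amino acid
        else:
            non_synonymous += 1  # amino-acid replacement
    return non_synonymous, synonymous


def regime(non_synonymous, synonymous):
    """Name the selective regime suggested by the raw counts (Fig. 1a-c)."""
    if non_synonymous > synonymous:
        return "positive"
    if non_synonymous < synonymous:
        return "purifying"
    return "neutral"


# GCT -> GCC is silent (Ala -> Ala); AAA -> AGA replaces Lys with Arg.
ns, s = count_substitutions("ATGGCTAAA", "ATGGCCAGA")
print(ns, s, regime(ns, s))
```

With so few changes the label means little; the point is only to show how each codon change is sorted into one of the two classes.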

Much more interesting studies are possible when substitution rate analyses indicate the occurrence of positive selection (Fig. 1c). Positive selection is natural selection that favours amino-acid change. Continual positive selection leaves a characteristic pattern on a phylogenetic tree in the form of a greater rate of non-synonymous than synonymous substitution. Potentially, these trees can provide a great deal of additional information about the nature of adaptive change in a system. Typically, only a small number of codons per gene seem to be positively selected. In proteins of known structure, studying the effects of changing these particular residues might lend insight into the functional role of the protein. In proteins of unknown structure, knowing the location of positively selected residues in the two-dimensional structure provides a starting point to determine the three-dimensional structure of the protein, as these residues typically lie in positions exposed to external selective forces. In addition, we can test adaptive hypotheses by correlating change over time (across the tree) at the putative, positively selected codons with changes in phenotype or in fitness. Given a sufficient understanding of how a protein responds to selection in a particular system, it might be possible to predict its response to future selective challenges.

In this article, I outline the theoretical basis for the research into substitution rate analysis and summarize the biological systems in which evidence of positive selection has been detected. I discuss the cases in which predicting evolution might be realistic, along with some of the potential pitfalls encountered in this type of work. Last, I present some practical applications of substitution rate analysis: for epitope identification in vaccine design, for the determination of protein structure, and as a tool for interpreting the results of whole-genome sequencing projects.

Positive selection

The study of adaptive evolution using substitution rate analysis involves two basic steps. The first involves reconstructing the evolutionary history of a gene in the form of a phylogenetic tree. A tree depicts the changes that occur as sequences descend from a common ancestor (Box 1). In the second step, the tree is used to estimate the non-synonymous and synonymous nucleotide substitution rates over time (Box 2). A substitution rate might be calculated for the entire gene by summing substitutions across codons; however, with sufficient data, rates might also be estimated for each individual codon. Tests that sum substitutions across codons might fail to identify positively selected genes if high non-synonymous substitution rates occur at only a few codons. Despite this drawback, genic level studies have identified a number of putative, positively selected genes (see Ref. 2 for a comprehensive list). Most of these genes fall into two principal groups: pathogen surface proteins, and sperm proteins of aquatic animals that practice external fertilization.
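The drawback of summing across codons can be made concrete with a small sketch. The data below are invented for illustration, and the threshold `min_ns` is a hypothetical cut-off rather than a statistical test: twenty synonymous substitutions are scattered across a gene, while five non-synonymous substitutions all fall on one codon.

```python
from collections import Counter


def per_codon_counts(substitutions):
    """Tally substitutions per codon.

    substitutions: iterable of (codon_index, kind), kind being "NS" or "S".
    Returns two Counters: non-synonymous and synonymous counts by codon.
    """
    ns, s = Counter(), Counter()
    for codon, kind in substitutions:
        if kind == "NS":
            ns[codon] += 1
        elif kind == "S":
            s[codon] += 1
        else:
            raise ValueError(f"unknown substitution kind: {kind!r}")
    return ns, s


def gene_wide_excess(substitutions):
    """True if non-synonymous changes outnumber synonymous ones gene-wide."""
    ns, s = per_codon_counts(substitutions)
    return sum(ns.values()) > sum(s.values())


def codons_with_excess(substitutions, min_ns=3):
    """Codons with at least min_ns non-synonymous changes and NS > S.

    min_ns is an arbitrary illustrative threshold, not a significance test.
    """
    ns, s = per_codon_counts(substitutions)
    return sorted(c for c in ns if ns[c] >= min_ns and ns[c] > s[c])


# Invented example: silent changes spread over codons 1-20, replacements
# concentrated at codon 145.
data = [(codon, "S") for codon in range(1, 21)] + [(145, "NS")] * 5
print(gene_wide_excess(data))    # the summed test sees no excess
print(codons_with_excess(data))  # the per-codon view flags codon 145
```

Here the gene-wide comparison reports no excess of non-synonymous change, whereas the per-codon tally singles out the one codon where replacements accumulate.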

The surface proteins of pathogens must change their three-dimensional structure to avoid recognition by antibodies that are raised in response to previous antigen exposure. Therefore, it is likely that evasion of the host immune system drives repeated amino-acid replacements in surface proteins. Indirect support for this assumption has been found: the codons in genes with a high non-synonymous substitution rate typically code for residues that are exposed on the surface of the pathogen. Examples include the porB gene of the gonorrhoea-inducing bacterium Neisseria gonorrhoeae3, which encodes protein channels in the lipopolysaccharide layer of Gram-negative bacteria, and the gp120 envelope gene of the human immunodeficiency virus (HIV-1)4. The same is true of haemagglutinin (HA), which, along with neuraminidase (NA), is one of the most antigenic surface proteins of the influenza virus. Here, the positively selected residues lie on the surface of the protein, within known antibody-binding sites5.

Pathogens exert strong selective pressure on their hosts so, not surprisingly, there is evidence of positive selection from various host-defence systems. Human major histocompatibility complex (MHC) antigen-recognition sites seem to be positively selected6. Although plants use resistance genes and chitinases (enzymes that degrade fungal cell walls) for defence rather than the T cells, MHC and antibodies that are used by animals, surface proteins of plant pathogens show signs of positive selection7,8 as do plant-defence systems9. These studies suggest that host–pathogen systems involve exquisite matching between the pathogen and host receptors.

Proteins that are involved in the reproduction of externally fertilizing marine organisms provide another class of genes under positive selection. The two most extensively studied cases are the lysin gene of abalones (a shellfish)10 and the bindin gene in sea urchins14. Lysin, which is released from sperm at fertilization, dissolves the vitelline coat of the egg in a species-specific manner; bindin is a sperm acrosomal protein that mediates species-specific recognition and binding between the sperm and the egg after the sperm has penetrated the egg jelly. Enforcement of species-specific sperm recognition might be the principal selective force for change in bindin and lysin, as host specificity for sperm recognition requires correct matching of the sperm surface with receptors on the egg. Positively selected codons in abalone lysin lie on the surface of the molecule and are associated with structural features that are thought to be involved in binding13. In terms of the need for specific matching, this system shares many similarities with the host–pathogen studies above.

In summary, reasonable adaptive hypotheses have been proposed to explain how certain patterns of genetic change might have been produced by positive selection. But how do we test whether positive selection actually occurred?

Testing adaptive hypotheses

Positive selection produces an excess of non-synonymous substitutions on a phylogenetic tree. However, this excess alone is not sufficient evidence to invoke positive selection. Support for the hypothesis requires an increase in fitness caused by amino-acid replacements at the putative, positively selected sites. So far, there has been only one test of this hypothesis, using the gene for haemagglutinin, the principal surface antigen of the H3N2 subtype of human influenza A (H3N2 refers to the particular HA and NA gene variants that it contains).

Human influenza evolves so rapidly that vaccine strains must be updated almost yearly. Selection favours haemagglutinin variants that escape recognition by the antibodies that are formed in response to past infection or vaccination. New lineages of H3N2 influenza A that differ in their haemagglutinin arise frequently. As shown in Fig. 2, at any given time several closely related lineages co-circulate. For reasons that are not yet understood, all but a single lineage die out within a few years. Relative fitness, the rate of increase of a genotype relative to other genotypes in a population, is thus unambiguous in this system.

Figure 2: Predicting evolution.

This phylogenetic tree represents a simplified view of influenza A haemagglutinin evolution during a single year. New mutant lineages continually arise and then all but one become extinct. Dots indicate amino-acid replacements at codons known to have been positively selected in the past. In our studies, the single lineage that survives, shown here in bold, has typically undergone the greatest number of additional amino-acid replacements at the known positively selected codons.

We constructed phylogenetic trees that represented the evolution of the H3 human haemagglutinin gene through 11 successive influenza seasons. We found that lineages undergoing the greatest number of new amino-acid replacements at putative, positively selected codons were fitter than other lineages in 9 out of 11 recent influenza seasons5,18; that is, lineages with the most replacements would outcompete all others. These results support the hypothesis that replacement substitutions at positively selected codons more effectively changed the shape of the haemagglutinin with respect to antibody recognition than did substitutions at other codons.
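The ranking rule behind this test can be written down as a short sketch. It assumes that, for each co-circulating lineage, we already have the list of codon positions at which new amino-acid replacements occurred; the lineage names and codon numbers below are invented, and the function name is hypothetical. The sketch shows only the counting step, not the tree reconstruction that precedes it.

```python
def predict_progenitor(lineages, selected_codons):
    """Rank lineages by replacements at known positively selected codons.

    lineages: dict mapping lineage name -> list of codon positions with new
              amino-acid replacements (a position may appear more than once).
    selected_codons: set of codon positions previously identified as
              positively selected.
    Returns (winners, scores); winners lists every lineage tied for the
    highest score, in alphabetical order.
    """
    scores = {
        name: sum(1 for codon in replacements if codon in selected_codons)
        for name, replacements in lineages.items()
    }
    best = max(scores.values())
    winners = sorted(name for name, score in scores.items() if score == best)
    return winners, scores


# Invented data for one season.
selected = {138, 145, 156}
season = {
    "lineage A": [50, 138, 145],
    "lineage B": [145],
    "lineage C": [10, 20],
}
winners, scores = predict_progenitor(season, selected)
print(winners, scores)
```

Ties are returned rather than broken, because the rule as stated gives no basis for choosing between lineages with equal counts.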

Predicting evolution

In the previous section, I showed that positive selection can explain the high rate of non-synonymous versus synonymous substitution in human influenza haemagglutinin. In essence, these retrospective tests involved going back in time to see whether we could predict subsequent evolution in our 11 years of data. Our predictions were successful in 9 out of 11 years.

Our studies are based on the assumption that the selective pressure on influenza during our study period was directed towards avoiding immune recognition. We also assume that this selective pressure persists today and, based on this assumption, propose that circulating strains with the most additional mutations at these same positively selected codons at present will be the progenitors of future influenza lineages. It remains to be seen how well our hypothesis holds up.

Influenza is perhaps the only natural system at present available in which it is possible to try to predict evolution at the population level. This is due to three main factors. First, haemagglutinin evolves very rapidly, allowing us to observe change easily. Second, haemagglutinin is one of the best-studied genes in terms of positive selection5,18. One reason for this is the high quality of the available data. Sequences used in our work and that of many other recent studies were generated by the Influenza Branch of the US Centers for Disease Control and Prevention (CDC, see link) as part of the World Health Organization (WHO) influenza surveillance programme (see link). Sequences that date back to the 1968 emergence of the H3N2 subtype in humans are available, along with data on the date of collection and laboratory culture, through the Influenza Sequence Database at Los Alamos National Laboratory (see link). Third, prediction might be limited to influenza because of the unambiguous measure of fitness available in this system, which is assessed by the survival or extinction of a particular lineage.

The only system for which the wealth of sampling data approaches influenza is the human immunodeficiency virus (HIV), which also evolves rapidly. However, in contrast to influenza, many new mutant lineages of HIV survive at the population level, rather than just one. There is no clear means by which to compare the fitness of different HIV isolates in a population on a real-time basis. Linear replacement of HIV-1 strains over time has been seen in individual human hosts25, so it might be possible to predict its evolution in this limited context. In influenza, prediction has a direct clinical application in terms of vaccine strain selection. Unfortunately, the equivalent for HIV — developing vaccines for individual HIV-infected patients on the basis of evolution of the virus within their bodies — is not possible at this time.

Evolutionary prediction might be feasible in other model systems. Wichman, Bull and collaborators showed that genetic change occurred over time at many of the same positions in two related bacteriophage strains that evolved on Escherichia coli hosts in the laboratory26,27. Several of these sites showed evidence of positive selection, and these sites made up a disproportionately large share of positions that differed between the two parental phage strains. The positively selected residues were all surface exposed. Site-directed mutagenesis shows that these residues affect host binding, but lie outside of the putative binding site. Whether the positively selected residues are involved directly in binding is, as yet, unknown, but the hypothesis that these residues increase fitness could be tested using direct-competition experiments.

Potential pitfalls

We encountered three problems in our work on influenza A that are rarely, if ever, addressed in other studies of substitution rates5,18. The experimental pitfalls that I describe in this section are not specific to influenza and concern errors in phylogenetic reconstruction5, artefacts caused by laboratory evolution28 and sampling bias29.

Phylogenetic uncertainty. Insufficient sampling can cause error in phylogenetic reconstruction, and consequently error in identifying codons that are under positive selection. One way to estimate sampling error is a statistical technique called bootstrap analysis30. In this technique, new data sets are created by randomly sampling characters from the original data set. The resulting data sets are the same size as the original, but some characters have been left out and others duplicated. The bootstrap value of a node (a branch division on a tree) is the percentage of times that node is present in the set of trees constructed from the new data sets. A bootstrap value of 95% or higher is typically considered good statistical support for a node. We obtained poor bootstrap support for a large number of nodes in our influenza A haemagglutinin tree. When we examined hundreds of equally plausible trees we found an excess of non-synonymous substitutions at some codons in only a small number of trees5. Because we planned further studies based on these results18, we limited our list of putative, positively selected codons to those present in most trees. I know of only one other study that examined this problem: analyses using the HIV-1 gp120 gene were reported to be robust to error in tree topology4.
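The resampling step of a bootstrap analysis is simple to sketch. In the code below, characters are alignment columns, and each replicate draws as many columns as the original, with replacement. A real analysis rebuilds a tree from every replicate and asks whether a given node is present; to keep the example self-contained, that step is replaced by a toy criterion (two taxa are each other's nearest neighbours by Hamming distance), which is an assumption of this sketch and not a tree-building method. All names and data are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Returns a new alignment of the same size as the original.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def are_nearest_neighbours(alignment, t1, t2):
    """Toy stand-in for 'the node joining t1 and t2 is present in the tree'."""
    d = hamming(alignment[t1], alignment[t2])
    others = [t for t in alignment if t not in (t1, t2)]
    return all(d < hamming(alignment[t1], alignment[o]) and
               d < hamming(alignment[t2], alignment[o]) for o in others)


def bootstrap_support(alignment, t1, t2, replicates=1000, seed=0):
    """Percentage of replicates in which the (toy) grouping is recovered."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    hits = sum(
        are_nearest_neighbours(bootstrap_alignment(alignment, rng), t1, t2)
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Invented alignment: A and B are identical, C and D differ at every site.
toy = {"A": "AAAAAAAAAA", "B": "AAAAAAAAAA",
       "C": "CCCCCCCCCC", "D": "GGGGGGGGGG"}
print(bootstrap_support(toy, "A", "B"))
```

Because every column in this invented alignment groups A with B, every replicate recovers the grouping; with real data, conflicting columns are sampled unevenly and support falls below 100%.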

Laboratory evolution. The study of adaptive evolution focuses on pathogens because of their medical importance and because they evolve quickly enough to be studied in real time. Many pathogens also quickly adapt to laboratory culture conditions; so, sequences obtained from culture might contain artefacts that introduce error into substitution rate analysis. A pertinent example involves human influenza, which is typically cultured inside chicken eggs. Egg-adapted amino-acid replacements are known to occur around the receptor-binding pocket of the haemagglutinin protein. If undetected, these mutations will be assigned as an extra mutation on the terminal branch of a phylogenetic tree (Box 1). This is simply because the affected sequence is grouped with its nearest relative on the basis of similarity at the hundreds of unaffected codons in addition to the egg-adapted codon. We estimated that about 8% of the amino-acid replacements in our influenza data set were egg-adapted artefacts28. To prevent lab artefacts from affecting our analyses, we eliminated all mutations assigned to terminal branches before calculating substitution rates.

Laboratory evolution results in amino-acid replacements in both the gp120 envelope glycoprotein of HIV-1 (Ref. 31) and the VP1 capsid protein of foot-and-mouth disease virus32. It would be interesting to know whether laboratory artefacts also affect the substitution rate studies of these pathogens4,23,33.

Sampling bias. Influenza isolates sent to the CDC from the WHO collection centres are screened for antigenic similarity to known circulating strains. Isolates that are antigenically indistinguishable from reference strains on the basis of a haemagglutinin-binding test are typically not sequenced. This purposeful sampling bias increases the number of non-synonymous substitutions in our data even when these substitutions imparted no selective advantage to the virus in nature. This sampling bias is most pronounced in the class of substitutions assigned to the terminal branches of the tree29. So, when we eliminated mutations on terminal branches to minimize the effects of laboratory evolution on our analyses, we also reduced the degree to which we overestimated the non-synonymous rate because of sampling bias. I know of no other studies in which the effects of sampling bias have been examined.

Elimination of the mutations that were assigned to the terminal branches of our haemagglutinin tree resulted in a 70% reduction in the number of mutations available for substitution rate analysis. However, the remaining data contained strong evidence for positive selection. Selectively advantageous mutations are, by definition, retained in a population longer than are neutral or deleterious substitutions that occur at the same time. Changes that persist in the population will be assigned to the internal (as opposed to terminal) branches of phylogenetic trees (Fig. 3a). It is difficult to attribute an excess of non-synonymous mutations to positive selection when they occur on lineages that quickly become extinct (Fig. 3b).

Figure 3: Adaptive evolution.

The inference we draw from relative substitution rate analysis should be interpreted in the light of where the substitutions appear on the tree. In this illustration, both trees have four non-synonymous substitutions at a single codon, shown as dots, but no synonymous substitutions. Do the two trees provide equal evidence of positive selection? a | Each of the mutations on the tree swept to fixation in the population, implying that they were selectively advantageous. b | The same mutations occurred in lineages that quickly became extinct. The pattern seen in a is therefore stronger evidence for positive selection. In our work on influenza, we eliminated all terminal mutations in our analyses5,18.
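As a minimal sketch of this filtering step, suppose each mutation mapped onto the tree has been recorded with its codon, its class (non-synonymous or synonymous) and the type of branch to which it was assigned. The record layout and the example data are assumptions made for illustration; assigning mutations to branches requires a reconstructed tree and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    codon: int   # codon position in the gene
    kind: str    # "NS" (non-synonymous) or "S" (synonymous)
    branch: str  # "internal" or "terminal"


def drop_terminal(mutations):
    """Keep only mutations assigned to internal branches of the tree."""
    return [m for m in mutations if m.branch == "internal"]


def tally(mutations):
    """Return (non_synonymous, synonymous) counts for a list of mutations."""
    ns = sum(m.kind == "NS" for m in mutations)
    s = sum(m.kind == "S" for m in mutations)
    return ns, s


# Invented data: two of the three replacements sit on terminal branches,
# where egg adaptation and sampling bias are most likely to act.
mutations = [
    Mutation(145, "NS", "internal"),
    Mutation(145, "NS", "terminal"),
    Mutation(12, "S", "internal"),
    Mutation(226, "NS", "terminal"),
]
print(tally(mutations))                 # counts before filtering
print(tally(drop_terminal(mutations)))  # counts on internal branches only
```

The filter discards data, as the 70% reduction reported above shows, so it trades statistical power for protection against the two artefacts.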

Future applications

Although the prospect of predicting evolution is exciting, predictions can be verified only in very rapidly evolving systems. More practical applications of substitution rate analysis include identifying epitopes for vaccine development, constructing theoretical models of protein structure, and interpreting the results of genome sequencing projects.

As described above, all of the putative, positively selected codons of the influenza virus are located in known antibody-binding sites on the exposed surface of the haemagglutinin5. If these binding sites had not been previously identified, our analyses would have pointed to their location. Identifying functionally important sites might develop as one of the chief uses of substitution rate analysis. These methods could be particularly helpful in searching for conformational epitopes — antigenic structures composed of non-contiguous residues that lie near one another only when the protein is correctly folded. These might appear in a gene as scattered codons that show similar evidence of positive selection.

Results of positive selection studies can also be used to help guide construction of theoretical protein structure models. For example, the structures of many porins have yet to be resolved using X-ray crystallography. Protein purification seems to destroy bonds between the porin and other components of the cell membrane that are crucial to its three-dimensional conformation. In the porB gene of Neisseria gonorrhoeae3, the putative, positively selected segments lie on the exposed loops of the porin. We might reasonably expect regions of other porins to show this pattern. In another example, Ishimizu et al.34 identified four regions in the seminal RNase (S-RNase) gene that have an excess of non-synonymous substitutions. This gene is associated with the self-incompatibility system in the Rosaceae. Homology searches based on predicted secondary structure indicate that these four regions are exposed on the surface of the style (the portion of the female reproductive organ on which pollen grains attach and germinate) and thus are candidate sites for recognition of self-derived pollen. These results indicate that identifying the surface-exposed segments of a protein using substitution rate analysis could, along with two-dimensional structure prediction, provide a basis for constructing three-dimensional models of proteins that lack amino-acid homology with proteins of known crystal structure.

Whole-genome surveys provide another exciting area to apply these techniques. Open reading frames obtained from genome surveys are screened for possible function using homology searches against sequences already held in GenBank. Simultaneous calculation of the number of synonymous and non-synonymous differences between homologous sequence pairs could help to identify genes that are under intense natural selection. Results from more sophisticated screens could benefit from substitution rate analysis as well. For instance, Intercell (see link), in collaboration with The Institute for Genomic Research (TIGR, see link), recently announced (at the American Society for Microbiology — TIGR 2001 conference on microbial genomes) an antigen identification technique in which the peptide products of shredded genomic DNA from the Staphylococcus aureus genome were exposed to human antibodies. The question is whether the peptide regions to which antibodies bind in such a screen are also antigenic in their natural form. One might pursue this question by sequencing the same regions in related organisms and by contrasting non-synonymous and synonymous substitution rates. On the basis of the data reviewed above, a high rate of non-synonymous substitution would provide an excellent reason to suspect that a region binds antibodies in its intact as well as in its shredded form.

In summary, we now have analytical methods to identify genes, gene segments and individual codons that are under selective pressure to change. If the evolutionary forces driving this selection can be identified, predicting the future course of evolution might be possible in systems such as rapidly evolving pathogens. The broader applications of these methods are exciting and diverse, as they bring a new research tool to vaccine design, genome sequence interpretation and protein structure prediction.