


DNA Research Advance Access published online on September 25, 2007

DNA Research, doi:10.1093/dnares/dsm015
All versions of this article: 14/4/141 (most recent), dsm015v1

© The Author 2007. Kazusa DNA Research Institute
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Reverse Polarization in Amino Acid and Nucleotide Substitution Patterns Between Human–Mouse Orthologs of Two Compositional Extrema

Sumit K. Bag1, Sandip Paul1, Subhagata Ghosh2 and Chitra Dutta1,2,*

1 Bioinformatics Centre, Indian Institute of Chemical Biology, Kolkata 700 032, India
2 Structural Biology and Bioinformatics Division, Indian Institute of Chemical Biology, 4, Raja S. C. Mullick Road, Kolkata 700 032, India

Received 15 June 2007; accepted 18 June 2007.


    Abstract
 Top
 Abstract
 1. Introduction
 2. Materials and methods
 3. Results
 4. Discussion
 Funding
 Supplementary data
 Acknowledgements
 References
 
Genome-wide analysis of sequence divergence patterns in 12 024 human–mouse orthologous pairs reveals, for the first time, that the trends in nucleotide and amino acid substitutions in orthologs of high and low GC composition are highly asymmetric and polarized to opposite directions. The entire dataset has been divided into three groups on the basis of the GC content at third codon sites of human genes: high, medium, and low. High-GC orthologs exhibit significant bias in favor of the replacements, Thr -> Ala, Ser -> Ala, Val -> Ala, Lys -> Arg, Asn -> Ser, Ile -> Val etc., from mouse to human, whereas in low-GC orthologs, the reverse trends prevail. In general, in the high-GC group, residues encoded by A/U-rich codons of mouse proteins tend to be replaced by the residues encoded by relatively G/C-rich codons in their human orthologs, whereas the opposite trend is observed among the low-GC orthologous pairs. The medium-GC group shares some trends with high-GC group and some with low-GC group. The only significant trend common in all groups of orthologs, irrespective of their GC bias, is (Asp)Mouse -> (Glu)Human replacement. At the nucleotide level, high-GC orthologs have undergone a large excess of (A/T)Mouse -> (G/C)Human substitutions over (G/C)Mouse -> (A/T)Human at each codon position, whereas for low-GC orthologs, the reverse is true.

Key words: high-GC orthologs; low-GC orthologs; amino acid replacement matrix; nucleotide replacement matrix; sequence divergence


    1. Introduction
 
Mammalian genomes are highly heterogeneous in base composition. They are composed of long stretches of DNA with distinct GC composition, commonly known as isochore structures [1–4] or GC-content domains [5]. The local GC composition correlates with a number of important genomic features, such as gene density, gene length, patterns of gene expression, repeat element distribution, and recombination rate [6–11]. Evolutionary stability of the GC-content distribution has been demonstrated for mice and humans on a genome-wide level [12]: sequences that are GC rich in one genome were shown to be GC rich in the other, and vice versa. Finding such a one-to-one correspondence between the local GC distribution patterns in mouse and human was, however, not trivial. Since the divergence of the rodent and primate lineages around 84–121 million years ago [13,14], multiple substitutions might have occurred independently in the two lineages at the same sites of a pair of mouse–human orthologs. Had there not been a strong directionality of the selection process(es) prevailing over random mutation and fixation, such multiple substitutions should have randomized the local GC distribution patterns in the two genomes. Invariance of the overall patterns of GC distribution along the chromosomes of mouse and human therefore suggests that there might be some well-defined trends in the nucleotide and/or amino acid substitution patterns across these two species. The present study was designed to determine such trends, if any.

A number of efforts have been made earlier to determine the evolutionary trends in mammalian genomes, but no definite conclusion could be reached. On the basis of the analysis of orthologous gene sequences from closely related species, it has been proposed that GC-rich regions of primate and cetartiodactyl genomes are becoming GC poorer, i.e. GC-rich isochores are now vanishing in these lineages [15–18]. Alvarez-Valin et al. [19], however, described the ‘vanishing isochores’ effect as an artifact created by inaccurate reconstruction of ancestral GC levels in such studies [15], offering evidence for an AT substitution bias within the repetitive elements of mammals. On the contrary, the maximum parsimony analysis conducted by Gu and Li [20] argued for recent enrichment of the GC content of GC-rich genes in some genomes, e.g. the rabbit. The direction(s) of evolution of mammalian genes therefore remains a matter of conjecture. Did mammalian genes of varying GC bias follow distinct evolutionary trajectories, and if so, to what extent could they influence the evolution of the encoded proteins? To address these questions, the present study carried out a genome-scale analysis of the trends in nucleotide and amino acid substitutions between human and mouse orthologous pairs of varying GC content. The analysis showed that there are indeed definite trends, not only in nucleotide but also in amino acid substitution patterns, between mouse and human orthologous pairs, and that these trends are, in general, highly asymmetric and polarized in opposite directions in the high-GC and low-GC sets of orthologs, in such a way that, in the course of evolution, the compositional heterogeneity has been significantly enhanced in coding regions in human compared with that in mouse.


    2. Materials and methods
 
2.1. Sequence retrieval
Nucleotide sequences of 12 405 pairs of orthologous coding regions of human and mouse were extracted from the Searchable Prototype Experimental Evolutionary Database (SPEED) [21] (http://www.bioinfobase.umkc.edu/speed/), using an in-house program developed in Perl. To minimize sampling errors, a total of 174 sequences that were shorter than 100 codons in either organism were excluded from the analysis. The remaining sequences were subjected to a codon integrity check using a freely available program, CodonW [22] (http://www.molbiol.ox.ac.uk/cu/), and the dataset was further screened to remove redundant sequences. The final dataset of human–mouse orthologs contains 12 024 nonredundant sequences. The corresponding nonredundant protein sequences were generated using a C program developed in-house.

2.2. Classification of orthologs in three compositional groups
The pairs of orthologous sequences under study exhibited significant correlations, not only between the overall GC contents, but also between the GC contents at third codon sites (GC3), as shown in Fig. 1A and B. These orthologs were then classified into three groups according to the GC3 contents of the human genes: the low-GC group with (GC3)Human < 50%, the medium-GC group with 50% ≤ (GC3)Human ≤ 70%, and the high-GC group with (GC3)Human > 70%. The numbers of pairs of orthologous genes in these three groups were comparable with one another (3896, 3960, and 4168 pairs in the high-, medium-, and low-GC groups, respectively). The sequences in these three groups were used to examine the trends in amino acid and nucleotide substitution patterns.
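The grouping rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `gc3_percent` and `classify_by_gc3` are hypothetical and are not part of the in-house programs mentioned in the text, and the input is assumed to be an in-frame coding sequence written in the DNA alphabet.

```python
def gc3_percent(cds: str) -> float:
    """Return the GC percentage at third codon positions of a coding sequence.

    Assumes `cds` is in frame; any trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # length covered by complete codons
    thirds = cds[2:usable:3]              # every third base, starting at index 2
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    gc = sum(base in "GC" for base in thirds)
    return 100.0 * gc / len(thirds)


def classify_by_gc3(gc3: float) -> str:
    """Assign a compositional group using the thresholds stated in the text."""
    if gc3 < 50.0:
        return "low"
    if gc3 <= 70.0:
        return "medium"
    return "high"
```

In this sketch each orthologous pair would be assigned to a group by calling `classify_by_gc3(gc3_percent(human_cds))` on the human member of the pair.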


Figure 1. Scatter plot of (A) overall GC content (%) and (B) GC content at third codon sites (%) of 12 024 orthologous genes of human and mouse with their correlation coefficient values.

 
The total dataset was also classified into another three groups on the basis of the GC3 contents of the mouse genes: the new low-GC group with (GC3)Mouse < 50%, the new medium-GC group with 50% ≤ (GC3)Mouse ≤ 70%, and the new high-GC group with (GC3)Mouse > 70%. The entire analysis carried out with the high-, medium-, and low-GC groups was re-checked with these new high-, new medium-, and new low-GC groups of orthologs.

We classified the datasets on the basis of the GC3 of coding sequences rather than the overall GC, because the GC3 contents of mammalian genes are known to exhibit strong correlation with the GC content of the genomic region where the genes are located [23,24].

2.3. Analysis of amino acid substitution patterns: evaluation of amino acid replacement matrix (AARM) for different groups of orthologs
The alignments of the orthologous sequences of the three groups were created separately using the pairwise alignment program ClustalW [25], and only gap-free aligned regions of length >100 residues were considered, to avoid spurious short alignment regions. The numbers of pairs of aligned orthologous genes with gap-free regions of length >100 residues were lower than in the previous set (3659, 2669, and 3291 pairs in the high-, medium-, and low-GC groups, respectively). Amino acid replacements were calculated for all gap-free alignment regions of >100 residues and also for fully aligned sequences including gaps. The replacement data are represented in a 20 x 20 matrix, designated the amino acid replacement matrix (AARM), as shown in Tables 1, 2, and 3 for gap-free alignment regions of >100 residues. (To avoid confusion with standard amino acid substitution matrices such as PAM or BLOSUM, we use the term ‘replacement’ matrix.) Each element of the AARM is the ratio of the number of forward replacements to the number of backward replacements for a specific pair of residues, i.e. the value of any element Rij represents the ratio of the total number of [i]Mouse -> [j]Human replacements to the number of [j]Mouse -> [i]Human replacements. If Rij > 1, there is an overall gain in amino acid residue j at the cost of residue i in human compared with mouse. If Rij < 1, the reverse is true. The actual numbers of forward and backward replacements for all possible pairs of amino acid residues in the high-, medium-, and low-GC groups are given in Supplementary Table S1a–c. The diagonal positions of the matrices represent identical residues; all other elements represent nonidentical substitutions. The replacement values for the gap-free alignment regions did not differ significantly from those obtained by alignment of full sequences including gaps.
To test whether there were any significant intra-group variations in the replacement values, subsets of 500 pairs of sequences were taken, both sequentially from start to end and randomly, from the entire dataset of a specific group of orthologs (i.e. the high-, medium-, or low-GC group), and a 20 x 20 AARM was determined for each subset of sequence pairs. The replacement values obtained from different subsets of any particular group were then compared, and no significant variations were found for individual residue pairs within a group. All these computations were done using the Substitution Pattern Analysis Software Tool (SPAST), a program in C developed in-house.
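The definition of Rij given above can be illustrated with a minimal Python sketch. It is not SPAST, whose source is not described here; the function names `replacement_counts` and `replacement_ratio` are hypothetical. As a simplifying assumption, the sketch skips individual gapped columns rather than selecting gap-free regions of >100 residues, and it ignores identical (diagonal) columns.

```python
from collections import Counter


def replacement_counts(aligned_pairs):
    """Count mouse -> human residue replacements over aligned sequence pairs.

    `aligned_pairs` is an iterable of (mouse, human) strings of equal length,
    with '-' marking alignment gaps.
    """
    counts = Counter()
    for mouse, human in aligned_pairs:
        if len(mouse) != len(human):
            raise ValueError("aligned sequences must have equal length")
        for m, h in zip(mouse, human):
            if m == "-" or h == "-":
                continue              # skip gapped columns
            if m != h:
                counts[(m, h)] += 1   # nonidentical column: i(mouse) -> j(human)
    return counts


def replacement_ratio(counts, i, j):
    """R_ij = N(i_mouse -> j_human) / N(j_mouse -> i_human).

    Returns None when no backward replacement was observed (ratio undefined).
    """
    backward = counts[(j, i)]
    if backward == 0:
        return None
    return counts[(i, j)] / backward
```

A value above 1 returned by `replacement_ratio(counts, "I", "G")` would correspond to a net gain of Gly at the cost of Ile in human relative to mouse, in the sense defined in the text.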


Table 1. Amino acid replacement matrix (AARM) for the high-GC group of human–mouse orthologs

 


Table 2. Amino acid replacement matrix (AARM) for the medium-GC group of human–mouse orthologs

 


Table 3. Amino acid replacement matrix (AARM) for the low-GC group of human–mouse orthologs

 
2.4. Analysis of nucleotide substitution patterns: evaluation of nucleotide replacement matrix (NRM) for three codon sites of different groups of orthologs
We created the nucleotide sequence alignments on the basis of the amino acid alignments and calculated the nucleotide replacements in the form of a 4 x 4 NRM, separately for each of the three codon positions in each of the three groups of orthologs under study. The elements rij of the NRM represent the ratio of the number of forward replacements to that of backward replacements for a specific pair of nucleotides. The nucleotide replacement values obtained from different subsets of 500 orthologous sequences (taken sequentially from start to end and also randomly) of any particular group were then compared, and no significant variations were found for individual nucleotide pairs within a group.
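A per-codon-position count of this kind might be sketched as follows. The names `codon_position_counts` and `at_to_gc_ratio` are hypothetical, and the sketch assumes that each aligned pair is in frame, with gaps occurring only as whole codons (as would be expected when the nucleotide alignment is derived from the protein alignment). The pooled (A/T) -> (G/C) ratio is an illustrative summary, not an element of the NRM itself.

```python
from collections import Counter


def codon_position_counts(aligned_cds_pairs):
    """Return three Counters, one per codon position, of mouse -> human base changes."""
    per_position = [Counter(), Counter(), Counter()]
    for mouse, human in aligned_cds_pairs:
        if len(mouse) != len(human):
            raise ValueError("aligned sequences must have equal length")
        for index, (m, h) in enumerate(zip(mouse.upper(), human.upper())):
            # Only count columns where both bases are unambiguous and differ.
            if m in "ACGT" and h in "ACGT" and m != h:
                per_position[index % 3][(m, h)] += 1
    return per_position


def at_to_gc_ratio(counts):
    """Pooled ratio of (A/T) -> (G/C) to (G/C) -> (A/T) changes; None if undefined."""
    forward = sum(n for (m, h), n in counts.items() if m in "AT" and h in "GC")
    backward = sum(n for (m, h), n in counts.items() if m in "GC" and h in "AT")
    return forward / backward if backward else None
```

An individual NRM element rij for one codon position would be obtained from the same counters as `counts[(i, j)] / counts[(j, i)]` whenever the denominator is non-zero.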

2.5. Tests for statistical significance of different elements (Rij/rij) of AARM and NRM
For a given pair of amino acids or nucleotides, the mouse to human replacement was taken as the forward direction and human to mouse as the reverse direction, and each Rij in AARM or rij in NRM represents the ratio of number of replacements of the residue i by the residue j in the forward direction (mouse to human) to that in the reverse direction (human to mouse). This means that if Rij (or rij) > 1, the number of (i)Mouse -> (j)Human replacements is higher than the number of (j)Mouse -> (i)Human replacements, and if Rij (or rij) < 1, the reverse is true.

For each pair of replacements, the ratio of forward to reverse replacements was expected to be 1:1 under unbiased conditions. To test this hypothesis, the observed and expected numbers (based on a 1:1 ratio) were recorded for each pair in a particular group. In all cases, the chi-square test was used to assess the significance of the directional bias, if any, at p = 0.001 and p = 0.0001. For each pair of replacements, the first and second rows of the 2 x 2 contingency table represented, respectively, the number of replacements from one particular residue (say, i) to another (say, j) of the pair and the total count of the remaining replacements from residue i (to any residue k, where k != j). The procedure was also repeated for the replacements in subsets of 500 sequences taken sequentially from start to end and randomly. The trends that are significant (at p < 0.001 or p < 0.0001) for the whole dataset are also consistent with those in these subsets.
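The 1:1 expectation can be illustrated with a simplified goodness-of-fit calculation. Note the assumption: this sketch tests the forward and backward counts of a single pair directly against a 1:1 ratio with one degree of freedom, which is simpler than the 2 x 2 contingency table described above and is not claimed to reproduce the published p-values. The function name `chi_square_one_to_one` is hypothetical.

```python
import math


def chi_square_one_to_one(forward: int, backward: int):
    """Chi-square goodness-of-fit (1 d.f.) of forward/backward counts against 1:1.

    Returns (chi2, p_value).
    """
    total = forward + backward
    if total == 0:
        raise ValueError("no replacements observed")
    expected = total / 2.0
    chi2 = ((forward - expected) ** 2 + (backward - expected) ** 2) / expected
    # For 1 d.f., the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With this sketch, a pair would be flagged at the stricter level used in the text when the returned `p_value` is below 0.0001.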

2.6. Correspondence analysis (COA) on relative synonymous codon usage (RSCU) and estimation of synonymous and nonsynonymous substitution
Correspondence analysis on RSCU [26] was performed using the CodonW 1.4.2 [22] program to identify the major factors influencing the variation in synonymous codon usage in the three groups of orthologous sets. These analyses generate a series of orthogonal axes to identify trends that explain the variation within a dataset, with each subsequent axis explaining a decreasing amount of the variation.
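For readers unfamiliar with RSCU, the quantity on which the correspondence analysis operates can be sketched as below: the RSCU of a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid. This is a minimal illustration and not CodonW itself; the function name `rscu` is hypothetical, and the codon table is deliberately restricted to three amino acids, whereas a real analysis would cover every amino acid with more than one codon. The correspondence analysis step is not shown.

```python
from collections import Counter

# Illustrative subset of the standard genetic code (DNA alphabet).
SYNONYMOUS_FAMILIES = {
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
    "Lys": ["AAA", "AAG"],
    "Asn": ["AAT", "AAC"],
}


def rscu(cds: str, families=SYNONYMOUS_FAMILIES):
    """Return {codon: RSCU} for the codon families present in an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codon_counts = Counter(cds[k:k + 3] for k in range(0, usable, 3))
    values = {}
    for codons in families.values():
        total = sum(codon_counts[c] for c in codons)
        if total == 0:
            continue                      # amino acid absent: RSCU undefined
        mean = total / len(codons)        # expected count under equal usage
        for c in codons:
            values[c] = codon_counts[c] / mean
    return values
```

An RSCU above 1 indicates a codon used more often than expected under equal usage of its synonyms, so G/C-ending codons would tend to have RSCU above 1 in GC3-rich genes.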

To examine the nucleotide substitution patterns, we estimated the number of synonymous substitutions per synonymous site, dS, and the number of nonsynonymous substitutions per nonsynonymous site, dN, for ten sets of 500 randomly chosen pairs of orthologs in each of the three groups using the MEGA program (version 2.1), as described by Nei and Gojobori [27]. The values of dS, dN, and dN/dS of the three orthologous groups were compared by t-test.


    3. Results
 
3.1. Specific trends in amino acid substitution patterns between mouse and human orthologs
To investigate whether the mouse and human proteins of high-, medium-, and low-GC composition followed the same or different evolutionary trajectories since the divergence of the two species, trends in amino acid substitution between the human–mouse orthologous pairs were studied individually in the three groups of genes, using the in-house program SPAST. Tables 1, 2, and 3 present the AARMs for aligned regions of orthologous pairs (gap-free regions of length >100 residues at a stretch) in the high-, medium-, and low-GC groups, respectively. As already mentioned, mouse to human replacements were taken by convention as the forward direction and human to mouse as the reverse direction.

For each group of orthologs, some specific amino acid pairs exhibit significant bias in the replacement patterns. For instance, in the high-GC group, the value of RIG, i.e. the ratio of [Ile]Mouse -> [Gly]Human to [Gly]Mouse -> [Ile]Human replacements, is 2.51 (p < 0.0001), implying that replacement of Ile in the mouse sequence by Gly in human is >2.5-fold more frequent than replacement in the reverse direction, i.e. of Gly in the mouse sequence by Ile in human. On the contrary, in the low-GC group, RIG = 0.48, indicating that in low-GC orthologs, substitution of Ile in the mouse sequence by Gly in human is more than two-fold less frequent than the reverse substitution. For the medium-GC group, the value of RIG is not statistically significant, suggesting that the frequencies of substitution of Ile by Gly and of Gly by Ile are comparable in medium-GC orthologs of mouse and human.

As Rij = 1/Rji, out of the 380 off-diagonal elements of an AARM (Tables 1, 2, and 3), only 190 are independent. Out of these 190 AARM elements, 53 are significantly biased in a specific direction for high-GC group (46 at p < 0.0001 and seven at p < 0.001), whereas for low-GC group, 67 elements are found to be significantly biased (56 and 11 at p < 0.0001 and 0.001, respectively). In the medium-GC group, only 15 AARM elements are statistically significant (ten at p < 0.0001 and five at p < 0.001). Some of them are shared with the high-GC group and some with the low-GC group.
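The count of independent elements follows directly from the reciprocity of the ratios. As a short worked check, written in LaTeX with N(.) denoting a replacement count (a symbol introduced here only for illustration):

```latex
\[
R_{ij} = \frac{N(i_{\mathrm{Mouse}} \to j_{\mathrm{Human}})}{N(j_{\mathrm{Mouse}} \to i_{\mathrm{Human}})}
\quad\Longrightarrow\quad
R_{ij}\,R_{ji} = 1,
\]
\[
20 \times 19 = 380 \ \text{off-diagonal elements},
\qquad
\binom{20}{2} = \frac{380}{2} = 190 \ \text{independent elements}.
\]
```

For example, the high-GC value RIG = 2.51 quoted above fixes the reciprocal element at RGI = 1/2.51, approximately 0.40, so the two elements carry the same information.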

3.2. Significant trends in amino acid substitution in high-GC and low-GC groups are, in general, opposite to one another
A careful examination of Tables 1, 2, and 3 reveals that when i represents a residue encoded by A/U-rich codons and j represents a residue encoded by relatively G/C-rich codons, the AARM element Rij is, in most cases (but not all), >1 in the high-GC group, <1 in the low-GC group, and close to 1 in the medium-GC group. The reverse situation occurs, in general, when i represents a residue encoded by G/C-rich codons and j a residue encoded by relatively A/U-rich codons. For instance, for i = Ala (A) (encoded by GCN), RAj is significantly <1 in the high-GC group and significantly >1 in the low-GC group for j = Ile, Asn, Lys, Ser, Thr, Val, Glu etc. (encoded, respectively, by AUH, AAY, AAR, UCN/AGY, ACN, GUN, and GAR). On the contrary, for i = Asn (N) (encoded by AAY), RNj > 1 for the high-GC group and <1 for the low-GC group when j = Gly, Ala, or Arg (encoded by GGN, GCN, and CGN/AGR, respectively). Altogether, 33 AARM elements are polarized in opposite directions (>1 and <1) in the high- and low-GC groups and are statistically significant in both groups.

Table 4 lists the 15 amino acid pairs having the largest differences between the total numbers of forward (mouse to human) and backward (human to mouse) substitutions for the three groups of orthologs under study. Eight pairs of residues (marked with ±) appear among the top 15 trends in both the high- and low-GC groups, but with opposite directionality (Table 4). Four other pairs of residues among the top 15 of the high-GC group (marked with +) exhibit significant, but opposite, bias in the low-GC group (Table 4), but are not among the top 15 in the latter group. Similarly, four pairs (marked with –) among the top 15 of the low-GC group show opposite and significant bias in the high-GC group, but are not among its top 15. Thus, the trends in amino acid substitutions between the mouse and human orthologs follow reverse directionality, in general, in the high- and low-GC groups. Among the top 15 trends in the medium-GC group, some share directionality with the high-GC group and some with the low-GC group.
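The ranking criterion used for Table 4, the difference between forward and backward replacement counts, can be sketched as follows. The function name `top_biased_pairs` is hypothetical, and the input is assumed to be a mapping from (mouse residue, human residue) to an observed replacement count; the numbers in any usage of this sketch would be illustrative, not values from the paper.

```python
def top_biased_pairs(counts, n=15):
    """Rank unordered residue pairs by the excess of one direction over the other.

    Returns up to `n` tuples (i, j, excess), oriented so that i(mouse) -> j(human)
    is the dominant direction and excess = N(i -> j) - N(j -> i).
    """
    seen = set()
    ranked = []
    for (i, j) in counts:
        key = frozenset((i, j))
        if i == j or key in seen:
            continue                       # skip diagonal and already-handled pairs
        seen.add(key)
        forward = counts.get((i, j), 0)
        backward = counts.get((j, i), 0)
        if backward > forward:
            # Orient the pair so the reported direction is the dominant one.
            i, j, forward, backward = j, i, backward, forward
        ranked.append((i, j, forward - backward))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked[:n]
```

Comparing the orientation of a pair in the lists produced for the high- and low-GC groups would then show whether the pair follows opposite directionality in the two groups.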


Table 4. Top 15 amino acid pairs of the three orthologous groups according to differences in the numbers of forward (mouse to human) and reverse (human to mouse) replacements in the AARM

 
In the high-GC group, although amino acids of mouse proteins encoded by A/U-rich codons tend to be replaced by amino acids encoded by G/C-rich codons in their human orthologs, not all amino acid residues encoded by A/U-rich codons exhibit equal bias in replacement patterns. Six residues, viz. Phe, Tyr, Met, Ile, Asn, and Lys, are encoded by A/U-rich codons, and four residues, viz. Gly, Ala, Arg, and Pro, are encoded by G/C-rich codons. As can be seen from Tables 1, 2, and 3, among the residues encoded by A/U-rich codons, Ile, Asn, and Lys have more significantly biased replacement ratios (AARM elements) in both the high-GC and the low-GC groups. The replacement ratios of Phe and Tyr, although they follow the general trend, are not statistically significant in most cases. Rather, some other residues such as Ser, Thr, Val, and Leu, which are not necessarily encoded by A/U-rich codons, exhibit significant bias in the replacement ratios (Tables 1 and 3). Similarly, among Gly, Ala, Arg, and Pro, the former two have more significant Rij values. Previous analyses of many prokaryotic genomes [28] and of high-GC rice genes with their Arabidopsis homologs [29] showed that proteins encoded by GC-rich sequences are characterized by increased levels of Gly, Ala, Arg, and Pro residues and a corresponding decrease in Phe, Tyr, Met, Ile, Asn, and Lys residues. It is therefore of interest to examine to what extent the overall usage of the residues Gly/Ala/Arg/Pro and that of Phe/Tyr/Met/Ile/Asn/Lys vary within the mouse and human orthologs of the high-, medium-, and low-GC groups. Our analysis indicates that the mouse and human orthologs of the three groups are indeed characterized by distinct usage profiles of these residues (Supplementary Fig. S1).
In the high-GC group, the human orthologs have higher usage of Gly, Ala, Arg, and Pro and lower usage of Phe, Tyr, Met, Ile, Asn, and Lys compared with their mouse orthologs, but the differences are not as pronounced as those shown previously for homologous gene pairs from rice and Arabidopsis, which differ greatly in GC content [29]. In the low-GC group, the reverse is true, whereas in the medium-GC group, there is no significant difference between mouse and human orthologous pairs in the usage of these two groups of residues.

3.3. (Asp)Mouse -> (Glu)Human trend in all groups of orthologs irrespective of their GC content
Only one replacement ratio, RDE, exhibits the same directionality and almost the same value in all three AARMs (Tables 1, 2, and 3), indicating that in all groups of orthologs, the frequency of (Asp)Mouse -> (Glu)Human replacements is slightly higher than that of replacements in the opposite direction. As the Asp -> Glu replacement is among the top 15 trends in all three groups (Table 4), it is one of the most common trends in amino acid replacement in mouse–human orthologs. These observations suggest that, irrespective of the GC content of the encoding genes, there has been a consistent increase in glutamic acid in human proteins at the cost of aspartic acid compared with their mouse orthologs. The structural and/or functional implications of this unique evolutionary trend are, however, not clear. Two other substitution trends, Ser -> Thr and Phe -> Tyr, also exhibit the same directionality in all three groups under study, but the replacement values are not statistically significant for the high-GC group.

3.4. High-GC orthologs are biased towards (A/T)Mouse -> (G/C)Human replacements, whereas in low-GC orthologs, (G/C)Mouse -> (A/T)Human replacements prevail
As already emphasized, the major trends in amino acid replacements between mouse and human orthologs (Tables 1–4) indicate that in the high-GC group, the amino acid residues encoded by relatively GC-rich codons tend to increase in human proteins compared with their mouse orthologs, and in the low-GC group, the reverse trends prevail. In the medium-GC group, however, there is no such specific directionality in codon substitution patterns. These observations prompted us to examine the trends in nucleotide substitution patterns individually at the three codon positions in the three groups of orthologs. As can be seen from the NRMs shown in Table 5, in the high-GC dataset, rij is significantly greater than 1 when i = A or T and j = G or C. On the contrary, rij is significantly less than 1 when i = G or C and j = A or T. These trends hold at all three codon positions, although the deviation of rij from 1 (for a particular pair of i and j) is highest at the third codon positions, followed by the first and second codon positions. Therefore, in the high-GC group, there has been an excess of (A/T)Mouse -> (G/C)Human replacements over (G/C)Mouse -> (A/T)Human replacements at each codon position individually. For the low-GC group, the reverse situation is encountered (Table 5), i.e. there is a tendency for G and C in mouse genes to be replaced by A or T in their human orthologs, the bias being maximal at the third codon positions. For the medium-GC group, however, no significant difference between (A/T)Mouse -> (G/C)Human and (G/C)Mouse -> (A/T)Human replacements could be observed at the first and second codon sites, whereas at the third codon sites, the (A)Mouse -> (G)Human and (T)Mouse -> (C)Human replacements dominate over the reverse replacements.
These observations imply that for the high-GC group, either the GC content tends to increase in human genes relative to mouse or it tends to decrease in mouse genes relative to human, whereas for the low-GC group, either the GC content tends to decrease in human relative to mouse or it tends to increase in mouse relative to human. This suggests that, with time, there has been either a relative increase in compositional heterogeneity within human genes compared with mouse genes or a decrease in compositional heterogeneity within mouse genes compared with human genes.


Table 5. Nucleotide replacement matrices (NRMs) at three codon positions for human–mouse orthologs of the high-, medium-, and low-GC groups under study

 
Our next task was to see to what extent the observed trends in nucleotide substitution patterns have affected the relative GC divergence between mouse and human orthologs. To this end, the number of genes was plotted against their GC12 (GC content at first and second codon positions) and GC3 values for both mouse and human in all three groups of orthologs (Fig. 2) using STATISTICA (version 6.0). In all cases, normal distributions were obtained (Fig. 2). In the high-GC group, both the GC12 and GC3 distributions in human are skewed towards the right (increasing GC content) compared with mouse (Fig. 2A and B), but for the low-GC group, the reverse is true (Fig. 2E and F). The extent of inter-species divergence in GC distribution is much more apparent at third codon positions (Fig. 2B, D, and F) than at the first and second positions (Fig. 2A, C, and E). For medium-GC orthologs, the medians of both the GC12 and GC3 distributions are almost the same in the two species under study (Fig. 2C and D). These observations imply that the intra-species divergence in base composition is higher in human genes than in their mouse orthologs, such that among the GC-rich pairs of orthologs, human coding sequences are usually higher in GC content than their mouse counterparts, but among the GC-poor orthologous pairs, human coding sequences are, in general, lower in GC content than the respective mouse sequences.


Figure 2. Left panel: distribution of GC content at first and second codon positions among human (blue) and mouse (red) orthologous genes for (A) high-, (C) medium-, and (E) low-GC group with their normal distributions. Right panel: distribution of GC content at third codon position among human and mouse orthologous genes for (B) high-, (D) medium-, and (F) low-GC group with their normal distributions.

 
3.5. Multivariate analysis of synonymous codon usage confirms opposite trends in high-GC and low-GC groups of orthologs
The skewness of GC3 in human genes towards increasing GC in the high-GC group and decreasing GC in the low-GC group compared with mouse orthologs (Fig. 2B, D, and F) is also reflected in the COA of RSCU. Fig. 3A–C shows the axis-1 versus axis-2 plots of the COA on RSCU of genes in the three groups. In all cases, axis-1 exhibits strong negative correlation with GC content at synonymous substitution sites (GC3S). The distribution of human and mouse genes along axis-1 confirms that in the high-GC group (Fig. 3A), human genes exhibit higher usage of G- or C-ending synonymous codons compared with their mouse orthologs, whereas for the low-GC group, the reverse trend dominates (Fig. 3C). For the medium-GC group, as expected, usage of G/C-ending codons is comparable in mouse and human (Fig. 3B).


Figure 3. Positions of orthologous gene pairs between human and mouse along the first two identical principal axes generated by COA on RSCU values of (A) 3896 genes from the high-GC group, (B) 3960 genes from the medium-GC group, and (C) 4168 genes from the low-GC group. The filled and open quadrangles represent mouse and human genes, respectively.

 
It is worth mentioning that the GC contents of the synonymous substitution sites in the mouse and human orthologous pairs exhibit negative correlations in all three groups (supplement-I). These observations are in accordance with the previous report by Takahashi and Nakashima [30].

3.6. Rates of synonymous and nonsynonymous substitutions are the same in all three groups of orthologs
To examine whether the rate of nucleotide substitution between mouse and human orthologs varies among the three groups, the number of synonymous substitutions per synonymous site, dS, and the number of nonsynonymous substitutions per nonsynonymous site, dN, were estimated for 500 randomly selected pairs of orthologs from each group. The value of dS remains almost the same in all three groups (data not shown). The value of dN appears to be lower in the medium-GC group, but its difference from the values for the other two groups is not statistically significant. These observations indicate that although the trends in nucleotide substitutions are polarized in opposite directions in the high- and low-GC groups of orthologs, the rates of synonymous and nonsynonymous substitutions did not vary with the GC bias of the genes.


    4. Discussion
Since the divergence of the rodent and primate lineages, multiple substitutions might have occurred independently in the two lineages at the same site of a mouse–human ortholog pair. Had there not been a strong directionality of the selection process(es) prevailing over random mutational events, such multiple hits should have obscured the true pattern of substitution, if any, between the orthologous pairs. However, the present study has revealed that the nucleotide and amino acid substitution patterns in mouse–human orthologs follow definite trends that are highly asymmetric and polarized in opposite directions in the high- and low-GC groups. This suggests a definite directionality in gene/protein evolution, either towards increasing compositional divergence in human protein-coding regions relative to mouse or towards decreasing compositional divergence in mouse protein-coding regions relative to human. It is true that GC content shows evolutionary stability between mouse and human, i.e. orthologs have similar GC contents in the two species; but among the high-GC orthologs, human genes are slightly higher in GC content than their mouse orthologs, whereas among the low-GC orthologs, human genes are slightly higher in AT content than their mouse counterparts.

A question may be raised at this point: why, of all mammalian species, were only mouse and human chosen for the present study? Initially we intended to analyze the sequence divergence patterns between the orthologous coding regions of human, chimpanzee, and rhesus monkey. However, the numbers of nonsynonymous substitutions between orthologs of any two primate species were often too low to reveal any statistically significant trend. We therefore decided to analyze the trends in substitution patterns between a rodent and a primate species, and chose mouse and human as the representative species of the two lineages.

As already mentioned in Section 2, the trends reported here are robust enough to hold for any subset of the total datasets of orthologous sequences. Any trend in amino acid/nucleotide replacement between the pairs of orthologs of a particular dataset remains invariant, in general, when a subset of sequences is chosen randomly from that dataset. This indicates that the same trends are usually followed individually by each pair of orthologs in a particular group (high-, medium-, or low-GC group).

The trends in amino acid and nucleotide replacement patterns also remained the same when the orthologous sequences were classified into high-, medium-, and low-GC groups on the basis of the GC3 content of mouse genes instead of human genes. The same directionality was observed for the high- and low-GC groups: in the high-GC group, GC content either increases in human genes relative to mouse or decreases in mouse genes relative to human, whereas in the low-GC group there is either a relative decrease in GC content in human genes compared with mouse genes or a relative increase in GC content in mouse genes compared with human genes. This was, however, expected, as the two genome sequences exhibit a one-to-one correspondence in their local GC content.

The only significant trend common to all three groups of orthologs is (Asp)Mouse -> (Glu)Human. Surprisingly, the value of RDE is almost the same in all three groups, and the trend is also exhibited by subsets chosen randomly from the whole dataset of any particular compositional group. This indicates that this trend, in general, does not change with the compositional bias or functional characteristics of the genes. In accordance with this, the average frequency of Glu (7.01% for mouse and 7.11% for human) is significantly higher in human (p < 10⁻⁵) and that of Asp (4.90% for mouse and 4.81% for human) is significantly higher in mouse (p < 10⁻⁵). The structural consequence of this trend is, however, not clear.
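
To make the notion of a directional replacement such as (Asp)Mouse -> (Glu)Human concrete, the sketch below counts the replacements observed in an aligned mouse–human protein pair and compares each direction with its reverse. The exact definition of RDE is given earlier in the paper; here it is only assumed to behave like the ratio of Asp-to-Glu over Glu-to-Asp replacements. The function names and the toy alignment are hypothetical.

```python
from collections import Counter


def directional_counts(mouse_aln, human_aln):
    """Count (mouse residue, human residue) replacements at aligned sites."""
    if len(mouse_aln) != len(human_aln):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for m, h in zip(mouse_aln, human_aln):
        if m != h and m != "-" and h != "-":  # skip identities and gaps
            counts[(m, h)] += 1
    return counts


def direction_ratio(counts, a, b):
    """Ratio of a->b over b->a replacements, or None if b->a never occurs."""
    forward, reverse = counts[(a, b)], counts[(b, a)]
    return forward / reverse if reverse else None


# Hypothetical aligned fragments; '-' marks an alignment gap.
mouse = "MDDEDK-A"
human = "MEEDEKRA"
pair_counts = directional_counts(mouse, human)
ratio_de = direction_ratio(pair_counts, "D", "E")
```

In practice the counts would be pooled over all ortholog pairs of a compositional group before the ratio is taken, since a single pair rarely contains enough replacements of one kind.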

No significant differences could be observed between the synonymous or nonsynonymous substitution rates in the three groups of orthologs under study. This suggests that although the directionality of evolution in orthologs of the two extreme GC compositions is oppositely polarized, the rate at which they evolve is almost the same in both cases.

In a nutshell, the present study indicates that, in comparison with mouse, the coding regions of the human genome have experienced an expansion, not a shrinkage, in intra-species heterogeneity in local GC content. This observation, however, does not by itself imply a relative expansion of the human GC islands as a whole, since that would depend not only on the evolutionary trends of the coding regions, but also on those of the noncoding regions. One should also remember that a relative increase in GC heterogeneity in human orthologs compared with mouse orthologs does not necessarily imply an absolute increase in GC heterogeneity in human coding regions during evolution. In an absolute sense, both human and mouse might have evolved towards decreasing compositional heterogeneity, with the rate of decrease being lower in human than in mouse; alternatively, both species might be evolving towards increasing intra-species heterogeneity, with the rate of increase being higher in human than in mouse.


    Funding
Council of Scientific and Industrial Research (Project No. CMM 0017 to C.D. and S.G.); Department of Biotechnology, Government of India (BT/BI/04/055-2001 to S.K.B. and S.P.).


    Supplementary data
Supplementary data are available online at http://www.dnaresearch.oxfordjournals.org


    Acknowledgements
We are grateful to Dr. A. Pan, Indian Association for the Cultivation of Science, Kolkata, India, for critical reading of the manuscript.


    Footnotes
 
* To whom correspondence should be addressed. Tel. +91 33-2473-3491. Fax. +91 33-2473-0284. E-mail: cdutta{at}iicb.res.in

Edited by Hiroyuki Toh


    References

  1. Bernardi G. Isochores and the evolutionary genomics of vertebrates. Gene (2000) 241:3–17.
  2. Eyre-Walker A, Hurst L. D. The evolution of isochores. Nat. Rev. Genet. (2001) 2:549–555.
  3. Filipski J, Thiery J. P, Bernardi G. An analysis of the bovine genome by Cs2SO4-Ag density gradient centrifugation. J. Mol. Biol. (1973) 80:177–197.
  4. Hughes S, Zelus D, Mouchiroud D. Warm-blooded isochore structure in Nile crocodile and turtle. Mol. Biol. Evol. (1999) 16:1521–1527.
  5. Lander E. S, Linton L. M, Birren B, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
  6. Caron H, van Schaik B, van der Mee M, et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science (2001) 291:1289–1292.
  7. Fullerton S. M, Bernardo Carvalho A, Clark A. G. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol. (2001) 18:1139–1142.
  8. Kong A, Gudbjartsson D. F, Sainz J, et al. A high-resolution recombination map of the human genome. Nat. Genet. (2002) 31:241–247.
  9. Lercher M. J, Urrutia A. O, Pavlicek A, Hurst L. D. A unification of mosaic structures in the human genome. Hum. Mol. Genet. (2003) 12:2411–2415.
  10. Mouchiroud D, D'Onofrio G, Aissani B, Macaya G, Gautier C, Bernardi G. The distribution of genes in the human genome. Gene (1991) 100:181–187.
  11. Saccone S, De Sario A, Wiegant J, Raap A. K, Della Valle G, Bernardi G. Correlations between isochores and chromosomal bands in the human genome. Proc. Natl. Acad. Sci. USA (1993) 90:11929–11933.
  12. Waterston R. H, Lindblad-Toh K, Birney E, et al. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–562.
  13. Nei M, Glazko G. V. The Wilhelmine E. Key 2001 Invitational Lecture. Estimation of divergence times for a few mammalian and several primate species. J. Hered. (2002) 93:157–164.
  14. Glazko G. V, Koonin E. V, Rogozin I. B. Molecular dating: ape bones agree with chicken entrails. Trends Genet. (2005) 21:89–92.
  15. Arndt P. F, Petrov D. A, Hwa T. Distinct changes of genomic biases in nucleotide substitution at the time of Mammalian radiation. Mol. Biol. Evol. (2003) 20:1887–1896.
  16. Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N. Vanishing GC-rich isochores in mammalian genomes. Genetics (2002) 162:1837–1847.
  17. Smith N. G, Webster M. T, Ellegren H. Deterministic mutation rate variation in the human genome. Genome Res. (2002) 12:1350–1356.
  18. Webster M. T, Smith N. G, Ellegren H. Compositional evolution of noncoding DNA in the human and chimpanzee genomes. Mol. Biol. Evol. (2003) 20:278–286.
  19. Alvarez-Valin F, Clay O, Cruveiller S, Bernardi G. Inaccurate reconstruction of ancestral GC levels creates a ‘vanishing isochores’ effect. Mol. Phylogenet. Evol. (2004) 31:788–793.
  20. Gu J, Li W. H. Are GC-rich isochores vanishing in mammals? Gene (2006) 385:50–56.
  21. Vallender E. J, Paschall J. E, Malcom C. M, Lahn B. T, Wyckoff G. J. SPEED: a molecular-evolution-based database of mammalian orthologous groups. Bioinformatics (2006) 22:2835–2837.
  22. Penden J, Sharp P. M. CodonW (1997) v. 1.4.2.
  23. Aissani B, D'Onofrio G, Mouchiroud D, Gardiner K, Gautier C, Bernardi G. The compositional properties of human genes. J. Mol. Evol. (1991) 32:493–503.
  24. Bernardi G, Olofsson B, Filipski J, et al. The mosaic genome of warm-blooded vertebrates. Science (1985) 228:953–958.
  25. Thompson J. D, Higgins D. G, Gibson T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.
  26. Sharp P. M, Li W. H. The codon Adaptation Index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. (1987) 15:1281–1295.
  27. Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. (1986) 3:418–426.
  28. Singer G. A, Hickey D. A. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol. (2000) 17:1581–1588.
  29. Wang H. C, Singer G. A, Hickey D. A. Mutational bias affects protein evolution in flowering plants. Mol. Biol. Evol. (2004) 21:90–96.
  30. Takahashi N, Nakashima H. Negative correlation of G + C content at silent substitution sites between orthologous human and mouse protein-coding sequences. DNA Res. (2006) 13:135–140.
