DNA Research Advance Access originally published online on October 17, 2006
DNA Research 2006 13(4):135-140; doi:10.1093/dnares/dsl007
© The Author 2006. Kazusa DNA Research Institute
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
Negative Correlation of G+C Content at Silent Substitution Sites Between Orthologous Human and Mouse Protein-Coding Sequences
Naoki Takahashi and
Hiroshi Nakashima*
Department of Clinical Laboratory Science, Graduate Course of Medical Science and Technology Division of Health Science, Kanazawa University, 5-11-80 Kodatsuno, Kanazawa 920-0942, Japan
Received 31 January 2006; revised 22 August 2006
Abstract
We conducted a genome-wide analysis of variations in guanine
plus cytosine (G+C) content at the third codon position at silent
substitution sites of orthologous human and mouse protein-coding
nucleotide sequences. Alignments of 3776 human protein-coding
DNA sequences with mouse orthologs having >50 synonymous
codons were analyzed, and nucleotide substitutions were counted
by comparing sequences in the alignments extracted from gap-free
regions. The G+C content at silent sites in these pairs of genes
showed a strong negative correlation (r = −0.93). Some
gene pairs showed significant differences in G+C content at
the third codon position at silent substitution sites. For example,
human thymine-DNA glycosylase was A+T-rich at the silent substitution
sites, while the orthologous mouse sequence was G+C-rich at
the corresponding sites. In contrast, human matrix metalloproteinase
23B was G+C-rich at silent substitution sites, while the mouse
ortholog was A+T-rich. We discuss possible implications of this
significant negative correlation of G+C content at silent sites.
Key words: G+C content variation; human–mouse orthologs; nucleotide substitutions in synonymous codons
The availability of complete mammalian genome sequences1–4
provides an opportunity to characterize nucleotide substitution
patterns among mammalian genomes by comparative sequence analysis.5–11
The G+C content in bacterial genomes varies among species from
25% to 75%, but it is relatively homogeneous for genes within
a given bacterial genome.12,13 However, the G+C content of genes
in a given mammalian genome varies considerably, because mammalian
genomes are a mosaic of long (over hundreds of kilobases) DNA
segments known as isochores.14 Some G+C-poor isochores have
G+C contents as low as 35%, while G+C-rich isochores have G+C
contents as high as 60%. It is known that the genes within a
given isochore are fairly homogeneous in G+C content. It has
been reported that some homologous mammalian genes that occupy
different chromosomal positions differ considerably in their
base composition and codon usage.14–16 Human α- and β-globin
genes are an example of this position-dependent variation. The
α-globin gene cluster occupies a G+C-rich region, while the β-globin
gene resides in a G+C-poor region.14
In comparing alignments of orthologous human and mouse sequences, we noted that silent substitution sites at the third codon position were biased toward G+C-rich or A+T-rich nucleotides. For example, human thymine-DNA glycosylase was A+T-rich at silent substitution sites, while the orthologous mouse sequence was G+C-rich at the corresponding sites. However, human matrix metalloproteinase 23B was G+C-rich at silent substitution sites, whereas the mouse ortholog was A+T-rich at those sites. Since complete human and mouse genome sequences are now available, we conducted a comparative genome-wide analysis of G+C content variation at silent substitution sites in orthologous human and mouse sequences.
1. Correlation of G+C Content at the Third Codon Position at Silent Substitution Sites
The G+C content at the third codon position in synonymous codons,
i.e. silent substitutions of the same amino acid, was determined
for both human and mouse sequences. For simplicity, only synonymous
codons that had an identical nucleotide at the first codon position
were considered. The G+C contents at the third codon position
at silent substitution sites in the 3776 pairs of human and
mouse genes are plotted in Fig. 1, showing a high negative correlation
(r = −0.93). The plot indicates distinct variations in
G+C content at mutual silent sites in many human and mouse orthologs.
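The per-gene calculation described above can be sketched in Python. This is a minimal illustration, not the authors' program: the names `silent_site_gc` and `pearson_r` are hypothetical, the inputs are assumed to be gap-free, in-frame aligned coding sequences written with T rather than U, and the standard genetic code is assumed.

```python
from itertools import product
from math import sqrt

# Standard genetic code, codons enumerated in TCAG order (an assumption here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def silent_site_gc(human_cds, mouse_cds):
    """G+C fraction at the third codon position over silent substitution sites.

    A site is counted when the two aligned codons differ, encode the same
    amino acid, and share the first nucleotide.
    Returns (human_gc, mouse_gc, n_sites), or None if no site qualifies.
    """
    gc_h = gc_m = n = 0
    usable = min(len(human_cds), len(mouse_cds))
    for i in range(0, usable - 2, 3):
        h = human_cds[i:i + 3].upper()
        m = mouse_cds[i:i + 3].upper()
        if h == m or h not in CODON_TABLE or m not in CODON_TABLE:
            continue  # identical codons, or a codon with an ambiguous base
        aa_h, aa_m = CODON_TABLE[h], CODON_TABLE[m]
        if aa_h != aa_m or aa_h == "*" or h[0] != m[0]:
            continue  # amino acid change, stop codon, or first position differs
        n += 1
        gc_h += h[2] in "GC"
        gc_m += m[2] in "GC"
    return (gc_h / n, gc_m / n, n) if n else None


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Applying `silent_site_gc` to every orthologous pair and passing the two resulting lists of G+C fractions to `pearson_r` would give a coefficient of the kind reported for Fig. 1. A caller could also filter pairs on `n_sites`; whether that count corresponds exactly to the ">50 synonymous codons" criterion is an assumption.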

Figure 1. G+C contents at silent sites in 3776 human protein-coding sequences versus G+C contents at silent sites in 3776 corresponding mouse sequences. The human and mouse cDNA sequences were obtained from Reference Sequence Release 11 from the U.S. National Center for Biotechnology Information (ftp://ftp.ncbi.nih.gov/refseq/). The protein-encoding nucleotide sequences were selected according to the feature table for the data. The amino acid sequences of 28 893 human cDNAs and 25 298 mouse cDNAs were obtained by translation. Orthologs were identified by the two-directional best hit approach using BLASTP.17 Pairs of a given sequence were selected if they showed >30% amino acid identity over three-fourths of the total length. To avoid bias, proteins showing >30% sequence identity with other proteins in the same species were excluded. Gap-free alignment regions longer than 100 amino acid residues and the corresponding DNA sequences were analyzed. Based on this criterion, 3776 pairwise alignments of human and mouse sequences that had >50 synonymous codons were chosen for analysis in this study.
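The two-directional best hit step in the legend can be sketched as follows. This is a hedged illustration only: the function `reciprocal_best_hits` and its input layout (each query ID mapped to its best hit, percent identity, and aligned fraction of the total length) are hypothetical, and the sketch does not parse BLASTP output or perform the exclusion of within-species similar proteins.

```python
def reciprocal_best_hits(h2m, m2h, min_identity=30.0, min_coverage=0.75):
    """Return (human_id, mouse_id) pairs that are each other's best hit.

    h2m and m2h map a query ID to a tuple
    (best_hit_id, percent_identity, aligned_fraction_of_length).
    The tuple layout is an assumption made for this sketch.
    """
    pairs = []
    for human_id, (mouse_id, identity, coverage) in h2m.items():
        back = m2h.get(mouse_id)
        if back is None or back[0] != human_id:
            continue  # not a two-directional best hit
        if identity > min_identity and coverage >= min_coverage:
            pairs.append((human_id, mouse_id))
    return pairs


# Toy example with made-up identifiers
h2m = {"H1": ("M1", 85.0, 0.90), "H2": ("M2", 25.0, 0.95), "H3": ("M1", 60.0, 0.80)}
m2h = {"M1": ("H1", 85.0, 0.90), "M2": ("H2", 25.0, 0.95)}
orthologs = reciprocal_best_hits(h2m, m2h)
```

In the toy data, H3 is dropped because M1's best hit is H1, and the H2/M2 pair is dropped because its identity is below the 30% threshold.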
In 2-fold degenerate codons, the equivalent third position nucleotides
are either two purines (A/G) or two pyrimidines (C/T). Therefore,
their silent substitutions always result in G+C content variation.
However, not every silent substitution in 4-fold degenerate
codons yields G+C content variation. To further examine G+C
content correlation at silent substitution sites, the G+C contents
in 4-fold degenerate codons were analyzed. The G+C contents
at silent substitution sites were determined for the eight sets
of 4-fold degenerate codons (Ala: GCN, Arg: CGN, Gly: GGN, Leu:
CUN, Pro: CCN, Ser: UCN, Thr: ACN and Val: GUN) in human and
mouse sequences. The plot of the G+C contents in 4-fold degenerate
sites between 2084 human and mouse orthologous sequences having
>50 4-fold degenerate codons had a high negative correlation
coefficient (r = −0.82; data not shown).
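Restricting the count to the eight 4-fold degenerate codon families can be sketched as below. This is an illustrative Python sketch under stated assumptions: the function name `fourfold_silent_gc` is hypothetical, sequences are gap-free and in frame, and codons are written with T in place of U.

```python
# Two-base prefixes of the eight 4-fold degenerate families listed in the
# text (Ala GCN, Arg CGN, Gly GGN, Leu CUN, Pro CCN, Ser UCN, Thr ACN,
# Val GUN), written in the DNA alphabet.
FOURFOLD_PREFIXES = {"GC", "CG", "GG", "CT", "CC", "TC", "AC", "GT"}


def fourfold_silent_gc(human_cds, mouse_cds):
    """G+C fraction at third positions of 4-fold codons that differ silently.

    Returns (human_gc, mouse_gc, n_sites), or None if no site qualifies.
    """
    gc_h = gc_m = n = 0
    usable = min(len(human_cds), len(mouse_cds))
    for i in range(0, usable - 2, 3):
        h = human_cds[i:i + 3].upper()
        m = mouse_cds[i:i + 3].upper()
        if h[:2] != m[:2] or h[:2] not in FOURFOLD_PREFIXES:
            continue  # different family, or not 4-fold degenerate
        if h[2] == m[2] or h[2] not in "ACGT" or m[2] not in "ACGT":
            continue  # no substitution, or ambiguous base
        n += 1
        gc_h += h[2] in "GC"
        gc_m += m[2] in "GC"
    return (gc_h / n, gc_m / n, n) if n else None
```

Unlike 2-fold sites, a site here can leave G+C status unchanged in both species (for example a G/C or A/T difference), which is the case the paragraph above singles out.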
2. Classification of Orthologous Sequences According to G+C Content at the Third Codon Position at Silent Substitution Sites
Human and mouse orthologous sequence pairs were divided into
three groups according to the G+C content at the third codon
position in synonymous substitution sites. In group (a), the
human gene had a much lower G+C content than that in the mouse
gene, in group (b) the human gene had a much higher G+C content
than that in the mouse gene, and in group (c) the human gene
had a G+C content similar to that in the mouse gene. The number
of genes in the three groups varied according to the criterion
of the classification. Using a cut-off of 30% lower G+C content
in the human gene than in the mouse, 25.4% (960/3776) of the
orthologous sequence pairs were classified in group (a). Based
on 30% higher G+C content in the human gene than in the mouse
gene, group (b) contained 17.2% (648/3776) of the orthologous
sequence pairs. Group (c) contained the remaining 57.4% (2168/3776)
of sequence pairs, which showed deviations between −30
and +30% in G+C content when human and mouse orthologous sequences
were compared.
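The three-way grouping can be sketched in Python. Two points are assumptions of this sketch rather than statements from the text: the 30% cut-off is treated as a difference in percentage points (human minus mouse), and a difference exactly at the cut-off is assigned to group (a) or (b). The function name `classify_pair` is hypothetical.

```python
def classify_pair(human_gc_pct, mouse_gc_pct, cutoff=30.0):
    """Assign an orthologous pair to group 'a', 'b' or 'c'.

    Inputs are G+C contents at silent substitution sites, in percent.
    """
    diff = human_gc_pct - mouse_gc_pct
    if diff <= -cutoff:
        return "a"  # human much lower than mouse
    if diff >= cutoff:
        return "b"  # human much higher than mouse
    return "c"      # similar G+C content


# Group sizes reported in the text, used to re-derive the percentages.
counts = {"a": 960, "b": 648, "c": 2168}
total = sum(counts.values())
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}
```

The reported counts sum to the 3776 analyzed pairs, and the derived shares match the percentages quoted above.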
Table 1 lists the proteins encoded by 10 well-characterized
genes in each of the three groups. Most of the gene pairs within
groups (a), (b) and (c) had different chromosomal locations.
However, some genes within the same group had the same chromosomal
locations. For example, in group (a), human potassium channel
tetramerization domain containing protein 3 and human ribosomal
protein S6 kinase are both located on human chromosome 1q41,
and both of the corresponding mouse genes are on mouse chromosome
1H6. In group (b), human agrin, human calcium binding protein
Cab45 precursor, and human matrix metalloproteinase 23B are
all located on human chromosome 1p36.33, and the corresponding
mouse genes are on mouse chromosome 4E2. These findings suggest
that the G+C content at some silent substitution sites might
be determined by their chromosomal locations. This finding is
consistent with the report by Bernardi et al.14
Table 1. Variations in G+C content at the third codon position in synonymous substitution sites (G+C III) for protein-coding genes
3. Frequencies of Substitution
Parts of the alignments for thymine-DNA glycosylase (human
gene: NM_003211.3 and mouse gene: NM_011561.1) and for matrix
metalloproteinase 23B (human gene: NM_006983.1 and mouse gene:
NM_011985.1) are shown in Fig. 2a and b, respectively. In Fig. 2,
nucleotides A and T at silent substitution sites are red, nucleotides
G and C at silent substitution sites are blue, and other nucleotides
are shown in yellow. Amino acids are shown along the DNA sequences.

Figure 2. Alignments of human and mouse sequences for (a) thymine-DNA glycosylase and (b) matrix metalloproteinase 23B.
Nucleotide substitutions were observed in 14% (738 506/5 401 758)
of the nucleotides contained within the 3776
pairs of orthologous human and mouse genes. Substitution frequencies
in codons were 18% at the first position, 12% at the second
and 70% at the third. Silent nucleotide substitutions with an
identical nucleotide at the first codon position accounted for
58% (425 945/738 506) of the total substitutions. The substitution
frequency of transitions (purine–purine and pyrimidine–pyrimidine
substitutions) was 66.1%, and that of transversions (purine–pyrimidine
and pyrimidine–purine substitutions) was 33.9%. Transitions
accounted for 72.1%, and transversions for 27.9%, of the total
silent nucleotide substitutions. Transitions were more frequent
than transversions at silent substitutions because transitions
at the third codon position are essentially silent.
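The transition/transversion distinction used in these counts can be sketched as follows. The function names `substitution_type` and `count_ts_tv` are hypothetical, and the inputs are assumed to be two gap-free aligned DNA strings of equal length.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_type(a, b):
    """Classify a nucleotide difference; None if identical or ambiguous."""
    a, b = a.upper(), b.upper()
    if a == b or a not in "ACGT" or b not in "ACGT":
        return None
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"    # purine-purine or pyrimidine-pyrimidine
    return "transversion"      # purine-pyrimidine in either direction


def count_ts_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        kind = substitution_type(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv
```

Dividing each count by their sum gives proportions comparable to the transition and transversion percentages quoted above.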
When a silent substitution was observed at an alignment site, a silent nucleotide substitution was assumed to have occurred once in either branch at that synonymous codon site since the divergence of the human and mouse lineages. Silent substitutions at synonymous codon sites between human and mouse sequences were estimated, with nucleotide substitutions considered from the human sequence. The frequencies of the four nucleotides A, T, C and G at the third codon position at silent substitution sites in the human sequence were expressed as f(A), f(T), f(C) and f(G). Let α and β be the transition and transversion substitution rates per year per site, and let T be the divergence time between human and mouse. Then, the nucleotide substitution frequencies at silent sites from human to mouse were calculated as shown in Table 2. The substitution frequency at G or C nucleotides in human silent sites is 2T(α+2β)·(f(G)+f(C)), and that in mouse silent sites is 2T(α+β)·(f(A)+f(T))+2Tβ·(f(G)+f(C)), which is equivalent to 2T(α+β)−2Tα·(f(G)+f(C)) because f(A)+f(T) = 1−(f(G)+f(C)). These nucleotide substitution frequencies were expressed as X and Y, respectively.
The relationship between X and Y is

Y = 2T(α+β) − [α/(α+2β)]·X

This equation indicates that Y increases when X decreases and Y decreases
when X increases. This result indicates a negative correlation
in the variation of G+C content at silent sites between the
two DNA sequences.
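A small numerical sketch can make the inverse relationship concrete. It evaluates the two expressions given in the text for X (G or C nucleotides in human silent sites) and Y (mouse silent sites) over a range of human G+C fractions. The function name `expected_frequencies` and the values chosen for the transition rate, transversion rate and divergence time are hypothetical and serve only as an illustration.

```python
def expected_frequencies(gc_human, alpha, beta, T):
    """Evaluate X and Y from the expressions in the text.

    gc_human is f(G)+f(C) at human silent sites; alpha and beta are the
    transition and transversion rates per year per site; T is the
    divergence time in years.
    """
    at_human = 1.0 - gc_human  # f(A)+f(T)
    x = 2 * T * (alpha + 2 * beta) * gc_human
    y = 2 * T * (alpha + beta) * at_human + 2 * T * beta * gc_human
    return x, y


# Hypothetical parameter values, for illustration only.
ALPHA, BETA, T = 2e-9, 1e-9, 8e7

rows = [(g, *expected_frequencies(g, ALPHA, BETA, T)) for g in (0.2, 0.5, 0.8)]
for g, x, y in rows:
    print(f"G+C = {g:.1f}  X = {x:.3f}  Y = {y:.3f}")
```

As the human G+C fraction rises, X increases and Y decreases, whatever positive values are chosen for the three parameters.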
4. The Implications of Substitutions at Silent Sites
Because substitutions at silent sites in codons do not change
amino acids, no effect on proteins would be expected, and these
substitutions are commonly thought of as being evolutionarily
neutral. However, substitutions at silent sites do alter codon
usage. Grantham reported that synonymous codons are used differently
by different organisms,18 and Ikemura found a strong positive
correlation between codon usage and tRNA content in unicellular
organisms.19 The codon-choice patterns of genes are often very
different among multicellular eukaryotes, and codon usage in
mammals is known to have dramatic effects on the translation
rate.20 Our findings on the differential codon usage between
human and mouse genes suggest the possibility of different expression
patterns.
Evidence indicates that genes with a high G+C content at the third codon position are usually surrounded by long G+C-rich genomic sequences, while those with a low G+C content at the third position are usually surrounded by long A+T-rich sequences.19,21 Human–mouse genome sequence comparisons demonstrated a large number of rearrangements of conserved syntenic segments.2,22 Since human and mouse genomes exhibit large variations in G+C content (e.g. isochores),14 the rearrangements might produce a large deviation in G+C content between human and mouse genes by changing the surrounding sequences. The gene pairs classified into groups (a) and (b), which showed a large variation in G+C content at silent substitution sites, are considered to be products of the rearrangements of syntenic segments. The genes located in an identical syntenic segment exhibited similar G+C content variation. Gene rearrangements could be the cause of large variations in G+C content at silent substitution sites, and might lead to a significant negative correlation.
Nucleotide substitutions between human and mouse sequences have accumulated during evolution ever since their divergence from a common ancestor. It is generally assumed that the substitutions occurred independently in the two species, and there would seem to be no connection between the silent nucleotide substitutions in humans and mice. The results of this study raise the question of whether correlated substitutions at silent sites might have an evolutionary function. Further study is needed to address this issue.
Acknowledgements
We are grateful to one anonymous referee and an editor for constructive
suggestions regarding negative correlation.
Footnotes
*To whom correspondence should be addressed. Tel. +81-76-265-2582, Fax. +81-76-234-4360, E-mail: naka{at}kenroku.kanazawa-u.ac.jp
Communicated by Hiroyuki Toh
References
1. International Human Genome Sequencing Consortium. 2001, Initial sequencing and analysis of the human genome, Nature, 409, 860–921.
2. Mouse Genome Sequencing Consortium. 2002, Initial sequencing and comparative analysis of the mouse genome, Nature, 420, 520–562.
3. Rat Genome Sequencing Project Consortium. 2004, Genome sequence of the Brown Norway Rat yields insights into mammalian evolution, Nature, 428, 493–521.
4. The Chimpanzee Sequencing and Analysis Consortium. 2005, Initial sequence of the chimpanzee genome and comparison with the human genome, Nature, 437, 69–87.
5. Castresana, J. 2002, Genes on human chromosome 19 show extreme divergence from the mouse orthologs and a high GC content, Nucleic Acids Res., 30, 1751–1756.
6. Hardison, R. C., Roskin, K. M., Yang, S., et al. 2003, Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution, Genome Res., 13, 13–26.
7. Yang, S., Smit, A. F., Schwartz, S., et al. 2004, Patterns of insertions and their covariation with substitutions in the rat, mouse, and human genomes, Genome Res., 14, 517–527.
8. Cooper, G. M., Brudno, M., Stone, E. A., Dubchak, I., Batzoglou, S., Sidow, A. 2004, Characterization of evolutionary rates and constraints in three mammalian genomes, Genome Res., 14, 539–548.
9. Zhang, Z. and Gerstein, M. 2003, Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes, Nucleic Acids Res., 31, 5338–5348.
10. Subramanian, S. and Kumar, S. 2003, Neutral substitutions occur at a faster rate in exons than in noncoding DNA in primate genomes, Genome Res., 13, 838–844.
11. Clark, A. G., Glanowski, S., Nielsen, R., et al. 2003, Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios, Science, 302, 1960–1963.
12. Muto, A. and Osawa, S. 1987, The guanine and cytosine content of genomic DNA and bacterial evolution, Proc. Natl Acad. Sci. USA, 84, 166–169.
13. Lawrence, J. G. and Ochman, H. 1997, Amelioration of bacterial genomes: rates of change and exchange, J. Mol. Evol., 44, 383–397.
14. Bernardi, G., Olofsson, B., Filipski, J., et al. 1985, The mosaic genome of warm-blooded vertebrates, Science, 228, 953–958.
15. Nadeau, J. H. and Taylor, B. A. 1984, Lengths of chromosomal segments conserved since divergence of man and mouse, Proc. Natl Acad. Sci. USA, 81, 814–818.
16. Mouchiroud, D. and Gautier, C. 1988, High codon-usage changes in mammalian genes, Mol. Biol. Evol., 5, 192–194.
17. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. 1990, Basic local alignment search tool, J. Mol. Biol., 215, 403–410.
18. Grantham, R. 1980, Workings of the genetic code, Trends Biochem. Sci., 5, 327–331.
19. Ikemura, T. 1985, Codon usage and tRNA content in unicellular and multicellular organisms, Mol. Biol. Evol., 2, 13–34.
20. Zolotukhin, S., Potter, M., Hauswirth, W. W., Guy, J., Muzyczka, N. 1996, A "humanized" green fluorescent protein cDNA adapted for high-level expression in mammalian cells, J. Virol., 70, 4646–4654.
21. Ikemura, T. and Wada, K. 1991, Evident diversity of codon usage patterns of human genes with respect to chromosome banding patterns and chromosome numbers; relation between nucleotide sequence data and cytogenetic data, Nucleic Acids Res., 19, 4333–4339.
22. Bourque, G., Pevzner, P. A., Tesler, G. 2004, Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes, Genome Res., 14, 507–516.
