DNA Research Advance Access published online on September 30, 2008
DNA Research, doi:10.1093/dnares/dsn023
Differential Selective Constraints Shaping Codon Usage Pattern of Housekeeping and Tissue-specific Homologous Genes of Rice and Arabidopsis
1 Bioinformatics Centre, Bose Institute, P 1/12, C.I.T. Scheme VII M, Kolkata 700 054, India
2 Biomedical Informatics Centre, National Institute of Cholera and Enteric Diseases, P- 33, CIT Scheme XM, Kolkata 700 010, India
Received 17 March 2008; accepted 2 September 2008.
| Abstract |
|---|
Intra-genomic variation between housekeeping and tissue-specific genes has long been a subject of interest in higher eukaryotes. To date, however, no such investigation has been carried out in plants. The availability of whole-genome expression data for both rice and Arabidopsis has made it possible to examine the evolutionary forces shaping codon usage patterns in both housekeeping and tissue-specific genes in plants. In the present work, we used 4065 rice–Arabidopsis homologous gene pairs to study the evolutionary forces responsible for codon usage divergence between housekeeping and tissue-specific genes. In both rice and Arabidopsis, it is mutational bias that regulates error minimization in highly expressed genes of both the housekeeping and tissue-specific classes. Our results show that, in comparison to tissue-specific genes, housekeeping genes are under stronger selective constraint in plants. However, among tissue-specific genes, lowly expressed genes are under stronger selective constraint than highly expressed genes. We demonstrate that constraint acting on mRNA secondary structure is responsible for modulating codon usage variation in rice tissue-specific genes. Thus, different evolutionary forces must underlie the evolution of synonymous codon usage of highly expressed housekeeping and tissue-specific genes in rice and Arabidopsis.
Key words: error minimization; housekeeping; mRNA folding energy; synonymous rates; tissue specific; tRNA copy number
| 1. Introduction |
|---|
The completed genome sequences of rice1
200 million years (My) ago, with increment in GC content of some rice genes.3
Conversely, all previous studies on housekeeping and tissue-specific genes have been carried out on the human genome. Rice, which like human is heterogeneous in base composition, has not been explored to date. The rice–Arabidopsis pair is a well-known model for studying codon usage divergence in plants.4,21
Availability of whole genome expression data for both rice and Arabidopsis has made it possible to examine the pattern of evolutionary forces shaping codon usage in housekeeping and tissue-specific genes of these two plants. In the present study, we have traced the pattern of evolutionary forces shaping codon usage in both housekeeping and tissue-specific genes of rice and Arabidopsis and discussed the presence of contrasting selective constraint affecting the evolution of these sets of genes.
| 2. Materials and methods |
|---|
2.1. Sequence data
The genomes of rice and Arabidopsis were downloaded from the Rice Genome Automated Annotation System (RiceGAAS; ftp://ftp.dna.affrc.go.jp/pub/RiceGAAS/current/) and The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org/), respectively. All sequences with <100 codons were excluded from our data set. Genes containing internal stop codons were also removed, leaving a data set of 18 658 rice genes for further analysis.
Homologous genes between the rice and Arabidopsis genomes were identified by gapped BLASTP searches with an expect-value cut-off of 10.0 × 10⁻⁶.22
Pairs of coding sequences with at least 30% amino acid positives and overlapping over at least 80% of their length were retained for the analysis. The maximum gap size allowed between a pair of sequences was 5%. Owing to the presence of many multi-copy genes in both Arabidopsis and rice, some sequences from one species showed high levels of sequence similarity with more than one sequence from the other species. In those cases, the sequence pairs that produced the higher degree of sequence similarity were retained.23
We also eliminated pseudogenes and mitochondrial proteins from the homologous gene set. The final data set consists of 4065 homologous gene pairs (Supplementary Table S1 lists the rice–Arabidopsis homologous gene pairs).
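The filtering criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used: the `Hit` record, its field names, the use of bit score as the measure of "sequence similarity", and the greedy one-to-one resolution of multi-copy genes are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One BLASTP hit (hypothetical record; field names are assumptions)."""
    query: str             # rice protein ID
    subject: str           # Arabidopsis protein ID
    evalue: float          # BLAST expect value
    positives_frac: float  # fraction of aligned positions scored as positives
    overlap_frac: float    # fraction of sequence length covered by the alignment
    gap_frac: float        # fraction of the alignment made up of gaps
    score: float           # bit score, used here as the similarity measure

def passes(hit, max_evalue=10.0e-6, min_pos=0.30, min_overlap=0.80, max_gap=0.05):
    """Apply the thresholds quoted in the text to a single hit."""
    return (hit.evalue <= max_evalue
            and hit.positives_frac >= min_pos
            and hit.overlap_frac >= min_overlap
            and hit.gap_frac <= max_gap)

def best_pairs(hits):
    """Keep the most similar pair when a gene matches several partners.

    Assumption: pairs are resolved greedily, highest score first, and each
    gene is used at most once.
    """
    kept = sorted((h for h in hits if passes(h)),
                  key=lambda h: h.score, reverse=True)
    used_query, used_subject, pairs = set(), set(), []
    for h in kept:
        if h.query not in used_query and h.subject not in used_subject:
            pairs.append((h.query, h.subject))
            used_query.add(h.query)
            used_subject.add(h.subject)
    return pairs
```

Other tie-breaking rules (for example, reciprocal best hits) would also satisfy the description in the text; the greedy rule is used here only because it is short.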
2.2. Expression profile
The public domain MPSS (massively parallel signature sequencing) expression data for rice24 (http://mpss.udel.edu/rice/) and Arabidopsis25 (http://mpss.udel.edu/at/) provide more accurate estimates of gene transcript levels and are easily accessible.25
The expression level of a gene expressed in a single library is estimated by counting the number of individual 17-base signature sequences representing each gene.26
It should be noted that the current MPSS data set for rice is based on the TIGR rice genome annotation. We retrieved the expression level of individual rice genes with RiceGAAS IDs using the Rice MPSS "Query by Sequence" tool, which extracts all possible tags from a sequence and compares them against the MPSS database. The expression level of a gene across expression libraries was estimated as the average of its expression values in all libraries considered (Supplementary Tables S2 and S3 contain library information). We sorted the expression values in each library in ascending order and then divided them into five groups, each containing 20% of the population.26
Individual genes were assigned an expression rank from 1 (low expression) to 5 (high expression) according to the increase in average expression level.
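As a rough illustration of the ranking step, the sketch below averages each gene's expression over libraries and assigns quintile ranks from 1 to 5. The function names and the input layout (a mapping from gene ID to per-library values) are assumptions made for the example, and ties are broken simply by sort order.

```python
from statistics import mean

def average_expression(library_values):
    """Average each gene's expression over all libraries.

    library_values: dict mapping gene ID -> list of expression values,
    one per library (hypothetical input layout).
    """
    return {gene: mean(values) for gene, values in library_values.items()}

def expression_ranks(avg_expr, n_groups=5):
    """Assign rank 1 (lowest) to n_groups (highest) by ascending average expression.

    Each group holds roughly 1/n_groups of the genes.
    """
    ordered = sorted(avg_expr, key=avg_expr.get)
    n = len(ordered)
    return {gene: (i * n_groups) // n + 1 for i, gene in enumerate(ordered)}
```

With ten genes, for instance, the two least expressed receive rank 1 and the two most expressed receive rank 5.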
Tissue specificity of a gene is measured using the tissue specificity index τ.27,28
The τ of gene i is defined by
[equation not preserved]
The τ value ranges from 0 to 1, with higher values indicating greater variation in expression level across tissues, that is, higher tissue specificity. If a gene is expressed in only one tissue, τ approaches 1. In contrast, if a gene is equally expressed in all tissues, τ = 0.
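For orientation, one common form of the tissue specificity index, the linear form of Yanai et al. (ref. 28), can be sketched as follows. This is an assumption about the general shape of the index rather than the exact equation used in this study, which may differ (for example, by log-transforming the expression signals). Here `expr` is a hypothetical list holding one gene's expression level in each of n tissues.

```python
def tau(expr):
    """Tissue specificity index, linear form (assumed, after Yanai et al.).

    tau = sum_j (1 - x_j / x_max) / (n - 1), where x_j is the expression
    in tissue j and x_max is the maximum over all n tissues.
    """
    n = len(expr)
    if n < 2:
        raise ValueError("at least two tissues are required")
    x_max = max(expr)
    if x_max <= 0:
        raise ValueError("gene is not expressed in any tissue")
    return sum(1 - x / x_max for x in expr) / (n - 1)
```

Under this form, a gene expressed equally in every tissue gives 0 and a gene expressed in a single tissue gives 1, which matches the limiting behavior described in the text.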
We assigned housekeeping and tissue-specific genes by sorting our data set (4065 rice–Arabidopsis homologous genes) in order of increasing τ value and taking the genes in the extreme 20% of the population at each end. Using these criteria, we obtained 787 housekeeping and 770 tissue-specific genes. All our analyses were performed using these 787 housekeeping and 770 tissue-specific genes of rice together with their corresponding counterparts in Arabidopsis (Supplementary Tables S4 and S5 contain rice–Arabidopsis housekeeping and tissue-specific homologous gene pairs).
2.3. Sequence analysis
Pair-wise synonymous (Ks) and non-synonymous (Ka) distances between the homologous genes of rice and Arabidopsis were calculated using the method of Yang and Nielsen.29
The genetic robustness at codon level has been measured using CUB available at http://users.ox.ac.uk/~zool0643/codon/CUB.html.30
According to this method, proposed by Archetti, we measured the dissimilarity ($D_{AA/AA^*}$) between the original (AA) and mutant (AA*) amino acid for each synonymous codon based on McLachlan's matrix of chemical similarity.31
The dissimilarity for a single amino acid (AA) is given by $D_{AA/AA^*} = S_{AA/AA} - S_{AA/AA^*}$, where $S_{AA/AA}$ is the similarity of the amino acid AA to itself and $S_{AA/AA^*}$ is the similarity of AA to the mutant amino acid AA* obtained after an error at one of the positions of the original codon. Since $S_{AA/AA} > S_{AA/AA^*}$ for every amino acid, $D_{AA/AA^*}$ is always positive, and since there are three possible mutants for each position, there are nine possible measures of $D_{AA/AA^*}$ for each codon, corresponding to the nine possible mutant codons. Their mean is taken as the mean distance (MD) between the original codon and its possible mutants. To calculate the degree of error minimization of a coding sequence, the correlation between the MD values and the corresponding relative synonymous codon usage (RSCU) is calculated for each synonymous family. If N is the number of degenerate synonymous codon families on which the correlation is calculated, and R is the sum of the correlations, the degree of error minimization is measured by RN = R/N (RN ranging between –1 and +1). RN measures genetic robustness under the assumption that all amino acids are weighted equally, irrespective of their frequency in the protein. If each correlation is weighted (multiplied) by the frequency of the corresponding amino acid, the measure is denoted wRN. Since MD is a measure of dissimilarity, the lower the value of RN and wRN, the higher the degree of error minimization.
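The calculation of RN and wRN described above can be sketched in Python, taking the per-codon MD values as given. This is a minimal sketch and not the CUB program: the function names, the input layout, and the choice to normalize the amino acid weights over the families actually used are assumptions, and computing MD itself would additionally require McLachlan's similarity matrix, which is not reproduced here.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns None when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def rscu(counts):
    """RSCU within one synonymous family: observed count / expected count."""
    total = sum(counts.values())
    k = len(counts)
    return {codon: k * n / total for codon, n in counts.items()}

def error_minimization(families, md):
    """Return (RN, wRN) for one coding sequence.

    families: amino acid -> {codon: count} for degenerate families
    md:       codon -> mean distance to its nine single-base mutants
    Assumption: weights are amino acid counts normalized over the
    families for which a correlation could be computed.
    """
    rs, weights = [], []
    for counts in families.values():
        total = sum(counts.values())
        if len(counts) < 2 or total == 0:
            continue
        usage = rscu(counts)
        codons = sorted(counts)
        r = pearson([md[c] for c in codons], [usage[c] for c in codons])
        if r is None:
            continue
        rs.append(r)
        weights.append(total)
    if not rs:
        raise ValueError("no usable synonymous family")
    rn = sum(rs) / len(rs)
    wrn = sum(r * w for r, w in zip(rs, weights)) / sum(weights)
    return rn, wrn
```

Note that for two-fold degenerate families the correlation can only be –1 or +1, so most of the gradation in RN comes from families with three or more codons.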
The Zipfold program, available at http://dinamelt.bioinfo.rpi.edu/zipfold.php, was used to predict the folding free energy of each native mRNA sequence.
The transfer RNA gene copy numbers needed to determine the major codons32 for each amino acid in rice were taken from Xiyin et al.,33 and the tRNA copy numbers for Arabidopsis were taken from http://lowelab.ucsc.edu/GtRNAdb/Athal/.
Student's t-test was used to evaluate the significance of all the pair-wise differences. The statistical tests were performed using the SPSS (13.0) package.
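For readers who want to reproduce such a comparison without SPSS, the two-sample t statistic can be computed as below. This sketch assumes the classical equal-variance (pooled) form of Student's test; the text does not state which variant was used, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). The P value would then be read from
    the t distribution, for example with a statistics package.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df
```

In practice `a` and `b` would hold, for instance, the per-gene Ks values of the housekeeping and tissue-specific classes.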
| 3. Results and discussion |
|---|
3.1. Influence of expression level in modulating synonymous substitution rates for both housekeeping and tissue-specific genes in rice
Analysis of synonymous substitution rates (Ks) between rice and Arabidopsis homologous gene pairs for the housekeeping and tissue-specific classes reveals that housekeeping genes are under stronger selective constraint, as shown by their significantly lower (P < 0.001) average synonymous substitution rate (Ks = 3.27) compared with tissue-specific genes (Ks = 3.45). A similar trend in evolutionary rates has been observed in earlier studies on mammalian genomes.15
3.2. Co-adaptation of synonymous codon usage with the tRNA pool of housekeeping and tissue-specific homologous genes in rice and in Arabidopsis
In an attempt to investigate the nature of selective constraint shaping synonymous codon usage of housekeeping and tissue-specific genes, we analyzed preferred codons in both the gene classes of rice (Table 2) and Arabidopsis (Table 3). Preferred codons are those that generally correspond to the most abundant tRNA species and they provide fitness benefits to highly expressed genes by enhancing translational efficiency.36
3.3. Selective constraint acting on mRNA secondary structure is responsible for regulating synonymous substitution rates in rice tissue-specific genes
It has already been demonstrated that there is a selection for local RNA secondary structures in coding regions and this nucleic acid structure resembles the folding profiles of the coded proteins.39
3.4. Mutational bias regulates error minimization in both rice and Arabidopsis homologous set
It is clear from our results that the selective constraint shaping synonymous codon usage differs between highly expressed housekeeping genes and highly expressed tissue-specific genes. It is therefore of interest to explore the evolutionary forces acting on synonymous codon usage to optimize the error minimization capacity of highly expressed housekeeping and tissue-specific genes in both plants. The genetic code evolved in such a way that it minimizes errors due to mutation and mistranslation: the theory of error minimization postulates that codons are arranged so as to reduce the effects of such errors.41,42
Thus, synonymous codons differ in their capacity to minimize the effects of errors due to mutation or mistranslation. In Drosophila melanogaster, the degree of error minimization is correlated with the degree of codon usage bias.43
Later, it was reported that the codon usage pattern of highly expressed genes in E. coli has been selected so that mistranslation has the minimum possible effect on the structure and function of the encoded proteins. Furthermore, according to Najafabadi et al.,44 the frequencies of codons in highly expressed genes that correspond to the highest tRNA copy numbers may have been under selection pressure for error minimization. For the rice genome, we calculated the error minimization capacity (wRN) of housekeeping and tissue-specific genes. We observed a significantly lower wRN (P < 0.001) for housekeeping genes (wRN = –0.3322) than for tissue-specific genes (wRN = –0.2458). This result indicates stronger selective constraint on the codon usage of housekeeping genes to achieve a greater degree of error minimization. We then compared wRN between highly and lowly expressed genes within the housekeeping and tissue-specific categories of the rice genome (Table 4). Highly expressed housekeeping genes showed significantly (P < 0.001) greater error minimizing capacity than lowly expressed housekeeping genes. Surprisingly, in tissue-specific genes, we observed no significant difference in error minimization between highly and lowly expressed genes in rice. Thus, selection on codon usage for error minimization has played hardly any role in distinguishing highly from lowly expressed tissue-specific genes. Our observations for housekeeping genes are consistent with previous findings that highly expressed genes have a strong preference for codons that minimize the effects of errors caused by mutation and mistranslation.30,44–47
We also performed the same analysis for Arabidopsis genes and observed that highly expressed genes in both the housekeeping and tissue-specific categories have significantly (P < 0.001) greater error minimizing capacity than lowly expressed genes (Table 5). Therefore, selection acting on synonymous codon usage to optimize the error minimization capacity of highly expressed genes influences housekeeping and tissue-specific homologous genes of Arabidopsis equally. It is noteworthy, however, that there is no significant difference in error minimizing capacity between highly expressed housekeeping and tissue-specific Arabidopsis genes. This discrepancy between translational selection driven by tRNA copy number and genetic robustness in both plants indicates that the error minimizing capacity of highly expressed genes does not depend on selection based on tRNA abundance in either rice or Arabidopsis, in contrast to what has been observed in E. coli.44,45
It is reasonable to infer from our results that the frequencies of codons in highly expressed genes that correspond to the highest tRNA copy numbers may not be under selection pressure for error minimization.
However, according to Archetti43
A correlation analysis was also performed between GC content and error minimization capacity of housekeeping genes in rice. A strong, significant negative correlation (Rs = –0.606, P < 0.001) was observed between the two. This leads us to conclude that, in plants, it is mutational bias that regulates error minimization in highly expressed genes.
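Assuming that Rs denotes Spearman's rank correlation coefficient (the text does not say so explicitly), the statistic can be sketched as a Pearson correlation computed on ranks, with tied values given their average rank. The function names are hypothetical, and the inputs would be per-gene GC content and wRN values.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)
```

A negative value, as reported above, means that genes with higher GC content tend to have lower wRN.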
3.5. Conclusion
In this work, we studied how selective constraints shape synonymous codon usage of housekeeping and tissue-specific homologous genes in both rice and Arabidopsis. We observed a difference in codon usage pattern between housekeeping and tissue-specific genes in both rice and Arabidopsis. Although previous studies on Drosophila and rodents favor a selectionist model for error minimization at the protein level,30 we demonstrated that mutational bias is responsible for the observed pattern of error minimization. We argue that error minimization at the protein level has taken a different course since the divergence of plants and animals. Moreover, our results show that housekeeping genes are under stronger selective constraint than tissue-specific genes. Translational selection driven by tRNA copy number is responsible for optimizing codon usage variation in housekeeping genes. In tissue-specific genes, by contrast, selection acting on mRNA secondary structural stability has a greater influence on codon usage variation. Lavner and Kotlar48 argued that selection may act on codon bias to reduce elongation rate by favoring non-optimal codons in lowly expressed genes. In the present study, the influence of mRNA secondary structural stability on codon usage variation of tissue-specific genes might be a consequence of favoring non-optimal codons in lowly expressed tissue-specific genes. Thus, our study suggests that the two sets of genes in rice and Arabidopsis (housekeeping and tissue specific) have evolved under contrasting evolutionary constraints.
| Supplementary Data |
|---|
Supplementary data are available online at www.dnaresearch.oxfordjournals.org.
| Funding |
|---|
The authors are thankful to the Department of Biotechnology, Government of India, for financial help.
| Acknowledgements |
|---|
The authors are also thankful to Dr Nakai Kenta and two anonymous reviewers for their constructive comments, which helped improve the manuscript.
| Footnotes |
|---|
* To whom correspondence should be addressed. Fax. +91 33-2355-3886. E-mail: tapash{at}boseinst.ernet.in
| References |
|---|
- International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature (2005) 436:793–800.
- The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature (2000) 408:796–815.
- Bernardi G. Structural and Evolutionary Genomics: Natural Selection in Genome Evolution (2004) Amsterdam, The Netherlands: Elsevier.
- Wang H. C., Hickey D. A. Rapid divergence of codon usage patterns within the rice genome. BMC Evol. Biol. (2007) 7:1–10.
- Montero L. M., Salinas J., Matassi G., Bernardi G. Gene distribution and isochore organization in the nuclear genome of plant. Nucleic Acids Res. (1990) 18:1859–1867.
- Carels N., Bernardi G. Two classes of genes in plants. Genetics (2000) 154:1819–1825.
- Guo X., Bao J., Fan L. Evidence of selectively driven codon usage in rice: implications for GC content evolution of Gramineae genes. FEBS Lett. (2007) 581:1015–1021.
- Wong G. K., Wang J., Tao L., et al. Compositional gradients in Gramineae genes. Genome Res. (2002) 12:851–856.
- Sharp P. M., Averof M., Lloyd A. T., Matassi G., Peden J. F. DNA sequence evolution: the sounds of silence. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. (1995) 349:241–247.
- Ponger L., Duret L., Mouchiroud D. Determinants of CpG islands: expression in early embryo and isochore structure. Genome Res. (2001) 11:1854–1860.
- D'Onofrio G. Expression patterns and gene distribution in the human genome. Gene (2002) 300:155–160.
- Vinogradov A. E. Isochores and tissue-specificity. Nucleic Acids Res. (2003) 31:5212–5220.
- Arhondakis S., Auletta F., Torelli G., D'Onofrio G. Base composition and expression level of human genes. Gene (2004) 325:165–169.
- Lercher M. J., Urrutia A. O., Pavlicek A., Hurst L. D. A unification of mosaic structures in the human genome. Hum. Mol. Genet. (2003) 12:2411–2415.
- Duret L., Mouchiroud D. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. (2000) 17:68–74.
- Hastings K. E. Strong evolutionary conservation of broadly expressed protein isoforms in the troponin I gene family and other vertebrate gene families. J. Mol. Evol. (1996) 42:631–640.
- Hughes A. L., Hughes M. K. Self peptides bound by HLA class I molecules are derived from highly conserved regions of a set of evolutionary conserved proteins. Immunogenetics (1995) 41:257–262.
- Zhang L., Li W. H. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. (2004) 21:236–239.
- Plotkin J. B., Robins H., Levine A. J. Tissue-specific codon usage and the expression of human genes. Proc. Natl. Acad. Sci. USA (2004) 101:12588–12591.
- Semon M., Lobry J. R., Duret L. No evidence for tissue-specific adaptation of synonymous codon usage in humans. Mol. Biol. Evol. (2006) 23:523–529.
- Mukhopadhyay P., Basak S., Ghosh T. C. Nature of selective constraints on synonymous codon usage of rice differs in GC-poor and GC-rich genes. Gene (2007) 400:71–81.
- Altschul S. F., Madden T. L., Schaffer A. A., et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
- Banerjee T., Gupta S. K., Ghosh T. C. Compositional transitions between Oryza sativa and Arabidopsis thaliana genes linked to the functional change of encoded proteins. Plant Sci. (2006) 170:267–273.
- Nakano M., Nobuta K., Vemaraju K., Tej S. S., Skogen J. W., Meyers B. C. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res. (2006) 34:D731–D735.
- Meyers B. C., Tej S. S., Vu T. H., et al. The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Res. (2004) 14:1641–1653.
- Ren X. -Y., Vorst O., Fiers M. W. E. J., Stiekema W. J., Nap P. In plants, highly expressed genes are the least compact. Trends Genet. (2006) 22:528–532.
- Liao B. Y., Zhang J. Low rates of expression profile divergence in highly expressed genes and tissue-specific genes during mammalian evolution. Mol. Biol. Evol. (2006) 23:1119–1128.
- Yanai I., Benjamin H., Shmoish M., et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics (2005) 21:650–659.
- Yang Z., Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. (2000) 17:32–43.
- Archetti M. Selection on codon usage for error minimization at the protein level. J. Mol. Evol. (2004) 59:400–415.
- McLachlan A. D. Tests for comparing related amino-acid sequences Cytochrome c and cytochrome c 551. J. Mol. Biol. (1971) 61:409–424.
- Kotlar D., Lavner Y. The action of selection on codon bias in the human genome is related to frequency, complexity, and chronology of amino acids. BMC Genom. (2006) 7:67.
- Xiyin W., Xiaoli S., Bailin H. The transfer RNA genes in Oryza sativa L. ssp. Indica. Sciences in China Series C (2002) 45:504–511.
- Berg O. G., Martelius M. Synonymous substitution-rate constants in Escherichia coli and Salmonella typhimurium and their relationship to gene expression and selection pressure. J. Mol. Evol. (1995) 41:449–456.
- Drummond D. A., Raval A., Wilke C. O. A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol. (2006) 23:327–37.
- Ikemura T. Transfer RNA in protein synthesis. Hatfield D. L., Lee B. J., Pirtle R. M., eds. (1992) Boca Raton, FL: CRC. 87–111.
- Duret L. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet. (2000) 16:287–289.
- Percudani R. Restricted wobble rules for eukaryotic genome. Trends Genet. (2001) 17:133–135.
- Biro J. C. Indications that "codon boundaries" are physico-chemically defined and that protein-folding information is contained in the redundant exon bases. Theor. Biol. Med. Model (2006) 3:28.
- Jia M., Li Y. The relationship among gene expression, folding free energy and codon usage bias in Escherichia coli. FEBS Lett. (2005) 579:5333–5337.
- Woese C. R. On the evolution of the genetic code. Proc. Natl. Acad. Sci. USA (1965) 54:1546–1552.
- Epstein C. J. Role of the amino-acid code and of selection for conformation in the evolution of proteins. Nature (1966) 210:25–28.
- Archetti M. Genetic robustness and selection at the protein level for synonymous codons. J. Evol. Biol. (2006) 19:353–365.
- Najafabadi H. S., Goodarzi H., Torabi N. Optimality of codon usage in Escherichia coli due to load minimization. J. Theor. Biol. (2005) 237:203–209.
- Najafabadi H. S., Lehmann J., Omidi M. Error minimization explains the codon usage of highly expressed genes in Escherichia coli. Gene (2007) 387:150–155.
- Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics (1991) 129:897–907.
- Akashi H. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics (1994) 136:927–935.
- Lavner Y., Kotlar D. Codon bias as a factor in regulating expression via translation rate in the human genome. Gene (2005) 345:127–138.