

DNA Research Advance Access originally published online on October 30, 2009
DNA Research 2009 16(6):371-383; doi:10.1093/dnares/dsp022

© The Author 2009. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Identification and Functional Analyses of 11 769 Full-length Human cDNAs Focused on Alternative Splicing

Ai Wakamatsu1, Kouichi Kimura2, Jun-ichi Yamamoto3, Tetsuo Nishikawa3, Nobuo Nomura4, Sumio Sugano5 and Takao Isogai1,3,*

1 Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
2 Central Research Laboratory, Hitachi, Ltd, Kokubunji, Tokyo 185-8601, Japan
3 Reverse Proteomics Research Institute, 1-9-11 Kaji, Chiyoda-ku, Tokyo 101-0044, Japan
4 National Institute of Advanced Industrial Science and Technology, 2-41-6 Aomi, Koto-ku, Tokyo 135-0064, Japan
5 Department of Medical Genome Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 4-6-1 Shiroganedai, Minato-ku, Tokyo 108-8639, Japan

Received 25 August 2009; accepted 1 October 2009.


    Abstract
 Top
 Abstract
 1. Introduction
 2. Materials and methods
 3. Results and discussion
 Supplementary data
 Funding
 Acknowledgements
 References
 
We analyzed the diversity of mRNAs produced by alternative splicing in order to evaluate gene function. First, we predicted the number of human genes transcribed into protein-coding mRNAs by using the sequence information of full-length cDNAs and 5'-ESTs, and identified 23 241 such genes. Next, using these genes, we analyzed mRNA diversity and sequenced and identified 11 769 human full-length cDNAs whose predicted open reading frames differed from those of other known full-length cDNAs. Notably, 30% of the cDNAs we identified contained variation in the transcription start site (TSS). Our analysis, which focused particularly on multiple variable first exons (FEVs) formed by the alternative utilization of TSSs, led to the identification of 261 FEVs expressed in a tissue-specific manner. Quantification of the expression profiles of 13 genes by real-time PCR further confirmed the tissue-specific expression of FEVs; for example, OXR1 had a specific TSS in brain and tumor tissues. Finally, based on the results of our mRNA diversity analysis, we created the FLJ Human cDNA Database. Our results shed light on the mechanisms by which a single gene produces protein-coding transcripts suited to the situation and the environment.

Key words: full-length cDNA; alternative splicing; alternative transcription start site; mRNA diversity; tissue-specific expression


    1. Introduction
 
One of the most interesting findings of the Human Genome Project is that the human genome contains only 20 000–25 000 protein-coding genes.1 This number is unexpectedly small. To explain this result and to understand the functions of genes, it is necessary to analyze mRNA diversity.

Biologically, multiple transcripts can be generated from a single gene by alternative splicing (AS). According to several reports on genome research, AS occurs in 30–60% of human genes.2–5 It has been reported that AS of a single gene can produce transcripts coding for multiple proteins, each exhibiting different biochemical properties, including binding, intracellular localization and regulation of enzymatic activities.6 AS is also of interest to pharmaceutical research because unwanted AS of genes can lead to various genetic diseases and cancers.7 We have particularly focused on the analysis of AS patterns that are produced by utilizing alternative transcription start sites (TSSs). Indeed, multiple transcripts have been shown to be produced from a gene by utilizing variable TSSs.8,9 For example, the Pcdh gene, which contains variable TSSs, was shown to produce different transcripts;10 similarly, UGTs (UDP-glucuronosyltransferases) contain more than 10 TSSs.11 From these findings, it is clear that to elucidate gene function, we have to further our knowledge and understanding of all transcripts made from each gene, particularly the protein-coding transcripts. However, identification of all protein-coding transcripts has so far been difficult because a large amount of the EST data accumulated in the databases is 3'-EST data, which were obtained by sequencing cDNAs from the polyA end. Thus, even though the sequences of a large number of mRNAs are already known, our understanding of these mRNAs has remained incomplete because of the fragmentary nature and 3'-end bias of their sequences. Because of this lack of sequence information, it has been difficult to predict TSSs and to identify all the open reading frame (ORF) regions. Although the use of next-generation sequencers has helped to advance the analysis of TSSs, it remains extremely difficult to evaluate the diversity of mRNAs transcribed from each gene because these sequencers accumulate short sequences (less than 50 bases) of cDNA clones.12,13

We sequenced ~55 000 human full-length cDNAs, including the 11 769 newly identified cDNAs described in this paper, and also obtained ~1.45 million 5'-end one-pass sequences (5'-ESTs).14–17 We believe that these cDNA sequences are very useful for analyzing the diversity of protein-coding transcripts and will contribute to our understanding of mRNA. First, our cDNA clones were isolated from full-length human cDNA libraries constructed by an optimized oligo-capping method; therefore, by utilizing their sequence information, we were able to identify the TSS with 90% or better accuracy.14,18–20 Thus, we could easily and accurately identify the TSSs of even low-expressing genes, which until now has required comparison of a large amount of data.17 Second, our 5'-EST data contained, on average, sequence information of ~500 bases per cDNA clone, which covered two or more exons. Since the average length of the 5'-untranslated region is believed to be 125 bases,21 it was possible to predict ORF regions using our 5'-EST data. Finally, and most importantly, all of our resources were obtained from full-length cDNAs that include both the TSS and the polyA site. Moreover, we could obtain various findings on protein expression from our full-length cDNAs.16 These findings could not be obtained from the sequences of short mRNA fragments. Since AS of genes can potentially create a large number of protein-coding transcripts, analyzing full-length cDNAs may be immensely valuable for understanding gene function.

Here, we report our analysis of 11 769 full-length cDNAs, which were identified from our full-length cDNA libraries and contained ORFs that differed from those of known full-length cDNAs as a result of AS. We also present our analysis of the splice patterns and expression profiles of the identified cDNAs to explore the correlation between mRNA diversity and gene function. Furthermore, we describe 261 full-length cDNAs with unique TSSs, known as multiple variable first exons (FEVs), and report their expression profiles. Finally, we report the establishment of the FLJ Human cDNA Database based on the results of our analysis of the variable protein-coding transcripts generated from each gene by AS.


    2. Materials and methods
 
2.1. Construction of full-length cDNA libraries
Most total RNAs isolated from various tissues and cells were purchased from Clontech and Ambion. Cells were cultured following established protocols, and cytoplasmic total RNAs were extracted from these cultured cells by a standard RNA purification method. The total RNAs used in this study are listed in Supplementary Table S1. We constructed cDNA libraries from total RNAs by an optimized oligo-capping method (a detailed method for the optimized oligo-capping is provided in Supplementary Method 1).18,19 Briefly, total RNAs were treated with bacterial alkaline phosphatase (TaKaRa) and tobacco acid pyrophosphatase. The total RNAs were then ligated to the oligo-RNA using RNA ligase (TaKaRa). Oligo-capped polyA(+) RNAs were then isolated using oligo-dT columns. The first-strand cDNAs were synthesized using Superscript II reverse transcriptase (Invitrogen), the synthesized cDNAs were amplified using the Gene Amp XL PCR kit (ABI), and the amplified product was digested with the restriction enzyme SfiI. Fragments longer than 2 kb were selected and purified by agarose gel electrophoresis and cloned into the DraIII-digested pME18SFL3 vector following standard methods. The 5'-end one-pass sequences of cloned cDNAs were determined using ABI 377 and 3700 sequencers (ABI). The 5'-end fullness rate of the constructed oligo-capped cDNA libraries was evaluated as described previously,22,23 and a detailed method for determining the 5'-end fullness rate is provided in Supplementary Method 2.

2.2. Genome mapping and clustering
The 5'- and 3'-end cDNA sequences and the full-length cDNA sequences (Supplementary Table S2) were mapped onto the human genome (UCSC hg18, NCBI Build 36.1). Possible local alignments between the cDNA and genome sequences were identified by using the NCBI Mega BLAST program (ftp://ftp.ncbi.nih.gov/blast/). For each cDNA, the best mapping of the sequence was determined from these local alignments using a dynamic programming technique that optimized the identity, coverage and topology of exons. The junctions between consecutive local alignments were refined so as to restore the consensus sequences at the canonical splice sites. On the basis of the mapping results, clustering of cDNA sequences was performed as follows: two cDNA sequences were grouped into the same cluster if their mapped positions shared at least one base on the genome. In general, each cluster corresponded to a single gene locus.
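
The clustering rule in this subsection (two cDNAs join the same cluster if their mapped positions share at least one base) amounts to single-linkage grouping of genomic intervals. A minimal Python sketch follows. The tuple layout `(chrom, start, end, cdna_id)`, the inclusive coordinates, the function name `cluster_cdnas` and the decision to ignore strand are assumptions made for illustration and are not taken from the authors' pipeline.

```python
from collections import defaultdict


def cluster_cdnas(blocks):
    """Group cDNAs whose mapped blocks share at least one genomic base.

    blocks: iterable of (chrom, start, end, cdna_id) with inclusive
    coordinates (hypothetical layout). A cDNA may contribute several
    blocks, e.g. one per exon. Returns a sorted list of sorted ID lists.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Sweep blocks in genomic order, tracking the right edge of the
    # current run of overlapping blocks.
    cur_chrom, cur_end, cur_id = None, -1, None
    for chrom, start, end, cdna_id in sorted(blocks):
        find(cdna_id)  # register the cDNA even if it overlaps nothing
        if chrom == cur_chrom and start <= cur_end:
            union(cur_id, cdna_id)
            cur_end = max(cur_end, end)
        else:
            cur_chrom, cur_end, cur_id = chrom, end, cdna_id

    clusters = defaultdict(set)
    for cdna_id in list(parent):
        clusters[find(cdna_id)].add(cdna_id)
    return sorted(sorted(members) for members in clusters.values())


example = [
    ("chr1", 100, 200, "a"),
    ("chr1", 150, 300, "b"),
    ("chr1", 300, 310, "e"),
    ("chr1", 301, 400, "c"),
    ("chr2", 100, 200, "d"),
]
print(cluster_cdnas(example))
```

Because grouping is transitive, `a` and `c` end up in one cluster even though they do not overlap directly; they are linked through `b` and `e`. This is consistent with the statement that each cluster generally corresponds to a single gene locus, but it also means that overlapping neighboring genes would be merged under this simplified rule.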

2.3. Identification of alternatively spliced variants of mRNAs
On the basis of the results of the genome mapping and clustering analysis, ESTs that contained regions differing from known full-length cDNAs as a result of AS were selected using Intris, a viewer for cDNA-genome alignments used for the analysis of splicing variants and expression profiles.24 To exclude cDNA fragments derived from immature mRNA and genomic DNA, the reliability of each mRNA was evaluated by using not only the human EST data but also conservation data from other animals (Phastcons; obtained from the UCSC Genome Browser). We predicted the ORF regions from the 5'-end sequences of full-length cDNAs for the selected ESTs by using ATGpr (http://flj.lifesciencedb.jp/top/).25 Next, we excluded ESTs from the analytical targets when their predicted ORF regions were the same as the ORF regions of known full-length cDNAs. In addition, even if the predicted ORF regions were different from those of known full-length cDNAs, we excluded cDNA clones containing extremely short ORF regions (mostly 60 amino acids or less) compared with the other full-length cDNAs that mapped to the same locus of the human genome. The selected cDNAs were further sequenced by the primer walking method using an ABI3700 sequencer (ABI) to obtain information on 500 additional bases, and the ORF regions were predicted again by using ATGpr.25 We also evaluated the predicted ORF regions by using TRis,26 a translated region inspector, and examined the novelty of their amino acid sequences by using ALVISION,27 which aligns two cDNA sequences that are splicing variants while allowing large gaps. When the reliability of the predicted ORF region was insufficient, we excluded the cDNA from our list of analytical targets. When the predicted ORF regions of the selected cDNAs were judged reliable and different from those of the known full-length cDNAs, we sequenced the full-length cDNA clone all the way up to the stop codon. Consequently, we completely sequenced 11 769 full-length FLJ cDNAs and analyzed their tissue-specific expression. A detailed method for the analysis of the tissue-specific expression of the cDNAs is provided in Supplementary Method 3. We have also constructed the FLJ Human cDNA Database (http://flj.lifesciencedb.jp), which contains this sequence information. A detailed method for the analysis of AS by using the information available in the FLJ Human cDNA Database is provided in Supplementary Method 4. The sequences of our 11 769 full-length cDNAs were also deposited in the DDBJ/GenBank/EMBL databases (AK293122–AK304890).
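
The two ORF-based exclusion rules described in this subsection (drop a candidate whose predicted ORF matches a known ORF, and drop one whose ORF is extremely short) can be sketched as a simple filter. This is only an illustration: the function name `keep_candidate`, the representation of ORFs as amino acid strings, and the fixed cutoff of 60 residues are assumptions. The text says "mostly 60 amino acids or less" relative to other cDNAs at the same locus, so the actual criterion was evidently not a strict threshold, and the later reliability checks with the ORF inspection tools are not modeled here.

```python
def keep_candidate(predicted_orf, known_orfs, min_aa=60):
    """Return True if a candidate clone passes both exclusion rules.

    predicted_orf: predicted protein sequence (str), hypothetical input.
    known_orfs: set of protein sequences of known full-length cDNAs
        mapped to the same locus.
    min_aa: assumed hard cutoff standing in for "extremely short".
    """
    # Rule 1: the ORF must differ from every known ORF at the locus.
    if predicted_orf in known_orfs:
        return False
    # Rule 2: discard extremely short ORFs (assumed: <= min_aa residues).
    if len(predicted_orf) <= min_aa:
        return False
    return True


known = {"M" + "A" * 99}
candidates = {
    "same_as_known": "M" + "A" * 99,
    "too_short": "M" + "K" * 50,
    "novel": "M" + "K" * 120,
}
kept = [name for name, orf in candidates.items() if keep_candidate(orf, known)]
print(kept)
```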

2.4. Functional analysis of full-length cDNAs in silico
Sequences of cDNAs were analyzed for signal sequences, transmembrane domains and motifs in the encoded proteins by using SignalP ver. 3.0 (http://www.cbs.dtu.dk/services/SignalP/), SOSUI ver. 1.5 (Mitsui Knowledge Industry) and Pfam 19.0 (November 2005; http://pfam.sanger.ac.uk/), respectively. We obtained information on motifs showing E-values of e-30 or more from the Pfam analysis, and based on these results, we categorized each cDNA and the corresponding gene according to its gene ontology (GO) (http://www.geneontology.org/) classification by using InterPro (http://www.ebi.ac.uk/interpro/).

2.5. Quantitative real-time PCR analysis
Total RNAs derived from various tissues were purchased from Clontech, Ambion and STRATAGENE (listed in Supplementary Table S4). From 10 µg of each total RNA, first-strand cDNAs were synthesized using random primers and Superscript III reverse transcriptase (Invitrogen) following the manufacturer's instructions. Real-time PCR was performed using TaqMan Universal Master Mix (ABI) or SYBR Master Mix (ABI) on an ABI Fast7500 System (ABI) according to the manufacturer's instructions. Approximately 300 ng of template cDNA was used in each PCR reaction. Probes and primers were designed using Primer Express 3.0 (ABI) (see Supplementary Table S5 for the list of primers). The expression levels of genes were normalized with respect to that of human GAPDH, and the expression values of individual genes were calculated by comparing their Ct values with that of the control using the RQ software (ABI). The expression levels of genes are presented on a log10 scale. Samples were run in duplicate, and the data shown are the average of two experiments.
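
The normalization described in this subsection can be illustrated with the widely used comparative Ct calculation. The text states only that Ct values were compared with the control using the RQ software, so the formula below (relative quantity = 2^-ΔΔCt, which assumes roughly 100% amplification efficiency) is an assumption about what that comparison involves, and the function names and Ct values are hypothetical.

```python
import math


def mean_ct(duplicates):
    """Average Ct over replicate wells (the text reports duplicate runs)."""
    return sum(duplicates) / len(duplicates)


def relative_quantity(ct_target, ct_gapdh, ct_target_ctrl, ct_gapdh_ctrl):
    """Comparative Ct (2^-ddCt) estimate; assumes ~100% PCR efficiency."""
    d_ct_sample = ct_target - ct_gapdh          # normalize sample to GAPDH
    d_ct_control = ct_target_ctrl - ct_gapdh_ctrl  # normalize control to GAPDH
    dd_ct = d_ct_sample - d_ct_control
    return 2.0 ** (-dd_ct)


def log10_expression(ct_target, ct_gapdh, ct_target_ctrl, ct_gapdh_ctrl):
    """Relative quantity on a log10 scale, as used for plotting."""
    return math.log10(
        relative_quantity(ct_target, ct_gapdh, ct_target_ctrl, ct_gapdh_ctrl)
    )


# Hypothetical Ct values: a tissue sample versus a control sample.
rq = relative_quantity(
    ct_target=mean_ct([25.1, 24.9]),
    ct_gapdh=mean_ct([18.0, 18.0]),
    ct_target_ctrl=mean_ct([28.0, 28.0]),
    ct_gapdh_ctrl=mean_ct([18.0, 18.0]),
)
print(rq, math.log10(rq))
```

In this made-up example the target crosses threshold three cycles earlier than in the control after GAPDH normalization, which corresponds to about an eightfold higher relative quantity.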


    3. Results and discussion
 
3.1. Identification of human genes
It is known that AS can produce mRNA diversity.2–6 However, to analyze mRNA diversity, it is first necessary to identify human genes (i.e. the genome loci from which protein-coding mRNAs are transcribed). We obtained 1.45 million human full-length cDNAs and sequenced their 5'-ends. We previously selected ~30 000 cDNAs from these full-length cDNAs based on novelty analysis and completely sequenced them.14–16 Later, we also selected ~25 000 cDNAs based on mRNA diversity and sequenced them completely. To identify human genes, we used the sequence information on these 55 000 full-length human cDNAs, including the 11 769 cDNAs reported in this paper (Supplementary Table S2). Furthermore, we used not only our own data but also data from 52 000 full-length human cDNA sequences available from the public databases, 30 000 human RefSeq sequences (NCBI Reference Sequences; http://www.ncbi.nlm.nih.gov/RefSeq/) and 48 000 Ensembl human gene transcripts (http://www.ensembl.org/index.html). In addition, we used EST sequences obtained by us and from other public databases (Supplementary Table S2). All the sequence data we collected were mapped onto the human genome and clustered. We then examined the reliability of each full-length cDNA with Intris24 using the sequences of all full-length cDNAs and ESTs mapped to the same locus of the genome, and based on this analysis, we selected only the reliable cDNAs for the gene identification analysis. We determined the genome locus of each of the selected reliable cDNAs and manually checked them one by one to identify the corresponding gene. As a result, we identified 23 241 human genes (Fig. 1A). Each gene cluster was classified into one of three categories based on its reliability score. The number of genes in the high-reliability category (high category) was 16 754, which accounted for 72% of the total number of genes. The genes in the high category were considered to be already analyzed because the genome locus was covered by sequence information available from all three types of databases: the human full-length cDNAs, RefSeq and Ensembl. The number of genes with intermediate reliability (medium category) was 2854. For the medium category, the genome locus was covered by sequence information available from only the human full-length cDNAs or from two of the three above-mentioned databases. The number of genes with low reliability (low category) was 3633. For the low category, the gene locus was covered by sequence information available only from RefSeq or Ensembl.
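
The three reliability categories are defined by which data sources cover a locus, so the rule can be written as a small decision function. The sketch below is a reading of the description in this subsection, not the authors' implementation (the text notes that classification was also checked manually); the function name `classify_locus` and the boolean inputs are hypothetical. The category counts are those reported above.

```python
def classify_locus(flcdna, refseq, ensembl):
    """Assign a reliability category from database coverage of a locus.

    Each argument is True if the locus is covered by that source:
    human full-length cDNAs, RefSeq, or Ensembl (hypothetical inputs).
    """
    n_sources = sum([flcdna, refseq, ensembl])
    if n_sources == 3:
        return "high"
    if n_sources == 2 or (n_sources == 1 and flcdna):
        return "medium"  # two of three, or full-length cDNAs only
    if n_sources == 1:
        return "low"     # RefSeq only or Ensembl only
    raise ValueError("locus is not covered by any source")


# Reported numbers of genes per category.
counts = {"high": 16754, "medium": 2854, "low": 3633}
total = sum(counts.values())
shares = {name: round(100 * n / total) for name, n in counts.items()}
print(total, shares)
```

The counts sum to the 23 241 genes reported, and the high category works out to 72% of the total, in agreement with the text.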


 
Figure 1. Clustering of human cDNA sequences. (A) Estimation of the number of human genes from full-length cDNAs and ESTs. An outline of our gene prediction method from the human full-length cDNAs and ESTs mapped to the human genome is shown schematically. For each of the predicted genes, classification reliability was evaluated manually. (B) Cover rate of FLJ EST sequences and (C) cover rate of FLJ full-length sequenced cDNAs. Results of the reliability analysis are shown by category, based on the cover rates of 1.45 million ESTs (B) and 55 000 full-length cDNAs (C).

 
To further assess these reliabilities, we next calculated the cover rate of genes using our cDNAs. First, the cover rate was calculated using our 1.45 million FLJ ESTs, and we found a positive correlation between these reliabilities and the cover rate of FLJ ESTs (Fig. 1B). Next, we calculated the cover rate of genes using our 55 000 FLJ human full-length cDNA sequences. In this case, we also found a positive correlation between the reliability and the cover rate, similar to that observed for the ESTs (Fig. 1C). Thus, we were able to verify reliability irrespective of whether we used the sequences of our ESTs or full-length cDNAs in the analysis.

3.2. Analysis of AS and functional classification of sequenced full-length cDNAs by GO
We selected 25 000 full-length cDNAs from among the identified genes by focusing on AS and subsequently sequenced them. From these cDNAs, we selected 11 769 human full-length cDNAs in which the ORF regions were predicted to be different from those of the known full-length cDNAs, and then classified them by GO according to their predicted functions. First, ESTs exhibiting a splicing pattern different from that of the known full-length cDNAs were selected and completely sequenced. From the sequence analysis, we were able to predict the ORF regions in only 30% of them (results not shown). Interestingly, a number of the cDNAs for which we were unable to predict the ORF region were thought to be produced by AS. However, because our aim was to predict the function of a gene from the sequence of its transcript, it was necessary to select protein-coding transcripts efficiently. It is difficult to predict the ORF region correctly from EST sequences lacking the TSS. However, our 5'-EST sequences not only contained the TSS but also contained sequence information on an average of 500 bases from the TSS. Therefore, we were able to correctly predict the ORF regions of our 5'-ESTs by using ATGpr.25 As a result, the proportion of clones containing unpredictable ORF regions decreased to ~10%. Moreover, by using tools such as TRins26 for inspecting the translated region and ALVISION27 for evaluating the novelty of amino acid sequences, we succeeded in identifying the ORF regions with high accuracy. Consequently, we obtained 11 769 human full-length cDNAs in which the ORF regions were predicted to be different from those of the known full-length cDNAs (Supplementary Table S3). Ninety-six percent of these cDNAs encoded proteins that differed in at least 10 amino acids from those encoded by their respective known full-length cDNAs, mainly because we selected them based on ORF regions altered as a result of AS. These full-length cDNAs covered 7025 of the 23 241 genes that we had originally identified.

Once it was established that human genes could produce multiple protein-coding transcripts, it was important to analyze their putative functions. The GO classification analysis was performed for all 11 769 of our full-length cDNAs using Pfam, and their predicted functions are summarized in Table 1. The classification results revealed that a large number of our cDNA clones fell under the GO molecular function categories ‘nucleotide binding’, ‘nucleic acid binding’, ‘protein binding’, ‘hydrolase activity’, ‘transferase activity’ and ‘oxidoreductase activity’. Because our 11 769 full-length cDNAs had ORF regions different from those of the known full-length cDNAs, we also analyzed their functions by predicting domains and motifs using Pfam, SOSUI and SignalP (Supplementary Table S3). Consequently, we discovered full-length cDNAs that encoded proteins with altered functional domains and signal sequences as a result of AS.



 
Table 1. Functional classification of the 11 769 full-length cDNAs based on the molecular function hierarchy of GO

 
3.3. Classification of splicing patterns of full-length cDNAs
Until now, the majority of the ESTs entered in the public databases have been 3'-ESTs. We succeeded in constructing full-length cDNA libraries efficiently by using the optimized oligo-capping method and obtained ~1.4 million 5'-ESTs of full-length cDNAs constructed by this method.18,19 Our 5'-EST sequences were especially useful for the analysis of TSSs because 90% or more of our cDNAs contained the TSS. We analyzed the splicing patterns of the 11 769 cDNAs by using the 5'-EST sequence data (Fig. 2). This analysis revealed that 3403 cDNAs, corresponding to ~30% of all cDNAs, were transcribed using alternative TSSs (Type A), and thus the predicted proteins contained new amino acid sequences at their N-terminal ends. In addition, 1962 of the Type A cDNAs (designated Type A1) contained an FEV, arising from transcripts that originated either from a TSS that had previously been overlooked because it mapped to an intron region of the genome or from a TSS that mapped upstream of the one analyzed before. Taken together, these results led to the discovery of new exons. We analyzed the expression profiles of the genes containing multiple TSSs and found that the same gene could code for proteins with diverse functions in different tissues through the use of alternative TSSs. There were 8277 cDNAs (i.e. ~70% of all the full-length cDNAs) that were transcribed from previously identified TSSs but contained a different ORF region because of AS; they were designated Type B. Because we used our 5'-EST data for the selection, many Type B cDNAs were predicted to contain N-terminal sequences different from those of the known cDNAs, except for a portion of the cDNAs that were either selected by PCR or found during sequencing analysis. To assess whether AS or the use of an alternative TSS could alter the function of the predicted protein, we compared the GO functional categories of Type A and Type B (Table 2). Our results showed that the majority of Type A cDNAs belonged to the GO molecular function categories of ‘neurotransmitter binding’, ‘enzyme activator activity’, ‘cyclase activity’, ‘ATPase activity, coupled to movement of substances’ and ‘GTPase regulator activity’. Thus, by using our 5'-EST data, a great deal of valuable information was obtained regarding the diversity of TSSs and of amino acid sequences at the N-terminal ends of proteins. However, since only a portion of the full-length cDNAs was selected for this analysis, information on sequence diversity in regions beyond 500 bases from the TSSs was not obtained. We believe that there are additional alternatively spliced transcripts that remain to be analyzed in future studies.
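
As a quick consistency check on the classification in this subsection, the reported counts can be tallied and converted to percentages. In the sketch below, the Type A2 count is derived by subtraction (it is not stated in the text), the 89 unclassified cDNAs come from the Figure 2 legend, and the `classify` helper with its two boolean inputs is a hypothetical simplification of the type definitions.

```python
TOTAL = 11769

# Counts reported in the text and in the Figure 2 legend.
counts = {"Type A": 3403, "Type B": 8277, "unclassified": 89}
type_a1 = 1962
type_a2 = counts["Type A"] - type_a1  # derived here, not reported


def percent(n, total=TOTAL):
    """Share of all sequenced cDNAs, rounded to one decimal place."""
    return round(100 * n / total, 1)


def classify(same_tss_as_known, has_fev):
    """Simplified reading of the type definitions (hypothetical inputs)."""
    if same_tss_as_known:
        return "Type B"   # known TSS, ORF altered by AS
    return "Type A1" if has_fev else "Type A2"


print(sum(counts.values()) == TOTAL)
print({name: percent(n) for name, n in counts.items()})
print(type_a2, classify(False, True), classify(True, False))
```

The three groups add up to the full set of 11 769 cDNAs, and the computed shares are consistent with the approximate 30% and 70% figures quoted for Type A and Type B.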


 
Figure 2. Classifications of the 11 769 full-length cDNAs based on splicing patterns. The 11 769 human full-length cDNAs were classified according to their TSS utilization. Type A: these cDNAs were derived from transcripts which were generated utilizing a TSS different than the previously analyzed TSS of the gene. Type A1: cDNAs contained a sequence variation known as FEV. Type A2: this class of cDNAs did not have the FEV feature. Type B: these cDNAs were derived from transcripts that were generated utilizing the same TSS as the previously analyzed TSS, but were found to be alternatively spliced. We could not classify 89 cDNAs because they coded for newly identified proteins.

 



 
Table 2. Functional classification of two types of splicing patterns of 11 769 full-length cDNAs based on GO category analysis

 
3.4. Analysis of genes showing tissue-specific expression
We analyzed the expression of genes producing multiple protein-coding transcripts by AS and found that many of these transcripts were expressed in specific tissues or cells, suggesting that genes likely use this diversity according to need and situation. We next analyzed the expression profiles of 10 069 cDNAs, corresponding to 5542 genes, out of the 11 769 full-length cDNAs identified in this study. Because our cDNA libraries were constructed using RNAs derived from more than 100 different types of tissues and cells, we used the 5'-EST data for analyzing gene expression. We then analyzed the gene expression profiles of Type A1 cDNAs containing FEV diversity and found that the FEVs of 261 cDNAs, corresponding to 155 genes, showed specific expression patterns that were different from those already obtained for the genes with alternative TSSs (Table 3). Thus, like the genes with alternative TSSs, the expression patterns of the genes with FEVs likely depended on the tissue and condition. Consequently, we found genes producing multiple protein-coding transcripts by AS.



 
Table 3. Expressions of a selected list of 261 FEV-containing cDNAs (155 genes)

 
3.5. Analysis of expression patterns of tissue-specific expressed genes
We quantified the tissue-specific expression of 13 of the 261 selected cDNAs by real-time PCR (Fig. 3). The results suggested a strong relationship between tissue-specific expression and the diversity of gene function or disease. For each gene, we compared the expression profile obtained with the TSS identified in this study with that obtained with the previously identified TSS. These results are summarized in Supplementary Table S6 and are discussed below in more detail.


 
Figure 3. Quantitative evaluation of selected genes by real-time PCR. Expression levels of the first exon regions of the selected genes were analyzed by real-time PCR. The data were normalized with respect to that of human GAPDH as described in the Materials and methods section. The expression levels of genes are presented on a log10 scale. cDNAs labeled ‘$$’ showed very low or undetectable expression. (A) FGF13, (B) OXR1, (C) C6orf142, (D) PLD5, (E) FGD4, (F) C6orf32. BW, brain, whole; BC, brain, cerebellum; BF, fetal brain; SP, spleen; BM, bone marrow; TH, thymus; OV, ovary; PR, prostate; UT, uterus; MT, mixture of tumor human tissues; MN, control, mixture of normal human tissues; KT, kidney tumor; LT, lung tumor.

 
First, FGF13 is a gene that belongs to the FGF family and is believed to play roles in cell proliferation and differentiation, and also in neuronal differentiation.28,29 The FLJ57884 and FLJ57068 cDNAs exhibited different ORF regions as a result of FEV and were splicing variants of the known FGF13 cDNA. The TSSs we found in each of them were located upstream of the known TSS of FGF13. Whereas the known TSS of FGF13 was highly expressed in both fetal and adult brains, the TSSs of both the FLJ57884 and FLJ57068 cDNAs were highly expressed only in the fetal brain. Moreover, the TSS of our FLJ57068 cDNA was also highly expressed in kidney cancer (Fig. 3A). Second, OXR1 is one of the oxidation stress receptivity genes localized in mitochondria.30 The known TSS of OXR1 was expressed at equal levels in various tissues, but the TSS we identified in the FLJ56044 cDNA was located upstream of the known TSS of OXR1 and was highly expressed in brain, kidney cancer and lung cancer (Fig. 3B). These results suggest that these two genes use different TSSs to regulate their expression levels in the brain. Moreover, our results also suggest that, for both genes, only one of the TSSs was preferentially recognized by the transcription machinery in cancerous tissue.

Third, C6orf142 (chromosome 6 ORF 142) is a gene of unknown function. The known TSS of C6orf142 was highly expressed in the heart. However, the TSS we identified in the FLJ58494 cDNA, which was located downstream of the previously identified TSS of C6orf142, was highly expressed in both fetal and adult brains (Fig. 3C). Fourth, PLD5 is one of the phospholipid-splitting enzymes presumably involved in intracellular signaling.31 Although the known TSS of PLD5 was expressed equally in various tissues, the TSS we identified in the FLJ57051 cDNA, which was located downstream of the previously identified TSS of PLD5, was highly expressed in the brain (Fig. 3D). Fifth, SPRED2 is a Ras inhibitory factor belonging to the Sprouty/Spred family.32 The TSS we identified in the FLJ52731 cDNA, which was located downstream of the known TSS of SPRED2, was highly expressed in the brain (Supplementary Table S6). Sixth, SEMA5B is a nerve guidance factor that is involved in organogenesis, angiogenesis and oncogenesis.33 The TSS we identified in the FLJ55460 cDNA, which was located downstream of the known TSS of SEMA5B, was also highly expressed in the brain (Supplementary Table S6). Seventh, CACNB3 is a calcium channel beta-3 subunit, which is involved in modifying the sympathetic nervous system, olfaction and the control of blood pressure.34 Although the known TSS of CACNB3 was highly expressed in both fetal and adult brains, the newly identified TSSs of the FLJ58949 and FLJ58411 cDNAs, both of which were located downstream of the known TSS of CACNB3, were expressed at a low level in the brain (Supplementary Table S6). These cDNAs exhibited different ORF regions as a result of AS. Eighth, BACE1 is a peptide hydrolase that cleaves the amyloid precursor protein and is one of the factors involved in Alzheimer's disease.35 The known TSS of BACE1 was expressed equally in various tissues. However, the TSS we identified in the FLJ54690 cDNA, which was located downstream of the known TSS of BACE1, was highly expressed in the brain (Supplementary Table S6). Thus, these six genes regulated their expression levels in the brain by each using a specific TSS.

In the ninth example, FGD4 is a gene that appears to be involved in the regulation of the actin cytoskeleton and cell shape and also to have various roles in proliferation, differentiation, transcriptional regulation and development.36 The known TSS of FGD4 was highly expressed in nervous system tissues such as brain and spinal cord, and in testis. However, the TSS we identified in the FLJ55905 cDNA, which was located downstream of the known TSS of FGD4, was highly expressed in immune system tissues such as bone marrow and spleen (Fig. 3E). In the tenth example, C6orf32 is a gene of unknown function whose expression level increases during myoblast differentiation in the embryo.37 The FLJ56038 and FLJ56137 cDNAs exhibited different ORF regions as a result of FEV and were splicing variants of the known C6orf32 cDNA. The known TSS of C6orf32 was expressed at equal levels in various tissues. However, the TSSs we found in the FLJ56038 and FLJ56137 cDNAs were located upstream of the known TSS of C6orf32, and both of these newly identified TSSs were highly expressed in immune system tissues such as bone marrow, spleen and thymus (Fig. 3F). In the eleventh example, PTPN4 is a gene belonging to the PTP (protein tyrosine phosphatase) family, whose members act in signal transduction and control various cellular processes such as cell proliferation, differentiation, the mitotic cycle and oncogenesis.38 The known TSS of PTPN4 was highly expressed in the brain, but the TSS we identified in the FLJ53929 cDNA, which was located downstream of the known TSS of PTPN4, was highly expressed in immune system tissues such as bone marrow and spleen (Supplementary Table S6). In the twelfth example, BTNL8 is one of the butyrophilin-like proteins and appears to be involved in conferring immunity.39 The known TSS of BTNL8 was expressed at equal levels in various tissues. However, the TSS we identified in the FLJ51528 cDNA, which was located downstream of the known TSS of BTNL8, was highly expressed in the lung and thymus (Supplementary Table S6). Thus, it seems that these four genes regulate their expression levels in immune system tissues by using specific TSSs.

In the thirteenth example, AKT1 is a gene involved in apoptosis and neuronal differentiation that may also have a role in schizophrenia, especially in the neurotransmission system.40 The TSS we identified in the FLJ53606 cDNA, which was located downstream of the known TSS of AKT1, was highly expressed in retinoic acid-induced NT2 cells (Supplementary Table S6). Thus, this gene appears to use a specific TSS during neuronal differentiation.

Thus, among the genes we analyzed in this study, the TSSs of a number of them showed specific expression patterns. These results suggest that a single gene can use alternative TSSs for tissue-specific transcription. We also found a close relationship between the predicted function of a gene and its tissue-specific expression. Thus, our results suggest a strong correlation between the mRNA diversity and the function of a gene.
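As an illustration of how a tissue-specific TSS could be flagged from 5'-EST counts, the following is a minimal Python sketch. It is not the procedure used in this study: the function names, the counts-per-million normalization, the 0.5 cutoff and the toy counts are all assumptions made for illustration only.

```python
def tissue_fractions(est_counts, library_sizes):
    """Return each tissue's share of the size-normalized 5'-EST signal for one TSS."""
    # Counts per million ESTs, so that deeply sequenced libraries do not dominate.
    cpm = {
        tissue: est_counts.get(tissue, 0) * 1_000_000 / size
        for tissue, size in library_sizes.items()
    }
    total = sum(cpm.values())
    if total == 0:
        return {tissue: 0.0 for tissue in library_sizes}
    return {tissue: value / total for tissue, value in cpm.items()}


def enriched_tissues(est_counts, library_sizes, threshold=0.5):
    """List tissues holding at least `threshold` of the normalized signal.

    The threshold is an assumed, arbitrary cutoff.
    """
    fractions = tissue_fractions(est_counts, library_sizes)
    return sorted(t for t, f in fractions.items() if f >= threshold)


# Hypothetical toy numbers, not data from this study.
LIBRARY_SIZES = {"fetal_brain": 100_000, "adult_brain": 100_000, "kidney": 50_000}
KNOWN_TSS = {"fetal_brain": 10, "adult_brain": 10, "kidney": 5}
NOVEL_TSS = {"fetal_brain": 9, "adult_brain": 1}

print(enriched_tissues(KNOWN_TSS, LIBRARY_SIZES))
print(enriched_tissues(NOVEL_TSS, LIBRARY_SIZES))
```

In this toy setting the first TSS is spread evenly across tissues after normalization, whereas the second is concentrated in a single tissue, which is the kind of contrast described for the known and newly identified TSSs above.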

3.6. Construction and use of the FLJ Human cDNA Database
We constructed the FLJ Human cDNA Database ver. 3.0 (http://flj.lifesciencedb.jp) based on the results of our analysis of the variable protein-coding transcripts produced from genes by AS. A detailed description of the DB is available at the DB website. The DB graphically displays the mapping of all the full-length cDNAs onto the human genome together with their ORF regions and thus provides extensive information on mRNA diversity. Moreover, the DB contains not only sequence information on full-length human cDNAs but also sequence information on a large number of human ESTs generated using the oligo-capping method, which allows users to obtain information on the ESTs mapped to the same genome locus. Because the average length of our EST sequences was ~500 bases, the diversity of mRNAs produced as a result of AS can be analyzed efficiently using this information. Because we were able to identify TSSs accurately using our 5'-EST data, we believe that these data can be used to understand the relationship between the variable utilization of TSSs and the biological functions of genes. Moreover, one could analyze the expression profiles of the transcribed regions of genes using our high-accuracy 5'-EST sequences, although in some cases the results might differ from those obtained using 3'-EST data.
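To make the idea of deriving TSSs from mapped 5'-EST ends concrete, here is a small Python sketch. It is a hypothetical illustration rather than the method implemented in the DB: the gap-based clustering rule, the `max_gap` value of 500 and the function names are assumptions, and the coordinates are invented.

```python
def cluster_starts(starts, max_gap=500):
    """Group 5'-end genome coordinates into candidate TSS clusters.

    A new cluster begins whenever the distance to the previous sorted
    position exceeds `max_gap` (an assumed, arbitrary window).
    """
    clusters = []
    for pos in sorted(starts):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def relative_position(tss, known_tss, strand):
    """Classify a TSS as upstream or downstream of a known TSS.

    Direction is measured along the direction of transcription, so the
    sign of the coordinate difference is flipped on the minus strand.
    """
    if tss == known_tss:
        return "same"
    offset = tss - known_tss if strand == "+" else known_tss - tss
    return "downstream" if offset > 0 else "upstream"


# Invented coordinates for illustration only.
ends = [1000, 1010, 1200, 5000, 5100]
print(cluster_starts(ends))
print(relative_position(900, 1000, "+"))
print(relative_position(900, 1000, "-"))
```

The strand handling matters because a TSS with a smaller genome coordinate lies upstream of the known TSS on the plus strand but downstream of it on the minus strand.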

Despite these useful features, our database specializes in 5'-end sequences, and therefore these data are not suitable for predicting AS affecting the C-terminal end of the encoded protein. In addition, much AS-related information still remains to be extracted from our 1.4 million cDNA resources, because not all of them were sequenced to completion. Because our cDNA resources are mostly full-length cDNAs including the TSS and the polyA site, complete sequencing of these cDNA clones will add to our understanding of mRNA diversity. All fully sequenced FLJ cDNA clones are available from the National Institute of Technology and Evaluation (http://www.nite.go.jp/). We will continue to add new information on our resources to the database, and these resources will be very useful in the analysis of gene functions.

Because our interest was in mRNAs with ORF regions different from those of already known mRNAs, we stopped sequencing a cDNA once we found that the predicted ORF region of the transcript did not differ from that of the known mRNA (for instance, where the alternative TSS existed only in the 5'-untranslated region). We found, however, that there is tissue specificity in the expression patterns of these genes even where the variation in TSS existed in the 5'-untranslated region (results not shown). Collectively, these results suggest that, depending on the situation and environment, the transcription machinery utilizes alternative TSSs to regulate the expression of a transcript, even when the translated protein is the same. These results are also included in our DB. We also did not complete sequencing of the clones for which we were unable to predict the ORF regions of their mRNAs. However, we have included these clones in the DB as well, in the belief that new and useful information could be obtained by analyzing them.

We discovered that many genes had mRNA diversity due to, for example, FEVs. We also found many tissue-specific splicing patterns. In particular, in the case of the FEVs that we analyzed, genes used different regions of the genome locus as the first exon, which seemed to depend on the tissue and its condition. We also discovered genes whose TSSs were located far apart on the same genome locus. In these cases, it is highly possible that transcription from each TSS is controlled by distinct transcription factors. As the mechanisms for controlling transcription are closely related to function, understanding these mechanisms may make it possible to artificially control the expression of an appropriate transcript in the future.

In this study, we have identified genes that produce multiple transcripts, and we believe that each of these genes is transcribed into an appropriate transcript according to need and circumstance. It will now be important to know whether there is any correlation between the expression of one of the transcripts produced by a gene and a disease. For example, in the case of the transcripts containing FEVs, which we analyzed in detail, only the first exon regions differed from those of the previously characterized transcripts. Since the first exon regions of these transcripts are unique, it is possible to distinguish them easily from the other transcripts. It may therefore be possible to control the expression of a specific mRNA from among the group of mRNAs transcribed from a gene by targeting the first exon. As we accumulate more information on the mRNA diversity of genes using approaches similar to those described in this study, we might be able to identify candidate genes as novel targets for the development of drugs with fewer side effects.


    Supplementary data
Supplementary data are available at www.dnaresearch.oxfordjournals.org.


    Funding
This work was partly supported by a grant from the New Energy and Industrial Technology Development Organization (NEDO) project of the Ministry of Economy, Trade and Industry of Japan.


    Acknowledgements
We thank the members of the NEDO functional analysis of protein and research application project for cDNA sequencing and clone stock, and especially M. Yamazaki, K. Watanabe, A. Sugiyama, Y. Ono, T. Takayama (Japan Biological Informatics Consortium) and K. Fujita (National Institute of Technology and Evaluation, Japan). We also thank the members of the NEDO human full-length cDNA sequencing project for EST sequencing.


    Footnotes
 
* Corresponding author. E-mail: tisogai{at}mol.f.u-tokyo.ac.jp

Edited by Osamu Ohara


    References

  1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature (2004) 431:931–45.
  2. Lander E.S., Linton L.M., Birren B., et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
  3. Lopez A.J. Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation. Annu. Rev. Genet. (1998) 32:279–305.
  4. Black D.L. Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell (2000) 103:367–70.
  5. Modrek B., Lee C. A genomic view of alternative splicing. Nat. Genet. (2002) 30:13–9.
  6. Stamm S. Signals and their transduction pathways regulating alternative splicing: a new dimension of the human genome. Hum. Mol. Genet. (2002) 11:2409–16.
  7. Bracco L., Kearsey J. The relevance of alternative RNA splicing to pharmacogenomics. Trends Biotechnol. (2003) 21:346–53.
  8. Landry J.R., Mager D.L., Wilhelm B.T. Complex controls: the role of alternative promoters in mammalian genomes. Trends Genet. (2003) 19:640–8.
  9. Zhang T., Haws P., Wu Q. Multiple variable first exons: a mechanism for cell- and tissue-specific gene regulation. Genome Res. (2004) 14:79–89.
  10. Wu Q., Maniatis T. Large exons encoding multiple ectodomains are a characteristic feature of protocadherin genes. Proc. Natl Acad. Sci. USA (2000) 97:3124–9.
  11. Strassburg C.P., Oldhafer K., Manns M.P., Tukey R.H. Differential expression of the UGT1A locus in human liver, biliary, and gastric tissue: identification of UGT1A7 and UGT1A10 transcripts in extrahepatic tissue. Mol. Pharmacol. (1997) 52:212–20.
  12. Wang E.T., Sandberg R., Luo S., et al. Alternative isoform regulation in human tissue transcriptomes. Nature (2008) 456:470–6.
  13. Licatalosi D.D., Mele A., Fak J.J., et al. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature (2008) 456:464–9.
  14. Ota T., Suzuki Y., Nishikawa T., et al. Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat. Genet. (2004) 36:40–5.
  15. Otsuki T., Ota T., Nishikawa T., et al. Signal sequence and keyword trap in silico for selection of full-length human cDNAs encoding secretion or membrane proteins from oligo-capped cDNA libraries. DNA Res. (2005) 12:117–26.
  16. Goshima N., Kawamura Y., Fukumoto A., et al. Human protein factory for converting the transcriptome into an in vitro-expressed proteome. Nat. Methods (2008) 5:1011–7.
  17. Kimura K., Wakamatsu A., Suzuki Y., et al. Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res. (2006) 16:55–65.
  18. Maruyama K., Sugano S. Oligo-capping: a simple method to replace the cap structure of eukaryotic mRNAs with oligoribonucleotides. Gene (1994) 138:171–4.
  19. Suzuki Y., Sugano S. Construction of a full-length enriched and a 5'-end enriched cDNA library using the oligo-capping method. Methods Mol. Biol. (2003) 221:73–91.
  20. Suzuki Y., Yoshitomo-Nakagawa K., Maruyama K., Suyama A., Sugano S. Construction and characterization of a full length-enriched and a 5'-end-enriched cDNA library. Gene (1997) 200:149–56.
  21. Suzuki Y., Ishihara D., Sasaki M., et al. Statistical analysis of the 5' untranslated region of human mRNA using ‘Oligo-Capped’ cDNA libraries. Genomics (2000) 64:286–97.
  22. Nishikawa T., Ota T., Kawai Y., et al. Database and analysis system for cDNA clones obtained from full-length enriched cDNA libraries. In Silico Biol. (2002) 2:5–18.
  23. Nishikawa T., Ota T., Kawai Y., et al. Comparison of sequences of cDNA clones obtained from oligo-capping cDNA libraries with those from unigene. DNA Res. (2001) 8:255–62.
  24. Kimura K., Nishikawa T., Nagai K., Sugano S., Isogai T. Intris: A viewer for cDNA-genome alignments enabling efficient detection of splicing variants and expression profiles. Genome Inform. (2002) 13:548–50.
  25. Salamov A.A., Nishikawa T., Swindells M.B. Assessing protein coding region integrity in cDNA sequencing projects. Bioinformatics (1998) 14:384–90.
  26. Kimura K., Nishikawa T., Nagai K., Sugano S., Nomura N., Isogai T. The translated region inspector for cDNA sequences. Genome Inform. (2003) 14:456–7.
  27. Yamamoto J., Hatano N., Araki H., et al. A cDNA evaluation system for highly efficient sequencing of splicing variant cDNAs. Genome Inform. (2003) 14:430–1.
  28. Facchiano A., Russo K., Facchiano A.M., et al. Identification of a novel domain of fibroblast growth factor 2 controlling its angiogenic properties. J. Biol. Chem. (2003) 278:8751–60.
  29. Greene J.M., Li Y.L., Yourey P.A., et al. Identification and characterization of a novel member of the fibroblast growth factor family. Eur. J. Neurosci. (1998) 10:1911–25.
  30. Durand M., Kolpak A., Farrell T., et al. The OXR domain defines a conserved family of eukaryotic oxidation resistance proteins. BMC Cell Biol. (2007) 8:13.
  31. Foster D.A., Xu L. Phospholipase D in cell proliferation and cancer. Mol. Cancer Res. (2003) 1:789–800.
  32. Nonami A., Kato R., Taniguchi K., et al. Spred-1 negatively regulates interleukin-3-mediated ERK/mitogen-activated protein (MAP) kinase activation in hematopoietic cells. J. Biol. Chem. (2004) 279:52543–51.
  33. Adams R.H., Betz H., Puschel A.W. A novel class of murine semaphorins with homology to thrombospondin is differentially expressed during early embryogenesis. Mech. Dev. (1996) 57:33–45.
  34. Yamada Y., Masuda K., Li Q., et al. The structures of the human calcium channel alpha 1 subunit (CACNL1A2) and beta subunit (CACNLB3) genes. Genomics (1995) 27:312–9.
  35. De Pietri Tonelli D., Mihailovich M., Di Cesare A., Codazzi F., Grohovaz F., Zacchetti D. Translational regulation of BACE-1 expression in neuronal and non-neuronal cells. Nucleic Acids Res. (2004) 32:1808–17.
  36. Chen X.M., Splinter P.L., Tietz P.S., Huang B.Q., Billadeau D.D., LaRusso N.F. Phosphatidylinositol 3-kinase and frabin mediate Cryptosporidium parvum cellular invasion via activation of Cdc42. J. Biol. Chem. (2004) 279:31671–8.
  37. Yoon S., Molloy M.J., Wu M.P., Cowan D.B., Gussoni E. C6ORF32 is upregulated during muscle cell differentiation and induces the formation of cellular filopodia. Dev. Biol. (2007) 301:70–81.
  38. Young J.A., Becker A.M., Medeiros J.J., et al. The protein tyrosine phosphatase PTPN4/PTP-MEG1, an enzyme capable of dephosphorylating the TCR ITAMs and regulating NF-kappaB, is dispensable for T cell development and/or T cell effector functions. Mol. Immunol. (2008) 45:3756–66.
  39. Rhodes D.A., Stammers M., Malcherek G., Beck S., Trowsdale J. The cluster of BTN genes in the extended major histocompatibility complex. Genomics (2001) 71:351–62.
  40. Brugge J., Hung M.C., Mills G.B. A new mutational AKTivation in the PI3K pathway. Cancer Cell (2007) 12:104–7.
