DNA Research Advance Access originally published online on May 16, 2008
DNA Research 2008 15(3):123-136; doi:10.1093/dnares/dsn010
Fine Expression Profiling of Full-length Transcripts using a Size-unbiased cDNA Library Prepared with the Vector-capping Method
1 Department of Rehabilitation Engineering, Research Institute, National Rehabilitation Center for Persons with Disabilities, 4-1 Namiki, Tokorozawa, Saitama 359-8555, Japan
2 Department of Biological Applied Chemistry, Graduate School of Engineering, Toyo University, Kujirai 2100, Kawagoe, Saitama 350-8585, Japan
Received 24 December 2007; accepted 10 April 2008.
| Abstract |
|---|
Recently, we have developed a vector-capping method for constructing a full-length cDNA library. In the present study, we performed in-depth analysis of the vector-capped cDNA library prepared from a single type of cell. As a result of single-pass sequencing analysis of 24 000 clones randomly isolated from the unamplified library, we identified 19 951 full-length cDNA clones whose intactness was confirmed by the presence of an additional G at their 5' end. The full-length cDNA content was >95%. Mapping these sequences to the human genome, we identified 4513 transcriptional units that include 36 antisense transcripts against known genes. Comparison of the frequencies of abundant clones showed that the expression profiles of different libraries, including the distribution of transcriptional start sites (TSSs), were reproducible. The analysis of long-sized cDNAs showed that this library contained many cDNAs with long inserts, up to the 11 199-bp cDNA of golgin B, including multiple splicing variants for filamin A and filamin B. These results suggest that the size-unbiased full-length cDNA library constructed using the vector-capping method will be an ideal resource for fine expression profiling of transcriptional variants with alternative TSSs and alternative splicing.
Key words: full-length cDNA; expression profile; transcriptional start site; alternative splicing; antisense transcript
| 1. Introduction |
|---|
The Human Genome Project disclosed sequence information on an entire set of elements that construct and regulate each human cell, tissue, and organ.1
Large-scale analyses using these technologies have produced an enormous amount of data on human full-length transcripts,15–17 as well as on expression profiles,18–20 and have revealed the following facts. (i) Multiple transcripts of different sizes are generated from the same gene locus by alternative promoter usage,21 diverse transcriptional initiation,22 alternative splicing,23 and alternative polyadenylation.24 (ii) There is an unexpectedly large number of sense–antisense pairs of transcripts.25,26 (iii) Many non-coding RNAs are transcribed.15,16 (iv) A considerable number of transcripts estimated by expressed sequence tags (ESTs) of the UniGene database27 or by genome tiling arrays28,29 still remain to be annotated because of their low abundance.
These data sets were accumulated using diverse cell types or tissues, because the first phase of the project was aimed at comprehensive gene collection. Hereafter, a second phase of investigation requires in-depth analysis of transcripts in a single type of cell population to elucidate the intracellular transcriptional regulatory network in a given cell. However, the full-length cDNA libraries constructed using the conventional methods do not meet the requirements for in-depth analyses because of the low complexity of the constituents. Furthermore, the presence of transcriptional variants raises a serious problem regarding the use of the conventional method for expression profiling of genes. If multiple variants transcribed from the same gene locus have a different function, we need to determine the expression profile of each variant to elucidate their role in the regulation network. However, the conventional methods that measure the amount of the limited region of each transcript by counting the number of sequence tags or by quantifying the hybridization signal cannot distinguish these variants without the precedent information of the full sequence of all variants expressed in a given cell.
The two problems described above, the low complexity of the full-length cDNA library and the presence of multiple variants, can be solved by analyzing all variants that comprise a bias-free full-length cDNA library constructed from a single type of cell. Unfortunately, the conventional methods for synthesizing full-length cDNA were unsuitable for constructing a bias-free cDNA library that reflects an expression level of each transcript in the cell, because they have many processes during which the ratio of components may change. It is especially difficult to construct a bias-free library containing rare or long-sized full-length cDNA clones without changing their content.
Recently, we have developed the vector-capping (V-capping) method for synthesizing full-length cDNA.30
This method is expected to be suitable to construct a bias-free cDNA library, because it consists of only three steps: the first-strand cDNA synthesis using a vector primer, self-ligation of the cDNA-vector construct, and the replacement of mRNA by the second-strand cDNA. The previous paper showed that we were able to construct the cDNA libraries containing full-length cDNA clones of >95% content without any selection procedure for full-length cDNAs. The further advantage of this library is that we can validate full-length cDNA by the presence of an additional G at its 5' end.
In this paper, we performed in-depth analysis of the cDNA libraries constructed using the V-capping method from the total RNA isolated from human retinal pigment epithelial cell line ARPE-19, and demonstrated that the constructed library was useful as a starting resource not only for the comprehensive collection of full-length cDNA clones, but also for fine expression profile analysis of transcripts expressed in a single type of cell, including variants generated by alternative promoter usage, alternative transcriptional initiation, alternative splicing, and alternative polyadenylation.
| 2. Materials and methods |
|---|
2.1. Cell culture and RNA preparation
Human retinal pigment epithelium (RPE) cell line ARPE-19 was obtained from American Type Culture Collection (Manassas, VA, USA). ARPE-19 cells were cultured in Dulbecco's modified Eagle's medium: nutrient mixture F-12 (Invitrogen, Carlsbad, CA, USA) containing 10% fetal bovine serum. The cells were incubated for 4 days to confluence in a humidified atmosphere of 5% CO2 and 95% air at 37°C. The cells were harvested by trypsinization. Total RNA was isolated using ISOGEN (NIPPON GENE, Tokyo, Japan).
2.2. Vector primer
A plasmid vector pGCAP10 was constructed by substituting the cloning site EcoRI–AflII–SwaI–KpnI of pGCAP1 (ref. 30) with SwaI–EcoRI–FseI–EcoRV–KpnI. The nucleotide sequence of pGCAP10 is available from GenBank/EMBL/DDBJ under accession no. AB371573. The plasmid pGCAP10 was digested with KpnI and tailed with ∼60 nucleotides of dT using terminal deoxynucleotidyl transferase (Takara Bio, Ohtsu, Shiga, Japan) according to the Okayama and Berg method.31 After digestion with EcoRV, the dT-tailed plasmid vector was purified on agarose gel and used as a vector primer. A cDNA insert can be cut out from the vector with two restriction enzymes that recognize 8-nucleotide sites, SwaI and NotI.
2.3. cDNA synthesis with the V-capping method
Two libraries, Lib-1 (ARe) and Lib-2 (ARi), were constructed using the V-capping method.30
Lib-1 has already been described in the previous paper.30
Lib-2 was different from Lib-1 in terms of the source of total RNA, a pGCAP10-derived vector primer, and a modified protocol. The experimental conditions were the same as described in the previous paper. A mixture of 5 µg of total RNA and 0.15 µg of pGCAP10-derived vector primer was incubated at 65°C for 5 min. The first-strand cDNA was synthesized using SuperScript III™ reverse transcriptase (Invitrogen). The reaction mixture was incubated at 45°C for 3 h. After phenol/chloroform extraction followed by ethanol precipitation, the pellet was dissolved in water. The next step in the original protocol is self-ligation with T4 RNA ligase (Takara Bio). The present protocol includes an EcoRI digestion step before self-ligation. The EcoRI digestion was performed in 200 µL of a reaction mixture containing 50 mM Tris–HCl (pH 7.5), 10 mM MgCl2, 1 mM dithiothreitol, 100 mM NaCl, and 0.2 U/µL of EcoRI (Takara Bio). The reaction mixture was incubated at 37°C for 1.5 h. After phenol/chloroform extraction followed by ethanol precipitation, the pellet was dissolved in water. Self-ligation and second-strand synthesis were performed in the same way as in the previous paper.30
2.4. Construction of cDNA library
Transformation of Escherichia coli cells DH12S was performed using an electroporation method as previously described.30
Transformants were plated on LB agar without amplification. Colonies grown on the plates were picked manually or using a Flexys Colony Picker (Genomic Solutions, Ann Arbor, MI, USA) and suspended in 96-well or 384-well plates. After incubation and the addition of 50% glycerol, the original plates were stored at –80°C.
2.5. Plasmid isolation and sequencing
The isolated plasmid DNA or DNA amplified using the illustra TempliPhi™ DNA amplification kit (GE Healthcare, Uppsala, Sweden) was used as a template for sequencing. DNA sequencing from the 5' end of the cDNA insert was carried out with a capillary DNA sequencer (Applied Biosystems Inc., Foster City, CA, USA) using a BigDye™ Terminator Cycle sequencing FS Ready reaction kit. The full sequence of the cDNA insert was determined by a primer walking method.
2.6. BLAST search and annotation
First, the 5'-end sequences were used to query our custom database for human full-length cDNA clones (Homo-Protein cDNA bank)4 using the GENETYX®-PDB software (GENETYX Co., Tokyo, Japan). Most of the abundant genes, ribosomal RNAs, and mitochondria-derived sequences were identified by this search. Sequences not matching entries in our custom database were used to query the NCBI Human Genome database (National Center for Biotechnology Information, Bethesda, MD, USA) with the BLAST algorithm.32
Each search was carried out manually, and the sequence alignment and map shown on the NCBI's Map Viewer were checked visually by us. Most sequences were mapped to the first exon of a known gene locus. If the query sequence was mapped to the upstream region of a known gene locus in the same direction, the sequence was assigned to that gene. Through the websites linked to the Map Viewer, including Entrez Gene33 and UniGene,27 we retrieved information on gene name, gene symbol, gene ID, chromosomal location, and RefSeq34 accession number. Sequences not mapped to the known gene locus were BLAST-searched against the NCBI database, including non-redundant nucleotide sequences and ESTs. EST sequences not included in Entrez Gene and the determined full sequences of long-sized cDNAs were deposited in GenBank/EMBL/DDBJ under accession numbers AB371430–AB371572 and AB371574–AB371588, respectively.
2.7. Estimation of the total number of genes composing libraries
The total number of genes constituting the library was estimated according to two approaches used for species richness estimation: non-sampling-based extrapolation and statistical sampling approaches.35
The former was performed by curve fitting to a gene-accumulation curve using asymptotic models, including negative exponential models and hyperbolic models.35
The curve fitting was carried out using software KaleidaGraph (Synergy Software, Reading, PA, USA). The latter approach used an abundance-based coverage estimator model ACE-1, a modified ACE for highly heterogeneous communities.36
The calculation was done using the SPADE (Species Prediction and Diversity Estimation) algorithm.37
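As an illustration of the statistical sampling approach, the following is a minimal sketch of the basic abundance-based coverage estimator (ACE). It is not the ACE-1 variant used in this study and does not reproduce the SPADE program; the function name, the rare-gene cutoff of 10, and the fallback for zero coverage are assumptions made for the example.

```python
from collections import Counter


def ace_estimate(clone_counts, rare_cutoff=10):
    """Basic ACE richness estimate from per-gene clone counts.

    clone_counts: iterable with the number of clones observed for each gene.
    rare_cutoff:  genes with at most this many clones are treated as "rare"
                  (10 is the conventional choice, assumed here).
    """
    counts = [c for c in clone_counts if c > 0]
    rare = [c for c in counts if c <= rare_cutoff]
    s_abund = len(counts) - len(rare)   # genes above the cutoff
    s_rare = len(rare)                  # genes at or below the cutoff
    n_rare = sum(rare)                  # clones belonging to rare genes
    freq = Counter(rare)                # freq[i] = genes seen exactly i times
    f1 = freq.get(1, 0)                 # singletons

    # Assumed fallback: with no rare genes, or only singletons, the sample
    # coverage is undefined or zero, so return the observed gene count.
    if n_rare == 0 or f1 == n_rare:
        return float(len(counts))

    coverage = 1.0 - f1 / n_rare
    pair_sum = sum(i * (i - 1) * f for i, f in freq.items())
    gamma_sq = max(
        (s_rare / coverage) * pair_sum / (n_rare * (n_rare - 1)) - 1.0, 0.0
    )
    return s_abund + s_rare / coverage + (f1 / coverage) * gamma_sq


# Hypothetical clone counts for seven genes (not data from this study).
toy_estimate = ace_estimate([1, 1, 1, 2, 2, 3, 20])
```

The estimate exceeds the observed gene count whenever singletons are present, which is the behaviour expected for a library that has not been sampled to saturation.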
2.8. Quantitative real-time PCR
First-strand cDNA was synthesized with oligo(dT)30 as a primer from 20 µg of total RNA using SuperScript III™ reverse transcriptase (Invitrogen), and then purified by a Wizard PCR Preps DNA Purification System (Promega, Madison, WI, USA). Real-time PCR was performed using TaqMan Universal Master Mix (Applied Biosystems) on an ABI PRISM 7000 Sequence Detection System (Applied Biosystems) according to the manufacturer's instructions. One microliter of diluted cDNA, equivalent to 300 ng of the initial total RNA template, was used in each reaction. Probes and primers designed by TaqMan Gene Expression Assays (Applied Biosystems) were used for the assays of ACTB (Hs99999903_m1), CFL1 (Hs00830568_g1), FLNA (Hs99999905_m1), FLNB (Hs00181698_m1), GAPDH (Hs99999905_m1), GUK1 (Hs00176133_m1), MYH9 (Hs00159522_m1), and RAI14 (Hs00210238_m1). The expression level was calculated based on a standard curve prepared for each gene using a plasmid with each cDNA as a template.
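The standard-curve calculation can be sketched as follows, assuming the usual linear relation between the threshold cycle (Ct) and the log10 of the template copy number. The dilution series, Ct values, and function names below are hypothetical and are not taken from this study.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the template copy number."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical plasmid dilution series (copies per reaction) and measured Ct.
standards = [1e3, 1e4, 1e5, 1e6]
cts = [30.0, 26.7, 23.4, 20.1]

slope, intercept = fit_standard_curve(standards, cts)
estimate = copies_from_ct(25.05, slope, intercept)  # unknown sample, Ct 25.05
```

Because each gene has its own plasmid standard, a separate slope and intercept would be fitted per gene before comparing mRNA contents across genes.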
| 3. Results |
|---|
3.1. cDNA Library
Two cDNA libraries, Lib-1 and Lib-2, were constructed using the V-capping method from the total RNA isolated from ARPE-19. The construction of Lib-1 and part of its analysis were described in a previous paper.30
Table 1 shows the contents of each library classified by single-pass sequencing analysis. More than 90% of clones provided the high-quality sequence data necessary for sequence analysis. The unreadable sequences may result from (i) deletion of a sequencing primer site on the vector, (ii) mixing of different clones, or (iii) failure of template DNA preparation. Many cases were attributed to the first reason because they showed no sequencing signals and could not be cut by restriction enzymes adjacent to the upstream end of the cDNA insert. Each library contained insert-free vectors (3.3–3.9% in content), which may result from uncut vectors that escaped removal during the vector primer preparation process.
Lib-1 contained clones carrying a dT tail at the 5' end (1.9% in content). Inspection of the downstream sequence of these clones showed that they lacked a poly(A) tail and contained the inversely inserted cDNA whose 5' end was joined to the KpnI-cut end of the vector where the 3'-protruding bases were deleted. These clones may be generated by use of an aberrant vector primer that has a dT tail at the opposite end from the intended one, implying that the dT-tail addition to only one end of the vector plasmid occasionally occurred at the dT-tailing step with terminal deoxynucleotidyl transferase. Although the dT tail added to the opposite end should be removed by EcoRV digestion after the dT-tailing reaction, some must remain uncut. These clones starting with the dT tail could be removed by adding an EcoRI digestion step before self-ligation, as shown in the modified protocol. In fact, it worked so well that these artifacts drastically decreased to 0.16% in the ARi library and to 0.13% in the ARiS library.
The same mechanism can also explain the addition of a vector-derived sequence, ATCCTG in the case of using pKA1U5 as a vector primer, adjacent to the 5' end of ∼5% of cDNA clones isolated from Lib-1. In this case, the opposite end might not have been dT-tailed and in addition might have escaped EcoRV digestion. Although the EcoRI digestion step in the modified protocol was expected to remove this kind of additional sequence, 2.8% of clones in Lib-2 still had a residual sequence (CGGCCGGCCGAT) derived from the vector sequence located between the EcoRI and EcoRV sites because of incomplete digestion with EcoRI.
3.2. Assessment of full-length cDNA
The 5'-terminal sequences of cDNA inserts were used for the database search. Clones having only poly(A) or a sequence of several bases with poly(A) were classified as truncated cDNA, because these clones might be derived from degraded mRNA. Sequence similarity was searched against our custom human full-length cDNA database, the NCBI RefSeq database, and the human genome database using the BLAST algorithm. The libraries contained cDNA clones for mitochondrial genome-derived transcripts (∼1% of cDNA-carrying clones) and rRNA (<0.06%). Except for mitochondrial clones, all sequences could be mapped to the human genome.
Most of the full-length cDNA clones identified as a known gene have a transcriptional start site (TSS) near to that of RefSeq. However, some sequences did not match the 5'-terminal sequence of RefSeq, presumably because our cDNA had a longer 5' UTR than RefSeq or was transcribed by the usage of an alternative promoter. These sequences were mapped on the human genome and the location of the sequence was determined. If the query sequence was located near the upstream region of the first exon of RefSeq or could be linked to the RefSeq via other mRNA or EST sequences that partly shared the query sequence, the clone was assigned to the gene for the corresponding RefSeq even though there was no shared sequence between the query and RefSeq.
Consequently, 8275 clones from the ARe/ARf library, 5586 from ARi, and 6090 from ARiS were identified as a full-length cDNA. Most clones (93.6% for Lib-1, 92.3% for Lib-2) had an additional G or NG (N: T, TT, G, etc.) at their 5' end, which is a requisite for full-length cDNA starting from the cap site when using the V-capping method. The previous paper showed that some full-length cDNAs had no additional G.30
Thus, if a cDNA without a 5'-end G started in the upstream region of the first exon of a known gene, it was accepted as a full-length cDNA for the corresponding gene. Although some cDNAs contained a repetitive sequence such as an Alu repeat, they were also precisely mapped on the genome. Most of the truncated cDNAs existed as a short form with a poly(A) tail. Even such a short cDNA was classified as full-length when it was mapped to a region that was not a known gene locus and had an additional G at the 5' end. Consequently, the content of full-length clones among all clones carrying a cDNA insert [including those with only a poly(A) insert] was calculated to be 95.5% for ARe/ARf, 95.2% for ARi, and 95.1% for ARiS.
Out of 19 951 full-length cDNA clones, 1123 clones (5.6%) lacked an additional 5'-end G. These 5'-G-free clones were classified into two groups. One group (625 clones) consisted of clones starting with a nucleotide A. Another group (309 clones) belonged to the 5'-terminal oligopyrimidine tract (5'-TOP) gene family, which starts with a pyrimidine-rich sequence and predominantly includes ribosomal proteins. The genes that contained more than three 5'-G-free clones (G–) with the same TSS are listed in Supplementary Table 1. It should be noted that G-added clones (G+) corresponding to each G– clone were obtained except for NDUFB11. The content of G– clones was 0.04–0.31 for 16 kinds of 5'-TOP genes. On the other hand, the G– content for 12 out of 15 A-starting genes was 0.38–1.0, higher than for 5'-TOP genes. Although we could not find any conserved sequence in the 5' end of the G-free A-starting genes, the following finding suggests that the 5'-end sequence affected the addition or elimination of a cap structure. We obtained 82 clones starting with 5'-ACCACGCACG... for MT2A, out of which 44 had an additional G and 38 were G-free as shown in Supplementary Table 1. In addition, we obtained 43 clones starting with the fourth nucleotide A of the previous MT2A clone, i.e. 5'-ACGCACG.... Interestingly, all 43 clones had an additional G, suggesting that the presence of the 5'-end three-nucleotide sequence resulted in the production of G-free clones.
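The G– content quoted above is the fraction of clones sharing a TSS that lack the additional 5'-end G. A minimal sketch of that arithmetic, using only the clone counts given in the text (the function name is an assumption for illustration):

```python
def g_minus_fraction(g_plus, g_minus):
    """Fraction of clones lacking the additional 5'-end G."""
    total = g_plus + g_minus
    if total == 0:
        raise ValueError("no clones to evaluate")
    return g_minus / total


# MT2A clones starting with 5'-ACCACGCACG...: 44 G+ and 38 G- out of 82.
mt2a_long = g_minus_fraction(44, 38)

# MT2A clones starting at the fourth nucleotide (5'-ACGCACG...): all 43 G+.
mt2a_short = g_minus_fraction(43, 0)

# Whole collection: 1123 G- clones out of 19 951 full-length clones.
overall = g_minus_fraction(19951 - 1123, 1123)
```

For the longer MT2A start site the fraction is about 0.46, within the 0.38–1.0 range reported for A-starting genes, whereas the start site three nucleotides downstream gives 0.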
3.3. Gene annotation
Mapping a total of 19 951 full-length cDNA sequences to the genome revealed that they were classified into 4513 kinds of transcriptional units (i.e. genes), of which 4370 (96.8%) were included in Entrez Gene. All genes are listed in the order of GeneID in Supplementary Table 2. The list contains the symbol, GeneID, and name, which were retrieved from Entrez Gene, and in addition the accession number of the RefSeq, mRNA size, chromosomal location, and number of clones obtained from each library. Of 4370 genes, 4271 had an accession number with prefix "NM_" that indicates mRNA. The remaining 99 genes (2.3%) were possible non-coding genes.
3.4. Expression profile
Fig. 1 shows the frequency distribution of abundant full-length genes obtained from three libraries. The most abundant genes were GAPDH and FTH1 (each 248 clones, 1.2% in content) followed by ACTB, ACTG1, EEF1A1, VIM, RPL41, MT2A, RPL1, CRYAB, TMSB10, RPS3, and RPL10, each of which gave ≥100 clones (0.5% in content). The number of abundant genes with ≥0.05% content (10 clones) was 310 (6.9% of identified genes). The major components were ribosomal proteins (79 kinds). Of the total genes, 2221 (49.2%) were obtained as a non-redundant transcript, i.e. were represented by a single clone.
In order to examine whether the expression frequency is biased between libraries, we compared the expression profiles of different libraries. Fig. 2A shows the comparison between frequencies of abundant genes with ≥0.1% content identified from the different pools (ARi and ARiS) of the same library (Lib-2). Although the points were scattered owing to the small sample size, the correlation between the two expression profiles was good (correlation coefficient = 0.94). As listed in Supplementary Table 2, there were many genes for which several clones were obtained from one library but no clone from another library, suggesting that we should take the extent of sampling bias into account when we analyze a small number of samples.
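A comparison of this kind can be sketched as a Pearson correlation between per-gene clone counts in two pools. The text does not state which correlation coefficient was used, so Pearson is an assumption here, and the clone counts below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical clone counts for the same five abundant genes in two pools.
pool_a = [60, 45, 30, 20, 10]
pool_b = [66, 40, 33, 18, 12]

r = pearson_r(pool_a, pool_b)
```

Pearson's r is unchanged when each pool's counts are divided by its total clone number, so raw counts and content percentages give the same coefficient.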
Fig. 2B shows the comparison between frequencies of the different libraries, Lib-1 and Lib-2. Although the order of top 10 genes was different, the correlation was good on the whole (the correlation coefficient = 0.87). These two libraries were constructed from different lots of RNA using slightly different protocols. The variation may result from the difference of RNA lots rather than the difference of protocol. For example, it was reported that the expression level of CRYAB largely varied depending on the conditions of cell culture, such as heat stress38
3.5. Estimation of the total number of genes composing the library
The cumulative number of gene occurrences was plotted as a function of the number of analyzed clones as shown in Fig. 3A. The cumulative number increased asymptotically but did not saturate within the analyzed range. The curve fitting was carried out using six asymptotic models, including negative exponential models and hyperbolic models. As a result, the best fit was obtained with a hyperbolic curve, $D_t = S t^{\alpha} / (\beta + t^{\alpha})$, where $D_t$ denotes the cumulative number of genes for the accumulated number $t$ of sequenced clones, $S$ is an asymptotic value, and $\alpha$ and $\beta$ are parameters to be estimated from data. The obtained $S$ was 14 348 for Lib-1 and 11 563 for Lib-2. Using the value for Lib-2, the cumulative gene number at 24 000 analyses was calculated to be 4578, which is similar to the observed value, 4513. Even if 48 000 and 96 000 clones were analyzed, the cumulative gene number would reach only 6177 (53.4%) and 7717 (66.7%), respectively.
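The reported extrapolations can be cross-checked with a short calculation, assuming the hyperbolic model has the form D_t = S·t^α/(β + t^α). The exponent symbol is not legible in the source and is an assumption here, and the values of α and β below are back-calculated from the reported numbers, not taken from the paper. Under this model, D_t/(S − D_t) = t^α/β, so two reported points determine both parameters and the third serves as a check.

```python
import math

S = 11563.0  # asymptotic gene number reported for Lib-2

# Cumulative gene numbers reported in the text for Lib-2.
reported = {24000: 4578.0, 48000: 6177.0, 96000: 7717.0}


def odds(d):
    """D / (S - D), which equals t**alpha / beta under the assumed model."""
    return d / (S - d)


# Back-calculate alpha and beta from the first two reported points.
t1, t2 = 24000, 48000
alpha = math.log(odds(reported[t2]) / odds(reported[t1])) / math.log(t2 / t1)
beta = t1 ** alpha / odds(reported[t1])


def predicted_genes(t):
    """Cumulative gene number predicted by the assumed hyperbolic model."""
    ta = t ** alpha
    return S * ta / (beta + ta)


# The third reported point (96 000 clones) was not used in the fit.
check_96000 = predicted_genes(96000)
fraction_of_asymptote = check_96000 / S
```

The prediction at 96 000 clones agrees with the reported 7717 to within rounding, which supports the assumed form with an exponent below 1.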
Ida et al.43
3.6. TSSs of abundant transcripts
The 5'-terminal sequence analysis of full-length cDNA clones has the advantage of massive production of TSSs.22
In particular, the full-length content of the vector-capped cDNA library is so high that the distribution of TSSs of transcriptional initiation variants can be determined for abundant transcripts obtained from each library. The full-length content of the top 11 abundant genes was 94.0–100% for each gene cluster. In addition, truncated cDNAs were easily distinguishable, because most of them started at the position in the last exon and had no additional G at the 5' end.
The distributions of TSSs for 11 genes, which were obtained from two libraries and DBTSS,44 were compared as shown in Supplementary Table 3. The most frequent TSS for each gene, except for RPL1 and CRYAB, was identical among the three distributions. EEF1A1 showed almost only one TSS, but generally multiple preferential TSSs were observed in other genes. Fig. 4 shows examples of comparison among distributions of TSSs for GAPDH, ACTG1, and CRYAB. The preferential TSSs formed a cluster in a region of ∼10 nt, presumably owing to the presence of the TATA box upstream of these TSSs. CRYAB also showed widely scattered rare TSSs in the upstream region. In every gene, the distribution patterns of TSSs obtained from the two libraries and DBTSS were similar. It should be noted that the DBTSS data were obtained from various tissues, suggesting that the pattern of TSS distribution shows no tissue specificity in the case of abundant housekeeping genes. DBTSS represented TSSs generated by alternative promoter usage.44 The TSS observed in the first intron of GAPDH and VIM, and two TSSs identified at the position ∼1400 nt upstream of the main TSS cluster of CRYAB, may result from transcripts given by alternative promoter usage.
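Comparing TSS distributions between libraries amounts to tallying the 5'-end positions of the full-length clones for a gene and comparing the most frequent position. A minimal sketch follows, with hypothetical positions given relative to a reference TSS; the function names and data are assumptions for illustration.

```python
from collections import Counter


def tss_distribution(positions):
    """Relative frequency of each TSS position, sorted by position."""
    counts = Counter(positions)
    total = sum(counts.values())
    return {pos: n / total for pos, n in sorted(counts.items())}


def most_frequent_tss(positions):
    """Position used most often (ties are resolved arbitrarily)."""
    return Counter(positions).most_common(1)[0][0]


# Hypothetical 5'-end positions (nt, relative to a reference TSS at 0)
# for clones of one gene in two libraries.
lib1_positions = [0, 0, 0, -2, 3, 0, 5]
lib2_positions = [0, 0, -2, -2, 0, 3]

same_main_tss = (
    most_frequent_tss(lib1_positions) == most_frequent_tss(lib2_positions)
)
lib1_freqs = tss_distribution(lib1_positions)
```

Because truncated clones start in the last exon and lack the additional G, they would be filtered out before this tally in practice.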
3.7. Long-sized transcripts
On the basis of the mRNA size of RefSeq for genes with GeneID in Supplementary Table 2, the gene-based and clone-based average lengths of cDNA inserts of our full-length gene collection were calculated to be 2.46 kb (4378 genes) and 1.68 kb (19 758 clones), respectively. The previous paper showed that the V-capping method could synthesize a long-sized full-length cDNA.30
7 kb, as listed in Table 2. The size of the mRNA described in the RefSeq data was often different from the insert size of our cDNA clone because of size variants generated by alternative splicing or alternative polyadenylation. Therefore, cDNA clones that correspond to RefSeq derived from >6 kb mRNA were selected, and then the real size of the insert was examined by restriction enzyme digestion followed by agarose gel electrophoresis. Table 2 shows the determined size together with the mRNA size of RefSeq. Some clones were fully sequenced and their precise length was listed with an accession number.
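The gene-based and clone-based average lengths differ because the former weights every gene equally, whereas the latter weights each gene by its clone count. A minimal sketch with hypothetical genes and lengths (not data from Supplementary Table 2):

```python
def average_lengths(genes):
    """Return (gene-based mean, clone-based mean) of mRNA length in bp.

    genes: mapping of gene symbol -> (mRNA length in bp, number of clones).
    """
    lengths_and_counts = list(genes.values())
    gene_mean = sum(length for length, _ in lengths_and_counts) / len(
        lengths_and_counts
    )
    total_clones = sum(n for _, n in lengths_and_counts)
    clone_mean = (
        sum(length * n for length, n in lengths_and_counts) / total_clones
    )
    return gene_mean, clone_mean


# Hypothetical example: one short, abundant gene and two long, rare genes.
toy_genes = {
    "GENE_A": (1000, 8),
    "GENE_B": (5000, 1),
    "GENE_C": (3000, 1),
}

gene_mean, clone_mean = average_lengths(toy_genes)
```

A clone-based mean below the gene-based mean, as reported here (1.68 kb versus 2.46 kb), is what one would expect if the most abundant transcripts tend to be short.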
The longest cDNA of 11 199 bp encoded golgin B1 (GOLGB1), which is a Golgi integral membrane protein originally named giantin owing to its huge size of ∼400 kDa.45 When compared with RefSeq, the coding region of this cDNA had two insertions of 15 bp each that resulted in the insertion of a total of 10 amino acid residues, implying an alternative-splicing variant. A total of nine single-nucleotide variations were observed, one of which was the insertion of one nucleotide A into an A stretch at position 2958–2965 in the coding region, causing a frame-shift. This insertion may result from misreading by reverse transcriptase during first-strand synthesis, because sequencing of the GOLGB1 locus of the ARPE-19 genome showed the absence of such an insertion.
Redundant cDNA clones in the long-sized cDNAs of ≥8 kb were filamin A (FLNA) and filamin B (FLNB). As a result of full sequencing of a total of eight FLNA clones, three splicing variants were identified, whose exon–intron structures are depicted in Fig. 5A. V1 was the longest variant and three clones showed an identical TSS. The V2 clones started at the 6th nucleotide downstream of the TSS of V1, and lacked the 29th exon, which caused deletion of eight amino acid residues. V3 had the same TSS as V2, and lacked the region from the middle of exon 36 to the middle of exon 41 of V1, which caused deletion of 305 amino acid residues. RefSeq seems to correspond to V2, but it is doubtful that it actually does because RefSeq is constructed using multiple sequences reported by different researchers. Surprisingly, all four FLNB clones showed different splicing patterns, as shown in Fig. 5B.
3.8. Correlation between the number of isolated full-length clones and mRNA content
The high content of the very long-sized cDNA clones leads us to expect that the present library is unbiased by mRNA size. In order to assess the extent of bias, the mRNA contents for eight genes with different mRNA sizes (1.1–9.5 kb) were measured by real-time PCR. As shown in Fig. 6, the total number of full-length clones obtained from the present libraries correlated well with the content of each mRNA. GUK1 (1.1 kb), RAI14 (5.0 kb), and FLNB (9.5 kb) showed a similar content and were represented by numbers of full-length cDNA clones of the same order of magnitude, independent of their mRNA sizes. Thus, the bias is expected to be low up to 9.5 kb, explaining the fact that a small-sized library composed of only 20 000 clones contained 48 long-sized full-length clones with a cDNA insert of ≥7 kb (Table 2).
3.9. Unannotated transcripts
Of the 4513 full-length transcripts identified, 143 (3.2%) have not been included in Entrez Gene, but 79 of them hit rare ESTs registered in the UniGene EST database. Although 19 clones did not hit any EST sequence, this does not necessarily mean that they were novel genes, because our clone may represent a partial sequence that does not overlap with known ESTs. The unannotated transcripts can be classified into two groups. One group included intergenic transcripts that were mapped to the gene-unoccupied space between known gene loci. Many of them were transcribed from just upstream of the TSS of a known gene in the opposite direction, presumably owing to the presence of a bi-directional promoter.46
| 4. Discussion |
|---|
In-depth analysis of the full-length cDNA libraries constructed using the V-capping method has revealed that these libraries contained full-length cDNA clones with a wide range of sizes up to 11 kb, suggesting that they meet the requirement for size-unbiased libraries. This success may be attributed to the following reasons. (i) Total RNA was used as a template without purifying poly(A) RNA. (ii) The protocol of cDNA synthesis did not include an intact mRNA selection process such as modification of the cap structure.4
The present analysis also revealed that the full-length content was unexpectedly high despite lacking a full-length selection process. The full-length content of abundant transcripts was 94.0–100%. The most probable explanation for this high content is that most mRNA molecules are intact in the cell. However, the degradation of mRNA, especially long-sized mRNA, seemed to occur during the RNA isolation process. The overall full-length content was 95%, and analysis of the remaining 5% incomplete cDNAs showed that ∼10% of them were derived from the degradation product of the long-sized mRNA of >6 kb (data not shown). Although a size bias due to RNA degradation during the course of its isolation is inevitable, this method can faithfully reflect the composition of a given purified RNA sample.
The high full-length content and size-unbiased feature of the library enabled us to determine the fine distribution of TSS of genes expressed in a single type of cell. Recently, methods such as CAGE12 or 5'SAGE13 have been developed to determine the distribution of TSS. For these analyses, a vector-capped library prepared using a modified vector primer for CAGE or 5'SAGE analysis can serve as a starting material. The additional G at the 5' end could ensure the intactness of TSS and provide more precise results.
The vector-capped cDNA library also enabled us to analyze the presence of a cap structure at the 5' end of mRNA. The large-scale TSS analysis revealed that a small fraction of the full-length clones lacked the additional 5'-end G, indicating the absence of a cap structure on the mRNA. The common feature of the 5'-end sequence of these clones was a pyrimidine-rich sequence or an A-starting sequence. Further in-depth analysis of G-free clones suggested that the 5'-end sequence affected the addition or elimination of the cap structure. This finding may partially be explained by the recent finding that the human decapping enzyme Dcp2 preferentially binds to a subset of mRNAs and recognizes sequences at the 5' terminus of the mRNA as a specific substrate.47 Further investigation is required to elucidate the mechanism by which the cap-free mRNA observed in the present study is generated and its biological meaning.
We identified 4513 genes among 19 951 full-length clones isolated from unamplified ARPE-19 cDNA libraries. The total number of genes represented in the present library was estimated to be 8000–14 000, which is consistent with the 10 000 estimated for the transcriptome of RPE using UniGene clusters, SAGE tags and ESTs by Swaroop and Zack.48 The expression profile analyses using a microarray with 12 600 gene probes identified 5634 ± 65 genes for ARPE-19 and 5580 ± 84 genes for human RPE.49 These results suggest that the genes not yet identified are expressed at low levels and may include non-coding and antisense genes. In order to obtain the remaining rare genes, further large-scale analyses in combination with subtraction/normalization are required.
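An estimate of the total number of genes from a sample of clones depends mainly on how many genes were observed only once or twice. A minimal sketch of one widely used richness estimator, Chao1, is given below. It is shown only as an example of this kind of calculation and is not necessarily the estimator that produced the range quoted above; the function name and the toy clone counts are hypothetical.

```python
from collections import Counter


def chao1(clone_counts):
    """Chao1 lower-bound estimate of the total number of genes.

    clone_counts: number of clones observed for each detected gene
                  (every entry must be >= 1).
    """
    s_obs = len(clone_counts)          # genes actually observed
    freq = Counter(clone_counts)
    f1, f2 = freq[1], freq[2]          # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used here when no doubletons are present.
    return s_obs + f1 * (f1 - 1) / 2


# Hypothetical toy sample: 8 genes, 4 singletons, 2 doubletons.
toy_counts = [5, 3, 2, 2, 1, 1, 1, 1]
print(chao1(toy_counts))  # 8 + 4*4 / (2*2) = 12.0
```

The estimate grows with the number of singletons, which matches the expectation that a library containing many genes seen only once still holds many undetected rare genes.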
Recent investigations have shown that an unexpectedly large number of alternative splicing variants exist23 and that some variants are expressed in a cell-specific manner.50 In order to analyze a transcriptional network in a single type of cell, it is necessary to measure the expression levels of these variants. When alternative splicing occurs at multiple sites, it is difficult to quantify each variant by the RT–PCR method, which measures only the expression level of a limited region of the mRNA. The most precise method for determining the expression level of a splicing variant is to count the number of full-length cDNA clones for each variant. This is a particular requirement for long-sized full-length transcripts. The present study demonstrated that multiple splicing variants of long-sized genes such as FLNA and FLNB were expressed in a single type of cell and that the V-capping method was able to provide a genuine full-length cDNA for each variant. The physiological role of each variant remains to be elucidated.
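The difference between measuring a limited region and counting full-length clones can be illustrated with a small sketch. Each clone is reduced to the inclusion status of two alternative exons; the two toy libraries, the site order and the function name `summarise` are hypothetical and serve only to show that identical per-site inclusion rates can hide different combinations of splicing events.

```python
from collections import Counter


def summarise(clones):
    """Summarise splicing patterns of full-length clones of one gene.

    clones: list of tuples of booleans, one tuple per clone, giving the
            inclusion status of each alternative exon (hypothetical format).
    Returns (joint, per_site): counts of each full-length pattern and the
    inclusion rate of each site considered on its own.
    """
    joint = Counter(clones)
    n = len(clones)
    n_sites = len(clones[0])
    per_site = tuple(sum(c[i] for c in clones) / n for i in range(n_sites))
    return joint, per_site


# Library 1: both exons are always included or skipped together.
lib1 = [(True, True), (True, True), (False, False), (False, False)]
# Library 2: the two exons are mutually exclusive.
lib2 = [(True, False), (True, False), (False, True), (False, True)]

joint1, sites1 = summarise(lib1)
joint2, sites2 = summarise(lib2)
print(sites1, sites2)   # per-site rates are the same in both libraries
print(joint1, joint2)   # full-length patterns differ
```

A per-site measurement corresponds to `per_site`, whereas counting full-length clones corresponds to `joint`; only the latter separates the two libraries.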
One of the characteristics of the V-capping method is the use of a vector primer with a relatively long dT tail, which has the following merits compared with the oligo dT primer used in conventional methods: (i) the unidirectional insertion of the cDNA is guaranteed, which makes it easy to identify an antisense transcript against a known gene; (ii) it does not prime at a short A-stretch within the mRNA and therefore does not generate the 3'-truncated cDNAs that were occasionally observed in the RefSeq database (e.g. N4BP2 in Table 2). However, we should keep in mind the possibility that the vector primer could fail to capture mRNA with a short poly(A) tail.
The remaining challenge in analyzing the vector-capped cDNA library is to develop a method for high-throughput single-pass sequencing. For the analysis of >100 000 clones, the present single-pass sequencing strategy is not realistic because of its high cost and the amount of work involved. However, because analysis of the present high-quality full-length cDNA library is expected to enable the discovery of rare or long-sized full-length cDNA clones and fine expression profiling of variants, it makes sense to analyze more than a million clones, corresponding to the total number of clones in one library. Thus, a novel method for massive single-pass sequencing is desired, for example, one that applies recently developed technologies.51 Furthermore, in order to achieve a comprehensive collection of full-length transcripts, abundant clones should be removed by subtraction.
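The relationship between sequencing depth and the chance of recovering a rare clone can be sketched with a simple sampling model. The sketch below assumes that clones are picked independently and at random from an unamplified library, so that a transcript present at a given frequency is missed in n clones with probability (1 − frequency)^n. The function names and the example frequency are hypothetical; the depths correspond to the 24 000 clones picked in this study and the larger scales discussed above.

```python
import math


def detection_probability(frequency, n_clones):
    """Probability of picking at least one clone of a transcript that makes
    up `frequency` of the library when `n_clones` are sampled at random."""
    return 1.0 - (1.0 - frequency) ** n_clones


def clones_needed(frequency, target_probability):
    """Smallest number of clones that gives at least `target_probability`
    of detecting a transcript present at `frequency`."""
    if not 0.0 < frequency < 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be between 0 and 1")
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - frequency)
    )


# Hypothetical transcript making up 1 in 100 000 clones of the library.
rare = 1e-5
for depth in (24_000, 100_000, 1_000_000):
    print(depth, round(detection_probability(rare, depth), 3))
```

Under this model the detection probability of such a transcript increases steeply between the depth of the present study and a million clones, which is the quantitative basis for preferring deeper sequencing or prior subtraction of abundant clones.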
In conclusion, a full-length cDNA library constructed with the V-capping method has been shown to meet the requirements for a size-unbiased library, and thus is expected to be suitable for the comprehensive collection of full-length transcripts and their fine expression profiling. Recently, Miura et al.52 analyzed full-length cDNA libraries constructed from budding yeast using the V-capping method and identified novel full-length transcripts, including splicing variants and antisense transcripts. Even for well-characterized organisms such as yeast, in-depth analysis of a vector-capped cDNA library has thus been shown to yield novel full-length transcripts. To analyze the transcriptome of not only uncharacterized organisms but also well-characterized ones such as human and mouse, the V-capping method will become the first choice for constructing a full-length cDNA library.
| Supplementary Data |
|---|
Supplementary data are available online at www.dnaresearch.oxfordjournals.org.
| Funding |
|---|
This work was partly supported by a grant for the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology. Funding for open access charge: Open access charge will be paid by Seishi Kato.
| Acknowledgement |
|---|
We thank Takumi Nonami for technical assistance with the full sequencing of long-sized clones.
| Footnotes |
|---|
* To whom correspondence should be addressed. Tel. +81 4-2995-3100. Fax. +81 4-2995-3132. E-mail: seishi{at}rehab.go.jp
| References |
|---|
- Lander E. S., Linton L. M., Birren B., et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
- Venter J. C., Adams M. D., Myers E. W., et al. The sequence of the human genome. Science (2001) 291:1304–1351.
- International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature (2004) 431:931–945.
- Kato S., Sekine S., Oh S.-W., et al. Construction of a human full-length cDNA bank. Gene (1994) 150:243–250.
- Carninci P., Kvam C., Kitamura A., et al. High-efficiency full-length cDNA cloning by biotinylated CAP trapper. Genomics (1996) 37:327–336.
- Suzuki Y., Yoshitomo-Nakagawa K., Maruyama K., Suyama A., Sugano S. Construction and characterization of a full length-enriched and a 5'-end-enriched cDNA library. Gene (1997) 200:149–156.
- Zhu Y. Y., Machleder E. M., Chenchik A., Li R., Siebert P. D. Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. Biotechniques (2001) 30:892–897.
- Schena M., Shalon D., Davis R. W., Brown P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science (1995) 270:467–470.
- Lockhart D. J., Dong H., Byrne M. C., et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. (1996) 14:1675–1680.
- Okubo K., Hori N., Matoba R., et al. Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nat. Genet. (1992) 2:173–179.
- Velculescu V. E., Zhang L., Vogelstein B., Kinzler K. W. Serial analysis of gene expression. Science (1995) 270:484–487.
- Shiraki T., Kondo S., Katayama S., et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. USA (2003) 100:15776–15781.
- Hashimoto S., Suzuki Y., Kasai Y., et al. 5'-end SAGE for the analysis of transcriptional start sites. Nat. Biotechnol. (2004) 22:1146–1149.
- Harbers M., Carninci P. Tag-based approaches for transcriptome research and genome annotation. Nat. Methods (2005) 2:495–502.
- Ota T., Suzuki Y., Nishikawa T., et al. Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat. Genet. (2004) 36:40–45.
- Imanishi T., Itoh T., Suzuki Y., et al. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. (2004) 2:e162.
- Gerhard D. S., Wagner L., Feingold E. A., et al. The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. (2004) 14:2121–2127.
- Kawamoto S., Yoshii J., Mizuno K., et al. BodyMap: a collection of 3' ESTs for analysis of human gene expression information. Genome Res. (2000) 10:1817–1827.
- Edgar R., Domrachev M., Lash A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. (2002) 30:207–210.
- Brazma A., Parkinson H., Sarkans U., et al. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. (2003) 31:68–71.
- Kimura K., Wakamatsu A., Suzuki Y., et al. Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res. (2006) 16:55–65.
- Suzuki Y., Taira H., Tsunoda T., et al. Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep. (2001) 2:388–393.
- Modrek B., Resch A., Grasso C., Lee C. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. (2001) 29:2850–2859.
- Beaudoing E., Freier S., Wyatt J. R., Claverie J. M., Gautheret D. Patterns of variant polyadenylation signal usage in human genes. Genome Res. (2000) 10:1001–1010.
- Yelin R., Dahary D., Sorek R., et al. Widespread occurrence of antisense transcription in the human genome. Nat. Biotechnol. (2003) 21:379–386.
- Chen J., Sun M., Kent W. J., et al. Over 20% of human transcripts might form sense-antisense pairs. Nucleic Acids Res. (2004) 32:4812–4820.
- Schuler G. D. Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. (1997) 75:694–698.
- Kampa D., Cheng J., Kapranov P., et al. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. (2004) 14:331–342.
- Bertone P., Stolc V., Royce T. E., et al. Global identification of human transcribed sequences with genome tiling arrays. Science (2004) 306:2242–2246.
- Kato S., Ohtoko K., Ohtake H., Kimura T. Vector-capping: a simple method for preparing a high-quality full-length cDNA library. DNA Res. (2005) 12:53–62.
- Okayama H., Berg P. High-efficiency cloning of full-length cDNA. Mol. Cell. Biol. (1982) 2:161–170.
- Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.
- Schuler G. D., Epstein J. A., Ohkawa H., Kans J. A. Entrez: molecular biology database and retrieval system. Methods Enzymol. (1996) 266:141–162.
- Pruitt K. D., Tatusova T., Maglott D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. (2007) 35:D61–D65.
- Chao A. Species richness estimation. Balakrishnan N., Read C. B., Vidakovic B., eds. (2005) Vol. 12. Wiley: New York. 7907–7916. Species Estimation and Applications. Encyclopedia of Statistical Sciences.
- Chao A., Lee S.-M. Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. (1992) 87:210–217.
- Shen T.-J., Chao A., Lin C.-F. Predicting the number of new species in further taxonomic sampling. Ecology (2003) 84:798–804.
- Klemenz R., Frohli E., Steiger R. H., Schafer R., Aoyama A. Alpha B-crystallin is a small heat shock protein. Proc. Natl. Acad. Sci. USA (1991) 88:3652–3656.
- Dasgupta S., Hohman T. C., Carper D. Hypertonic stress induces alpha B-crystallin expression. Exp. Eye Res. (1992) 54:461–470.
- Chao C. C., Yam W. C., Lin-Chao S. Coordinated induction of two unrelated glucose-regulated protein genes by a calcium ionophore: human BiP/GRP78 and GAPDH. Biochem. Biophys. Res. Commun. (1990) 171:431–438.
- Nasrin N., Ercolani L., Denaro M., Kong X. F., Kang I., Alexander M. An insulin response element in the glyceraldehyde-3-phosphate dehydrogenase gene binds a nuclear protein induced by insulin in cultured cells and by nutritional manipulations in vivo. Proc. Natl. Acad. Sci. USA (1990) 87:5273–5277.
- Graven K. K., Troxler R. F., Kornfeld H., Panchenko M. V., Farber H. W. Regulation of endothelial cell glyceraldehyde-3-phosphate dehydrogenase expression by hypoxia. J. Biol. Chem. (1994) 269:24446–24453.
- Ida H., Boylan S. A., Weigel A. L., et al. EST analysis of mouse retina and RPE/choroid cDNA libraries. Mol. Vis. (2004) 10:439–444.
- Yamashita R., Suzuki Y., Wakaguri H., Tsuritani K., Nakai K., Sugano S. DBTSS: DataBase of human transcription start sites, progress report 2006. Nucleic Acids Res. (2006) 34:D86–D89.
- Linstedt A. D., Hauri H. P. Giantin, a novel conserved Golgi membrane protein containing a cytoplasmic domain of at least 350 kDa. Mol. Biol. Cell (1993) 4:679–693.
- Trinklein N. D., Aldred S. F., Hartman S. J., Schroeder D. I., Otillar R. P., Myers R. M. An abundance of bidirectional promoters in the human genome. Genome Res. (2004) 14:62–66.
- Li Y., Song M. G., Kiledjian M. Transcript-specific decapping and regulated stability by the human Dcp2 decapping protein. Mol. Cell. Biol. (2008) 28:939–948.
- Swaroop A., Zack D. J. Transcriptome analysis of the retina. Genome Biol. (2002) 3:reviews1022.1–1022.4.
- Cai H., Del Priore L. V. Gene expression profile of cultured adult compared to immortalized human RPE. Mol. Vis. (2006) 12:1–14.
- Xu Q., Modrek B., Lee C. Genome-wide detection of tissue-specific alternative splicing in the human transcriptome. Nucleic Acids Res. (2002) 30:3754–3766.
- Metzker M. L. Emerging technologies in DNA sequencing. Genome Res. (2005) 15:1767–1776.




