Skip Navigation


DNA Research Advance Access originally published online on November 14, 2006
DNA Research 2006 13(5):229-243; doi:10.1093/dnares/dsl011
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
13/5/229    most recent
dsl011v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrow Request Permissions
Google Scholar
Right arrow Articles by Noh, S.-J.
Right arrow Articles by Hur, C.-G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Noh, S.-J.
Right arrow Articles by Hur, C.-G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Kazusa DNA Research Institute.
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

TISA: Tissue-specific Alternative Splicing in Human and Mouse Genes

Seung-Jae Noh, Kyooyeol Lee, Hyojung Paik and Cheol-Goo Hur*

Bioinformatics Lab. Plant genomics center KRIBB, 52 Eoeun-dong, Yuseong-gu, Daejon, 305-333 Korea

Received 5 July 2006; revised 15 October 2006


    Abstract
 Top
 Abstract
 1. Introduction
 2 Materials and Methods
 3 Results and Discussion
 Acknowledgements
 REFERENCES
 
Alternative splicing (AS) is a mechanism by which multiple transcripts are produced from a single gene and is thought to be an important mechanism for tissue-specific expression of transcript isoforms. Here, we report a novel graphing method for transcript reconstruction and statistical prediction of tissue-specific AS. We applied three selection steps to generate the splice graph and predict the transcript isoforms: (i) a custom scoring rule for exon/intron sets, (ii) binomial statistics for selecting valid alternative splicing with a frequency of at least 1% for the predominant form and (iii) evaluation of transcript structure. We obtained 97 286 and 66 022 valid transcripts from 26 143 human and 27 741 mouse genes, respectively. In addition, we discovered 33 481 AS events for nine types of AS patterns in human. The statistical significance of tissue specificity for each gene, transcript and AS event was assessed based on EST tissue information, followed by a multiple testing correction procedure. In human, 12 711 genes, 16 016 transcripts and 1035 AS events were predicted to be tissue-specific (false discovery rate <0.01). This information on genes, transcript structures, AS events and their tissue specificities in human and mouse are freely accessible on the TISA website (http://tisa.kribb.re.kr/AGC/).

Key words: alternative splicing; genome-based clustering; splice graph; transcript variants; tissue-specific; gene ontology


    1. Introduction
 Top
 Abstract
 1. Introduction
 2 Materials and Methods
 3 Results and Discussion
 Acknowledgements
 REFERENCES
 
Alternative pre-mRNA splicing (AS) is a biological mechanism for producing multiple transcripts from a single gene. Genomic research has revealed that AS occurs in 30–60% of human genes.1Go–4Go Due to AS, higher eukaryotes can generate a number of protein-encoding transcripts from a single gene, thereby increasing the diversity of their proteomes.5Go Although proteins from one gene have the same or similar functions, they can have different characteristics, such as their binding properties, intracellular localization, enzymatic activity, regulation and stability.6Go Thus, knowledge of AS is important for understanding the complex biological systems in higher eukaryotes. Furthermore, AS in conjunction with transcriptional regulation determines tissue-specific expression of transcript isoforms of a gene.7Go One of the effects of AS is to control of RNA isoform stability by nonsense-mediated decay.8Go,9Go AS is of interest to pharmaceutical research because aberrant AS can lead to various genetic diseases and cancers.10Go,11Go

Several different strategies have been applied to AS analysis, including (i) EST mapping against mRNA12Go, (ii) mRNA/EST/protein mapping to the genome2Go–4Go,13Go–15Go, (iii) splicing microarray analysis16Go–18Go and (iv) ab initio machine learning approaches.19Go–21Go Among these methods, genome-based mRNA/EST alignment has been an important methodology for in silico detection of AS because ESTs and mRNAs are the most abundant resources and are considered direct evidence of the expressed genes. A number of algorithms to analyse AS using genomic alignment of expressed sequences have been reported. Kan et al.2Go classified links between exons within a gene according to four types of connectivity, represented them as a matrix, and then constructed transcripts using matrix analysis to identify the paths that can be connected.

Recently, several studies have suggested graphical methods to represent gene structure.4Go,22Go–25Go In these methods, exons and introns are verified from structural information obtained by genomic mapping, and splice graphs are constructed in which exons and introns constitute nodes and edges, respectively; however, generating all possible isoforms by graph traversal can produce many false transcripts. Lee's group22Go,26Go tried to overcome this problem by using a dynamic programming method. They suggested use of the heaviest bundling algorithm, in which a transcript with only partial alignment information is extended to the maximum length of the path with the maximum likelihood. On the other hand, in ECgene, Kim et al. discriminated the isoforms into three groups by considering the minimum set of sequence data that indicates the reliability of the corresponding transcript.

AS makes it possible to generate genetic products with different functions from a single gene. This process is controlled by numerous splicing regulatory factors, including serine/arginine-rich proteins that can bind to cognate sequence elements in mRNA precursor such as exonic splicing enhancer, exonic splicing silencer, intron splicing enhancer and intronic splicing silencer sequences, and regulate the splice site selections.27Go The outcome of the AS process can vary depending on several factors, such as the tissue, the developmental stage or the environmental conditions.1Go,5Go,28Go Therefore, global studies of condition-specific AS are required to precisely understand the function of genes. Xu et al.29Go examined tissue-specific AS using Bayesian statistics and EST tissue information, which focuses on only single AS events rather than on the transcript levels. Because each transcript may be the result of many sequential AS events, it is difficult to correlate the tissue specificity of an individual AS event with that of the transcript. In this study, we provide a methodology and results for reconstructing and classifying transcript variants from a splice graph that contains only valid AS sets according to binomial statistics (i.e. an observed frequency of >1% with a 95% confidence level). Moreover, we provide global information on the tissue specificity of transcript isoforms and the distinct AS events according to the organizational information from human and mouse ESTs.


    2 Materials and Methods
 Top
 Abstract
 1. Introduction
 2 Materials and Methods
 3 Results and Discussion
 Acknowledgements
 REFERENCES
 
2.1 Data preparation
Human and mouse mRNAs deposited in GenBank (2006 June version) were retrieved using the following search term: ‘human[Organism] AND biomol mRNA [Properties] NOT gbdiv est[Properties]’. More than 7.5 million human ESTs and ~4.6 million mouse ESTs were collected from dbEST (2006 June version). Repeat-masked genomic contig sequences of human and mouse were obtained from NCBI (Build 36 for human and mouse). The cDNA library information for human and mouse was downloaded from the Cancer Genome Anatomy Project (CGAP) website (http://cgap.nci.nih.gov/). Using CGAP library information, we classified all ESTs from 8616 human cDNA libraries into 46 tissues, except for ‘embryonic tissues’, ‘head and neck’, ‘pooled tissue’, ‘uncharacterized tissue’ and ‘whole body’. We also separated the histological status of the ESTs into ‘normal’, ‘cancer’, or ‘unknown’ according to the CGAP histology classification. We used only ESTs from the ‘normal’ library for analysis of tissue specificity. Gene annotation including gene ontology (GO) assignments was performed using GeneKeyDB.30Go

2.2 Pre-processing and genomic mapping
We generated a non-redundant set of contigs with known chromosomal locations and orientations and used it as a template for genomic alignment. PolyA stretches of mRNA/ESTs were removed, and the EST sequences were trimmed by 20 nt at both ends as previously described.4Go The presence of a polyA tail or polyA addition signal(s) for each query was recorded and used later for the detection of the gene terminus. We prepared a non-redundant set of mRNAs by running the ‘nrdb’ program. Expressed sequences longer than 100 nt were positioned in the genome by homology-based comparison using BLAT31Go and sim432Go. After fast alignment using BLAT, top five best matches for a query were refined by sim4 alignment using the genomic region including 10kb flanking regions at both ends of the original BLAT-mapped position. We saved only sequences with >60% coverage and >92% identity with ESTs or >96% identity with mRNAs (Fig. 1). If several genomic loci satisfied the criteria for a single query, the best match was regarded as the proper alignment.


Figure 1
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1. Flow chart of the identification of the TISA. The results for each step of the human TISA analysis are shown.

 
2.3 Genomic clustering
By scanning each chromosome from the start, we compared the genomic alignment structures of full-length mRNAs whose alignment region overlapped with clustered mRNAs that share at least one common splice site. If the alignment showed a distinct structure, we created a new cluster. From the first cluster, we constructed a preliminary reconstructed structure (RS) that corresponds to a gene unit composed of exon/intron sets and their linkage relationships. During the clustering process, we extended the gene structure whenever new exons or introns were recognized. After clustering the full-length mRNAs, partial mRNAs and ESTs aligned within the preliminary gene boundary were retrieved and compared with the RS structure. We ignored clustering in the genomic region without full-length mRNA mapping, even though many ESTs or partial mRNAs were aligned in that region. We allowed all possible splice sequences as well as canonical splice sequences (CSSs; i.e. GT-AT, GC-AG and AT-AC); however, in the case of non-CSSs, we approved them only if they were confirmed by at least five ESTs. The query orientation along the genome was re-evaluated based on the splice site sequence of the query. For example, if a splice sequence set of a given query was mainly composed of GT-AG, it is obvious that the query orientation along the genome was forward. If a splice sequence set was mainly composed of CT-AC, the query was aligned in a reverse direction. The orientation of a gene was determined as the same as the query. Comparisons of the splice site position and its orientation were used as the main criteria for deciding whether a query should be included in a given cluster.

2.4 Establishment of exon and intron sets
To measure the confidence of RS exons and introns, we applied a custom scoring scheme. The reliability of RS exons depends on (i) whether the splice sequences of introns linked at both ends of the exon are CSS or non-CSS and (ii) how many queries support the given exon or intron structure. For example, the exons linked with CSS introns at both ends and supported by a number of mRNAs and ESTs have high scores. Because full-length mRNAs provide more reliable evidence than ESTs, we imposed a double score for the exons and introns derived from full-length mRNAs. The group of RS exons/introns whose hard-edge boundaries are within ±6 nt were merged into a single exon or intron and rescored. In the process of RS exon/intron merging, the exon or intron with the best score was selected as a representative. Finally, RS exons with scores higher than the cutoff were saved and used for linkage analysis. The minimal requirements for saving an exon were that it should be supported by at least one EST whose boundaries are linked to CSS introns at both ends. For example, if an internal exon is supported by two ESTs and is linked with non-CSS introns at both ends, its exon score will be 0.4 and 0.4 for the score of non-CSS linked exon boundary is given as 0.2. Because the score of this exon is lower than the cutoff value (for first exon, [–, 1.0]; internal exon, [1.0, 1.0]; last exon, [1.0, –]), it was discarded from the RS exon set. The same rule was applied to the intron scoring and selection (cutoff score = 1.0). We discarded internal exons shorter than 10 bp and introns shorter than 20 bp.

2.5 Evaluation and selection of alternative linkage relationships
We evaluated each mutually exclusive alternative linkage relationship using the exact binomial probability as previously reported.33Go Briefly, k is the number of ESTs supporting a particular alternative splice Y, and n is the total number of EST supporting all of mutually exclusive splice patterns including Y. The observed frequency of Y (= k/n) is tested to see if it is higher than a given threshold frequency f with statistical significance. The binomial probability Formula indicates the probability that alternative splice Y occurs at least k times in n trials with a given frequency f. This can be a good estimation of whether the observed frequency of Y exceeds a given threshold frequency relative to all of the alternative splices in mutual relationships. If Formula is <0.05, then the observed frequency of the alternative splice is expected to exceed the given threshold with high (>95%) confidence. The formula used for the calculation is as follows:

Formula 1(1)

We applied a 1% frequency threshold (f = 0.01) and a 95% CI as cutoff criteria. We allowed only the alternative linkage relationships satisfying the above condition in the splice graph.

2.6 Transcript reconstruction and classification
After establishing the non-redundant exon/intron sets statistically verified as described above, we found all possible paths by traveling the nodes (i.e. exons) through edges (i.e. introns) in a splice graph starting from the first exon (eTag = 1) and finishing at the last exon (eTag = 3). From each path, we reconstructed a transcript structure in a given gene. For each transcript, we compared its genomic structure one by one with all of query structures clustered in a gene and arranged the list of queries whose structure do not collide with the transcript structure (Fig. 2C). Then, we checked the coverage of each transcript structure supported by non-colliding queries and subdivided the transcripts into three classes. In Class I, all of the exons and introns in a transcript were supported by query structure(s) or its combinations. As for Class II, at least one of exons or introns in a transcript was not validated. In Class III, there is no supporting evidence for a transcript. Class I transcripts represent most reliable set of transcripts generated from the splice graph. Class III transcripts are least reliable in which all the exon–exon connections are only theoretically possible in a graph.


Figure 2
View larger version (34K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2. Schematic diagram of the genomic cluster, splice graph and the reconstruction and classification of transcript variants in human SCARF1 gene. In this gene cluster, 13 mRNAs (full-length: dark green, partial: bright green) and 32 ESTs (grey) were clustered as a single gene unit. There are seven AS events in total. Among them, only four AS events are valid and finally remained in the splice graph. (A) The gene structure and alternative linkage relationships were deduced from the genomic alignment structures of 45 queries. Exonic regions are indicated with solid boxes. Alternative splice borders and alternative exonic regions are also represented as dashed lines and dashed boxes. The two purple vertical lines in the last exon represent polyA addition signals (i.e. AATAAA). The ‘A’ at the end of alignments for some of queries represents the presence of a poly A tail in the original untrimmed query. (B) Full splice graph. The boxes with numbers indicate exons and the arrows represent introns. Statistically unverified alternative linkages and corresponding exons were represented as dashed boxes and dashed lines respectively. (C) Reconstruction of transcript variants and their classification. A total of eight transcripts can be generated by all possible traversals from the splice graph. The supporting queries for each transcript isoform are indicated. For all the eight transcript structures 23 queries are common; however, those only cover 5' part of the gene and the last exon. After examination of the supporting query structures for each isoform, each transcript isoform was classified into Classes I, II or III. In this example, there is no Class III transcript. Exons and introns not supported by clustered queries for a given transcript in Class II are represented as blank boxes and dashed lines, respectively. For example, in transcript isoform 7 and 8, which are supported by only common queries, the internal regions are not supported by the queries. Thus, these are represented by dashed lines and blank boxes, respectively. Query 22, 25, 26 could be not included in any of the isoforms and was therefore treated as singletons.

 
2.7 Statistical analyses for tissue specificity
If a particular gene has a large number of ESTs derived from a specific tissue, we assumed that the gene is likely to be expressed at a higher level in that tissue. We applied the statistical analysis designed by Audic and Claverie34Go to analyse the tissue specificity of all Class I transcripts and genes. We performed Fisher's exact test to assess the tissue specificity of AS events.

2.7.1 Audic's test
In Audic's test, the number of successfully clustered ESTs from all genes is N, the number of ESTs from a specific tissue is N2, the number of ESTs from all other tissues is N1, the total number of ESTs clustered on a given gene is n, the number of ESTs from a specific tissue in a given gene is y, and the number of ESTs from all other tissues in a given gene is x. Thus, N1 + N2 = N and x + y = n. We set the null hypothesis, Ho, i.e. there is no difference between the ratio of the specific tissue count for a gene and the number for ESTs in that tissue (y/N2) and the ratio of all other tissue counts in a gene and the number of ESTs in all other tissues (x/N1). Given that x was derived from N1, the conditional probability that y would be derived from N2 is as follows:34Go

Formula 2(2)
The P-value was calculated using the following equation:

Formula 2(3)

If the P-value is <0.01, we deny the null hypothesis with 99% confidence, and the EST count in that tissue for a given gene is significantly different than the EST count in all other tissues, indicating that the gene is expressed at a higher level in that tissue.

2.7.2 Benjamini–Yekutieli method for multiple test correction
To correct significance estimates for multiple hypothesis testing, we used the false discovery rate (FDR) method of Benjamini and Yekutieli,35Go which calculates a new test statistic that estimates the portion of Type I errors within a group of hypotheses meeting a significance cutoff. The FDR is the expected proportion of erroneously rejected null hypotheses among the rejected ones. The BY procedure is performed as follows. The individual P-values are sorted from smallest to largest: P(1) ≤ P(2) ≤ ··· ≤ P(m). Starting from the largest P-value (P(m)), P(i) is compared with a i/m/(1/1+1/2+···+1/m), where {alpha} is a desired level of FDR. This is continued as long as P(i) > {alpha}i/m/(1/1+1/2+···+1/m). Next, k is set as the first time P(i) ≤ {alpha}i/m/(1/1+1/2+···+1/m). The hypotheses corresponding to the smallest k P-values are then declared to be significant. We set 0.01 as the desired value of {alpha} and rejected the k hypotheses with P-values less than P(k) in each evaluation of tissue specificity. The Benjamini–Yekutieli method considers the random dependence between the tests. We reasoned that the tissue specificity of the gene (or transcripts) in a particular tissue can be affected by the tissue specificities in other tissues for a given gene.

2.8 GO analysis
The full list of GO terms was downloaded from the Gene Ontology Consortium, and the directed acyclic graph structure of GO was rebuilt as similar with AmiGO (http://godatabase.org/). For each GO term, all associated child GO terms in the downstream branch nodes were gathered recursively by searching the whole GO tree structure for related entries with either an ‘is a’ or a ‘part of’ relationship. A total of 4134 GO terms were found to be associated with 9612 alternatively spliced human genes. Fisher's exact test using a 2 x 2 contingency table was applied to investigate whether there was any bias for a certain AS pattern amongst the nine AS patterns for a given GO term.36Go In the table, the two columns are ‘genes with the AS pattern’ and ‘genes without the AS pattern,’ and the two rows are ‘having the GO term’ and ‘not having the GO term.’ Similarly, we carried out tests for the enrichment of specific GO terms in the group of tissue-specific genes or transcripts for each tissue. In these cases, the two columns are ‘genes (or transcripts), tissue-specific in a certain tissue’ and ‘genes (or transcripts), tissue-specific in all other tissues’, and the two rows are ‘having the GO term’ and ‘not having the GO term.’


    3 Results and Discussion
 Top
 Abstract
 1. Introduction
 2 Materials and Methods
 3 Results and Discussion
 Acknowledgements
 REFERENCES
 
The overall process and the results of each step for human are summarized in Fig. 1. The process can be divided into three main steps: (i) construction of a splice graph based on genomic alignment of mRNA/EST; (ii) reconstruction of transcript variants and AS pattern analysis from the splice graph; and (iii) statistical analysis of the tissue specificity of the genes, transcripts and AS events using the tissue categories of the supporting ESTs.

3.1 Splice graph construction from the genomic alignments of mRNAs/ESTs
Using genome-based mRNA/EST mapping, we successfully localized ~90% of mRNA/ESTs in the genome. We developed our own clustering program using genomically aligned full-length mRNA as the standard. Because our study focuses on AS analysis and forecasting of tissue-specific AS rather than on gene prediction, we conducted clustering in the genomic region with full-length mRNA mapping. Many partial mRNAs and ESTs aligned with the genome were discarded because a lack of full-length mRNA data. As a result, 178 807 (74%) mRNAs and 5 605 025 (72%) ESTs were finally incorporated into the gene clusters of the human genome. A total of 89 814 full-length mRNAs corresponded to 26 143 genes (Fig. 1). Multi-exon ESTs accounted for 58% (3 261 926) of all clustered ESTs.

RS is composed of exon/intron sets and exon–exon linkage relationships. Based on the genomic structure of standard mRNA in a cluster, newly added mRNAs or ESTs were included into the RS if their alignment structures shared at least one splice site with the RS. We scored each exon and intron using our scoring rule. The exon score is an index of the confidence of the exon. We controlled the score so that the exon has a higher splice junction score when it has more supporting queries (mRNA/ESTs) and when the introns connected to the end of the exon have CSSs (GT-AG, GC-AG and AT-AC) that are more reliable. We gave differential weights to full-length mRNAs and ESTs, considering the qualitative difference in their sequences. This rule was also applied to intron scoring. Our algorithm also allowed non-CSS introns if they were supported by at least five ESTs. Exons or introns whose boundaries were within ±6 bp at both ends were merged into a single exon or intron, and its score was recalculated. The exon or intron with the best score was chosen as representative. The merging process corrects erroneous splicing that might occur during BLAT mapping or as a result of poor sequence quality of ESTs. Consequently, we were able to obtain reliable non-redundant exon/intron sets by exon/intron merging, rescoring and selection by using a cutoff score. Based on these steps, we constructed the splice graph and conducted the AS analysis.

3.2 Splice graph analysis
Our splice graph algorithm for detecting AS is essentially the same as previously reported.24Go,25Go The largest differences are in (i) the application of the selection step for statistically reliable alternative linkage relationships during splice graph construction and (ii) the classification of transcript variants.

3.2.1 Selection of statistically reliable alternative splices
If all of the alternative splices that can be obtained using erroneous EST data are allowed, one or more alternatively spliced forms could be detected for nearly all genes examined as long as they have sufficient EST coverage. Therefore, a method is needed to discriminate the true isoforms from erroneous splicing resulting from aberrant splicing or misalignment. Kan et al.33Go suggested a statistical method to solve this problem, wherein alternative splices are selected according to the frequency at which they occur with a high confidence level. One example is shown in Fig. 2. There are 45 supporting queries (13 mRNAs and 32 ESTs) for the human SCARF1 (scavenger receptor Class F, member 1) gene cluster. Among 13 mRNAs queries, 12 are full-length. Seven AS events can be defined from the comparisons of their alignment structures: two alternative first exon events (E1-E3 versus E2-E3, E4-E6 versus E5-E6), complex splice site variation (E7-E13 versus E12-E13), two 3'-splice variations (E8-E10 versus E8-E11, E15-E16 versus E15-E17), intron retention (E8-E10 versus E9) and alternative last exon (E13-E15 versus E14). The seven AS events and the number of sequences supporting each isoform are summarized in Table 2. Four of them (3, 4, 5 and 7) have sufficient sequence evidence for the minor isoform relative to the major isoform. AS events 1, 2 and 6, however, have only one supporting query for each, whereas the query count for the each counterpart major isoform is much larger, yielding a low observed frequency for the minor isoform. To determine whether the significance of the observed frequency (for example, 1/33 for AS event 1) of the minor isoform is higher than the threshold (frequency of 1%) or not, we calculated P-value adopting binomial statistics. The binomial probability of each AS event was calculated based on the number of supporting queries for the isoform. As shown in Table 2, the P-value for the minor isoform of AS event 1 exceeding the given threshold (1%) was 0.282. As similar, P-values for AS events 2 and 6 show the value higher than the threshold, whereas the P-values of the other four AS events were under 0.05 (i.e. 95% confidence). Thus, these three events were excluded from the splice graph (Fig. 2B).

3.2.2 Reconstruction of transcript isoforms and their classification
After the selection of a reliable alternative splice set, we built the adjacency list for the directed acyclic graph and constructed all possible combinations of paths through graph traversal as described previously.24Go,25Go Each path represents a transcript variant. By comparing the alignment structures of each transcript and its supporting queries, we subdivided the transcript isoforms as follows: Class I, all of the exons and introns in the transcript were supported by query structures; Class II, some of exons or introns were not validated; and Class III, none of the exons or introns had supporting evidence (Fig. 2C). We also calculated the total score of each transcript based on the number of supporting queries, considering the exon/intron coverage and sorted transcripts according to their scores. In the example of Fig. 2, a total of eight transcripts were constructed by graph traversal: isoforms 1–5 were Class I; 6–8 were Class II.

Table 1 shows the results of the analysis of the AS and gene modelling from the graph algorithm based on genomic clustering data. In the human genome, 13 560 (66%) genes were found to have >1 transcript, and in the mouse genome, 11 752 (56%) genes were estimated to have at least one transcript. A total of 402 995 transcripts were formed, of which 97 286 (24%) in human and 33% (64 255 of 197 361) in mouse were Class I. On average, 6.5 and 5.0 isoforms were generated from one AS gene in human and mouse respectively. The AS patterns and tissue specificities of the transcripts were examined only for Class I transcripts.


View this table:
[in this window]
[in a new window]

 
Table 1. Statistics of the reconstructed genes and transcripts in human and mouse

 


View this table:
[in this window]
[in a new window]

 
Table 2. Evaluation of AS events in human SCARF1 gene by applying binomial statistics

 
Given that a transcript is an output of several possible combinations of individual AS events in an individual gene, the prediction and reconstruction of transcript isoforms is another challenging problem. The simplest method is to generate all possible combinations from the graph traversal. As noted above, however, this may produce many false transcripts. Thus, we evaluated its reliability according to EST evidence, and we classified each set of transcripts into three groups. As other good measures to determine the transcript reliability, Kim et al. 25Go classified the transcripts according to the minimum number of sequences required to fully encompass the transcript structure and Xing et al.26Go suggested the heaviest bundling algorithm, in which a transcript with only partial alignment information is extended to the maximum length of the path with the maximum likelihood.

How many genes generate multiple transcripts and how many transcripts exist in humans are very important questions that remain unanswered. Exon-junction microarray analysis shows that ~75% of human multi-exon genes undergo AS.16Go In our case, 52% of all human genes had multiple transcripts, although the percentage of AS genes increases to 66% if we consider only multi-exon genes with EST support. The percentage of AS genes is very similar to the results of other studies based on graph algorithms;13Go,23Go however, the number of transcript variants from our study is smaller than other reported values. The low outcome is largely due to the application of strict selection criteria for the AS relationships, namely, a frequency cutoff of 0.01 (1%) with 95% statistical confidence. Without adopting the statistical selection for AS events results in >200 000 Class I transcripts and a total of ~1 000 000 transcripts for all three classes, which is comparable to the findings of other studies in the ESTGenes, ECgene, ASG and ASD databases (Table 4). In these cases, however, the splice graph will contain many AS events with very low frequency (<0.001) and many transcripts would be generated from the splice graph. We could not guarantee the existence of these AS events in real tissues or organs. Moreover, as mentioned previously,33Go if we allow all of the splice variations based on ESTs, we found that >95% of the multi-exon genes with >300 EST supports undergo AS events. The expectation that almost all genes undergo AS if there are sufficient ESTs does not seem to be reasonable. Thus, we applied a two-step selection strategy as described above. Using our strategy, we found that the percent of AS genes converged to ~75% even though EST coverage in a gene cluster increased to >1000 (data not shown). Therefore, we suppose the Class I transcripts with small number and AS events defined in our study are the reliable sets confidently selected from all possible combinations.

3.3 Detection of AS events
For each gene, we defined and examined nine types of AS patterns by exon–exon and exon–intron linkage analysis of Class I transcripts: cassette exon, 5'-splice site variation, 3'-splice site variation, mutually exclusive exon, multiple cassette exon, overlapping exon, composite splice site variation, intron retention and multiple intron retention (Fig. 3). In human, cassette exon was the most frequent AS event, followed by intron retention and multiple cassette exon. Table 3 summarizes the results of the nine AS patterns. Alternative exons in the defined AS patterns accounted for ~41% of all alternative exons in our study. Characterization and investigation of more complex AS patterns would allow discovery of additional AS events.


Figure 3
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3. Classification of the AS patterns characterized in TISA. The nine types of mutually exclusive AS patterns characterized in this study are represented. Boxes and lines indicate exons and introns, respectively. Alternative exons are indicated as shaded boxes.

 


View this table:
[in this window]
[in a new window]

 
Table 3. The number of TISA events predicted in human and mouse (P < 0.01)

 


View this table:
[in this window]
[in a new window]

 
Table 4. Comparison of current human AS databases

 
Because the method depends on mRNA/EST evidence, it cannot identify novel AS events if there is no sequence evidence. Usually, ESTs are generated by single-pass sequencing of cDNA clones at both ends, resulting in relatively low coverage in the genes' internal region. Although >7.5 million human ESTs are in the public database, we could not determine whether that is sufficient for the AS detection. To overcome this problem, several approaches have been employed, including splice microarrays to detect AS,16Go–18Go machine learning techniques such as support vector machine or hidden Markov models,19Go–21Go and application of comparative genomics.37Go Such methods can be used to detect AS in the internal regions or in the genes with little or no sequence evidence; however, ab initio prediction accompanying the machine learning technique is in a preliminary stage, so the research has so far been conducted on only a few simple AS patterns, such as cassette exon and/or intron retention. As for the splice microarray, its weakness is that the experiment is possible only for already known exon sets, and the detection range is limited to a few of simple AS events, similar to the results obtained so far with the machine learning approach. Therefore, such methods can be possible using the information obtained from EST-based analysis and can enhance the overall sensitivity of AS detection complementarily with the EST-based method.

3.4 Statistical analysis of the tissue specificity of genes and transcript variants
Using tissue information for ESTs, we have constructed digital expression profiles for each gene and transcript. From 8616 cDNA libraries, there were 7 544 537 ESTs classified into 46 human tissues. Individual ESTs were classified as normal- or cancer-derived according to their histological status. We used only ‘normal’ ESTs. If the proportion of ESTs from a specific tissue was significantly higher than that from other tissues, the gene would have a higher probability of tissue-specific expression in that tissue. In the example shown in Fig. 4, there are two possible isoforms in human TFAP2B (transcription factor AP-2 beta) gene. Among 36 supporting ESTs, 24 are common to both isoforms and 4 are singletons. For the tissue specificity of the gene in retina tissue, a 2 x 2 contingency table was constructed (Fig. 4C) and the P-value was calculated as described in the Materials and Methods. In this case, the sample gene is probably retina-specific because its P-value was much <0.01 (99% confidence level). For the other tissues, the same evaluations were performed sequentially. The contingency table of the AS event was constructed in the same way (Fig. 4D) as for retina tissue (i.e. cassette exon event in E2 and the P-value calculated using Fisher's exact test). In this case, the cassette exon event is not predicted to be retina-specific in 99% confidence level (however it is significantly retina-specific in 95% confidence level).


Figure 4
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4. Analysis of tissue specificity of the gene, transcripts and AS event. (A) The gene cluster of human TFAP2B with two isoforms. There are two transcript isoforms possible from the gene cluster, with 34 supporting ESTs. A cassette exon event in the second exon is marked with a dashed rectangle. In this example, we considered only ESTs from normal libraries. The tissue from which EST was derived is indicated. (B) Tissue distribution table of ESTs for the gene and isoforms. (C) The 2 x 2 contingency table for tissue specificity calculation of the gene in retina tissue. The numbers of total ESTs and retina-specific ESTs were derived from the human clustering data. The P-value calculated by Audic's method is indicated. (D) The contingency table for tissue specificity of the AS event in retina tissue. The P-value calculated by Fisher's exact test is indicated. (E) Tissue distribution table of ESTs for isoform 1. (F) The contingency tables for tissue specificity calculation of the transcript isoforms in retina tissue in Cases 1 and 2 (see the text for details). The P-values calculated by Audic's method are indicated for each case. Singleton ESTs were included at tissue-counting for the gene but were excluded at counting for transcript isoforms.

 
The tissue specificity of the transcript level is determined using essentially the same approach as the tissue specificity of the genes, although one EST could be included in both transcripts. To handle this multiple inclusion problem, we tried two different tissue-EST counting methods. If the multiple inclusion of an EST is ignored, it is possible to obtain a tissue-EST table like that shown for Case 1 in Fig. 4E. The default method is to calculate the tissue specificity as for the genes from the table (Fig. 4F, Case 1). To improve this method, we introduced a weighting factor that provides a different score for EST tissue counting (Fig. 4E, Case 2). If an EST is derived from a certain tissue and is included in n transcript isoforms, its tissue contribution is given as 1/n. The tissue count was rounded off to the nearest integer prior to the P-value calculation. In our study, we chose the Case 2 counting method for tissue-specific analysis of transcripts.

By applying Audic's test and the Benjamini–Yekutieli method, we predicted 12 711 tissue-specific genes and 16 016 tissue-specific transcripts (Table 5) for the condition of FDR <0.01 and a tissue-EST count ≥3. In mouse, 15 895 genes and 19861 transcripts were predicted to be tissue-specific in 44 tissues. It should be noted that a gene or a transcript can be predicted to be tissue-specific in more than one tissue. Among tissue-specific genes and transcripts in human, 7522 genes (59%) and 11 592 transcripts (72%) were specific only to a single tissue. We have checked the conservation of tissue-specificity between orthologous human and mouse gene pair. Orthologous groups were assigned to each gene according to HomoloGene (NCBI Build48.1). Of 4965 orthologous pairs specific to a single tissue, 11.7% (584 genes) were conserved in its tissue-specificity. In particular, testis-specific genes occupied a major portion (58%) among these tissue-conserved genes (Fig. 5).


Figure 5
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5. Tissue- and GO term-distribution of 584 tissue-specific genes evolutionarily conserved between human and mouse. (A) Tissue distribution. (B) GO term distribution in biological process section.

 


View this table:
[in this window]
[in a new window]

 
Table 5. The number of tissue-specific genes and transcripts predicted in human and mouse

 
3.5 GO analysis
In human, 14 570 genes were assigned to 4666 distinct GO terms across the three functional categories: biological process, molecular function and cellular components. Among these, 9612 alternatively spliced genes were assigned to 4134 GO terms and 4958 non-alternatively spliced genes were assigned to 2615 GO terms. In mouse, 16 271 genes were assigned to 4613 GO terms and 8519 alternatively spliced genes to 3770 GO terms. GO analysis was carried out to determine whether there was any functional bias of the gene groups for a certain AS pattern. For example, we examined whether the genes with a cassette exon event showed significant enrichment in any of the GO terms. By applying Fisher's exact test, we discovered the differences between the groups of the genes in the different AS types. The selection criteria for the AS pattern and GO term were as follows: GO depth in the tree structure ≥4, number of genes in a given AS pattern for a given GO term ≥10 and P-value ≤0.01. The results are listed at Table 6. Interestingly, genes involved in protein modification including phosphorylation and dephosphorylation had relatively more cassette exon events than other AS events. Also, multiple cassette exon events were substantially enriched with genes involved in apoptosis and mitosis. Mutually exclusive exon events were substantially enriched with ion transport and actin cytoskeleton genes. The genes for chromosome structure formation and modification were coupled with 5'-splice site variation. Genes involved in nucleotide metabolism were the most common in the 3'-splice site variation. For the other AS patterns, there were no significant couplings with any gene group. A previous study also reported the statistical differences in the AS types commonly arising in different tissues.38Go It is interesting that there is an apparent bias in the distributions of genes between different AS patterns and tissues.


View this table:
[in this window]
[in a new window]

 
Table 6. Significant GO terms enriched in genes with each AS pattern in human

 
Each tissue has its own characteristics, resulting in a unique biological function, morphology and physiology. These characteristics are due to the differential expressions of genes and/or transcripts. Similar to the analysis of GO terms for different AS patterns, we searched for GO terms significantly enriched within the group of tissue-specific transcripts. Significantly enriched GO terms for the group of tissue-specific transcripts in each tissue are in accordance with our previous knowledge. For example, GO terms closely related to brain functions (i.e. central nervous system development, synaptic vesicle, ion channel activity and neurotransmitter transport) were found to be enriched in brain-specific transcripts (See Supplemental Tables 1 and 2). Similar profiles were obtained for the tissue-specific genes. These results provide indirect evidence of the reliability of our approach for identifying tissue-specific genes and transcripts.

In conclusion, we have presented a detailed algorithm for genome-based clustering and development of a splice graph to identify AS events and variable transcripts derived from individual genes. Our algorithm reduced the possibility of false transcript formation because we applied a scoring rule and a statistical method for the selection of valid AS sets so that only exon/intron sets and alternative linkages with sufficient evidence were selected and used to build the splice graph. We also carried out additional classification of the final transcript based on the mRNA/EST query alignment information supporting the transcript structure. We then subdivided the transcript into one of the three classes according to its reliability. We also provided statistical applications for identifying tissue-specific genes, transcripts and AS events based on EST tissue information. By adapting very conservative criteria for genomic mapping and allowing statistically significant alternative splice relationships, we identified AS information and transcript variants with fewer false positives. The GO analyses support our methodology and tissue specificity results, suggesting that we correctly identified the set of tissue-specific transcripts.

Our method constitutes an integrated system and can be easily applied to other organisms for which full genome sequences are available and the number of complete mRNAs is sufficient. In this study, we applied our system to human and mouse sequences. The results obtained from this study might help clarify the complex biological nature of AS and the differential expression of transcript variants in many tissues. This can also provide information about the reliable exon sets, including potential AS exons, which can be employed in other AS detection methods, including the design of a splice microarray and a training set for the machine learning approach.


    Acknowledgements
 Top
 Abstract
 1. Introduction
 2 Materials and Methods
 3 Results and Discussion
 Acknowledgements
 REFERENCES
 
This work was supported by grants (nos. FGM0300411 and FGM0300512) from the Korean Science and Engineering Foundation to the international joint research project (M6-0401-00-0178) between the Korean Research Institute of Bioscience and Biotechnology and Weizmann Institute of Science in Israel.

Supplementary Data: Supplementary material is available online at http://www.dnaresearch.oxfordjournals.org.


    Footnotes
 
*To whom correspondence should be addressed. Tel: +82-42-879-8560, Fax: +82-42-879-8569, E-mail: hurlee{at}kribb.re.kr

Communicated by Kenta Nakai


    REFERENCES
 Top
 Abstract
 1. Introduction
 2 Materials and Methods
 3 Results and Discussion
 Acknowledgements
 REFERENCES
 

  1. Modrek, B. and Lee, C. 2002, A genomic view of AS, Nature Genet., 30, 13–19.[CrossRef][Web of Science][Medline]
  2. Kan, Z., Rouchka, E.C., Gish, W., and States, D. 2001, Gene structure prediction and AS analysis using genomically aligned ESTs, Genome Res., 11, 889–900.[Abstract/Free Full Text]
  3. Modrek, B., Resch, A., Grasso, C., and Lee, C. 2001, Genome-wide detection of alternative slicing in expressed sequences of human genes, Nucleic Acids Res., 29, 2850–2859.[Abstract/Free Full Text]
  4. Eyras, E., Caccamo, M., Curwen, V., and Clamp, M. 2004, ESTGenes: AS from ESTs in Ensembl, Genome Res., 14, 976–987.[Abstract/Free Full Text]
  5. Maniatis, T. and Tasic, B. 2002, Alternative pre-mRNA splicing and proteome expansion in metazoans, Nature, 418, 236–243.[CrossRef][Medline]
  6. Stamm, S. 2002, Signals and their transduction pathways regulating AS: a new dimension of the human genome, Hum. Mol. Genet., 11, 2409–2416.[Abstract/Free Full Text]
  7. Kornblihtt, A. R., De La Mata, M., Fededa, J.P., et al. 2004, Multiple links between transcription and splicing, RNA, 10, 1489–1498.[Abstract/Free Full Text]
  8. Lewis, B. P., Green, R.E., and Brenner, S. E. 2003, Evidence for the widespread coupling of AS and nonsense-mediated mRNA decay in humans, Proc. Natl. Acad. Sci.USA, 100, 189–192.[Abstract/Free Full Text]
  9. Cartegni, L., Chew, S., and Krainer, A. 2002, Listening to silence and understanding nonsense: exonic mutations that affect splicing, Nature Rev., 3, 285–298.
  10. Bracco, L. and Kearsey, J. 2003, The relevance of alternative RNA splicing to pharmacogenomics, Trends Biotechnol., 21, 346–353.[CrossRef][Web of Science][Medline]
  11. Gracia-Blanco, M. A., Baraniak, A. P., and Lasda, E. L. 2004, AS in disease and therapy, Nature Biotech., 22, 535–546.[CrossRef][Web of Science][Medline]
  12. Brett, D., Hanke, J., Lehmann, G., et al. 2000, EST comparison indicates 38% of human mRNAs contain possible alternative splice forms, FEBS Lett., 474, 83–86.[CrossRef][Web of Science][Medline]
  13. Thanaraj, T. A., Stamm, S., Clark, F., et al. 2004, ASD: the AS Database, Nucleic Acids Res., 32, D64–D69.[Abstract/Free Full Text]
  14. Gupta, S., Zink, D., Korn, B., et al. 2004, Genome wide identification and classification of AS based on EST data, Bioinformatics, 20, 2579–2585.[Abstract/Free Full Text]
  15. Huang, H. D., Horng, J. T., Lee, C. C., Liu, B.J. 2003, ProSplicer: a database of putative AS information derived from protein, mRNA and expressed sequence tag sequence data, Genome Biol., 4, R29.[CrossRef][Medline]
  16. Johnson, J. M., Castle, J., Garrett-Engele, P., et al. 2003, Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays, Science, 302, 2141–2144.[Abstract/Free Full Text]
  17. Pan, Q., Shai, O., Misquitta, C., et al. 2004, Revealing global regulatory features of mammalian AS using a quantitative microarray platform, Mol. Cell, 16, 929–941.[CrossRef][Web of Science][Medline]
  18. Fehlbaum, P., Guihal, C., Bracco, L., and Cochet, O. 2005, A microarray configuration to quantify expression levels and relative abundance of splice variants, Nucleic Acids Res., 33, e47.[Abstract/Free Full Text]
  19. Cawley, S. and Pachter, L. 2003, HMM sampling and applications to gene finding and AS, Bioinformatics, 19, ii36–ii41.[Abstract]
  20. Dror, G., Soreck, R., and Shamir, R. 2004, Accurate identification of alternatively spliced exons using support vector machine, Bioinformatics, 21, 897–901.
  21. Ratsch, G., Sonnenburg, S., and Scholkopf, B. 2005, RASE: recognition of alternatively spliced exons in C. elegans, Bioinformatics, 21, i369–i377.[Abstract]
  22. Lee, C. 2003, Generating consensus sequences from partial order multiple sequence alignment graphs, Bioinformatics, 19, 999–1008.[Abstract/Free Full Text]
  23. Leipzig, J., Pevzner, P., and Heber, S. 2004, The AS Gallery (ASG): bridging the gap between genome and transcriptome, Nucleic Acids Res., 32, 3977–3983.[Abstract/Free Full Text]
  24. Lee, B. T. K., Tan, T. W., Ranganathan, S. 2004, DEDB: a database of Drosophila melanogaster exons in splicing graph form, BMC Bioinformatics, 5, 189.[CrossRef][Medline]
  25. Kim, N., Shin, S., and Lee, S. 2005, ECgene: Genome-based EST clustering and gene modeling for AS, Genome Res., 15, 566–576.[Abstract/Free Full Text]
  26. Xing, Y., Resch, A., and Lee, C. 2004, The multiassembly problem: Reconstructing multiple transcript isoforms from EST fragment mixtures, Genome Res., 14, 426–441.[Abstract/Free Full Text]
  27. Hertel, K. J., Lynch, K. W., Maniatis, T. 1997, Common themes in the function of transcription and splicing enhancers, Curr. Opin. Cell Biol., 9, 350–357.[CrossRef][Web of Science][Medline]
  28. Ladd, A.N. and Cooper, T.A. 2002, Finding signals that regulate AS in the post-genomic era, Genome Biol., 3, 0008.
  29. Xu, Q., Modrek, B., Lee, C. 2002, Genome-wide detection of tissue-specific AS in the human transcriptome, Nucleic Acids Res., 30, 3754–3766.[Abstract/Free Full Text]
  30. Kirov, S. A., Peng, X., Baker, E., et al. 2005, GeneKeyDB: A lightweight, gene-centric, relational database to support data mining environments, BMC Bioinformatics, 6, 72.[CrossRef][Medline]
  31. Kent, W.J. 2002, BLAT-The BLAST-Like Alignment Tool, Genome Res., 12, 565–664.
  32. Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., Miller, W. 1998, Computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res, 8, 967–974.
  33. Kan, Z., Startes, D., Gish, W. 2002, Selecting for functional alternative splices in ESTs, Genome Res., 12, 1837–1845.[Abstract/Free Full Text]
  34. Audic, S. and Claverie, J. M. 1997, The significance of digital gene expression profiles, Genome Res., 7, 986–995.[Abstract/Free Full Text]
  35. Benjamini, Y. and Yekutieli, D. 2001, The control of the false discovery rate in multiple hypotheses testing under dependency, Annal. Stat., 29, 1165–1188.[CrossRef]
  36. Zeeberg, B. R., Feng, W., Wang, G., et al. 2003, GoMiner: a resource for biological interpretation of genomic and proteomic data, Genome Biol., 4, R28.[CrossRef][Medline]
  37. Sorek, R., Shemesh, R., Cohen, Y., et al. 2004, A non-EST-based method for exon-skipping prediction, Genome Res., 14, 1617–1623.[Abstract/Free Full Text]
  38. Yeo, G., Holste, D., Kreiman, G., and Burge, C. B. 2004, Variation in AS across human tissues, Genome Biol., 5, R74.[CrossRef][Medline]

Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
P. de la Grange, L. Gratadou, M. Delord, M. Dutertre, and D. Auboeuf
Splicing factor and exon profiling across human tissues
Nucleic Acids Res., January 27, 2010; (2010): gkq008v1 - gkq008.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
E. Melamud and J. Moult
Stochastic noise in splicing machinery
Nucleic Acids Res., August 1, 2009; 37(14): 4873 - 4886.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
13/5/229    most recent
dsl011v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrow Request Permissions
Google Scholar
Right arrow Articles by Noh, S.-J.
Right arrow Articles by Hur, C.-G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Noh, S.-J.
Right arrow Articles by Hur, C.-G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?