DNA Research Advance Access originally published online on November 14, 2006
DNA Research 2006 13(5):229-243; doi:10.1093/dnares/dsl011
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TISA: Tissue-specific Alternative Splicing in Human and Mouse Genes
Bioinformatics Lab. Plant genomics center KRIBB, 52 Eoeun-dong, Yuseong-gu, Daejon, 305-333 Korea
Received 5 July 2006; revised 15 October 2006
| Abstract |
|---|
|
|
|---|
Alternative splicing (AS) is a mechanism by which multiple transcripts are produced from a single gene and is thought to be an important mechanism for tissue-specific expression of transcript isoforms. Here, we report a novel graphing method for transcript reconstruction and statistical prediction of tissue-specific AS. We applied three selection steps to generate the splice graph and predict the transcript isoforms: (i) a custom scoring rule for exon/intron sets, (ii) binomial statistics for selecting valid alternative splicing with a frequency of at least 1% for the predominant form and (iii) evaluation of transcript structure. We obtained 97 286 and 66 022 valid transcripts from 26 143 human and 27 741 mouse genes, respectively. In addition, we discovered 33 481 AS events for nine types of AS patterns in human. The statistical significance of tissue specificity for each gene, transcript and AS event was assessed based on EST tissue information, followed by a multiple testing correction procedure. In human, 12 711 genes, 16 016 transcripts and 1035 AS events were predicted to be tissue-specific (false discovery rate <0.01). This information on genes, transcript structures, AS events and their tissue specificities in human and mouse are freely accessible on the TISA website (http://tisa.kribb.re.kr/AGC/).
Key words: alternative splicing; genome-based clustering; splice graph; transcript variants; tissue-specific; gene ontology
| 1. Introduction |
|---|
|
|
|---|
Alternative pre-mRNA splicing (AS) is a biological mechanism for producing multiple transcripts from a single gene. Genomic research has revealed that AS occurs in 3060% of human genes.1
Several different strategies have been applied to AS analysis, including (i) EST mapping against mRNA12
, (ii) mRNA/EST/protein mapping to the genome2
4
,13
15
, (iii) splicing microarray analysis16
18
and (iv) ab initio machine learning approaches.19
21
Among these methods, genome-based mRNA/EST alignment has been an important methodology for in silico detection of AS because ESTs and mRNAs are the most abundant resources and are considered direct evidence of the expressed genes. A number of algorithms to analyse AS using genomic alignment of expressed sequences have been reported. Kan et al.2
classified links between exons within a gene according to four types of connectivity, represented them as a matrix, and then constructed transcripts using matrix analysis to identify the paths that can be connected.
Recently, several studies have suggested graphical methods to represent gene structure.4
,22
25
In these methods, exons and introns are verified from structural information obtained by genomic mapping, and splice graphs are constructed in which exons and introns constitute nodes and edges, respectively; however, generating all possible isoforms by graph traversal can produce many false transcripts. Lee's group22
,26
tried to overcome this problem by using a dynamic programming method. They suggested use of the heaviest bundling algorithm, in which a transcript with only partial alignment information is extended to the maximum length of the path with the maximum likelihood. On the other hand, in ECgene, Kim et al. discriminated the isoforms into three groups by considering the minimum set of sequence data that indicates the reliability of the corresponding transcript.
AS makes it possible to generate genetic products with different functions from a single gene. This process is controlled by numerous splicing regulatory factors, including serine/arginine-rich proteins that can bind to cognate sequence elements in mRNA precursor such as exonic splicing enhancer, exonic splicing silencer, intron splicing enhancer and intronic splicing silencer sequences, and regulate the splice site selections.27
The outcome of the AS process can vary depending on several factors, such as the tissue, the developmental stage or the environmental conditions.1
,5
,28
Therefore, global studies of condition-specific AS are required to precisely understand the function of genes. Xu et al.29
examined tissue-specific AS using Bayesian statistics and EST tissue information, which focuses on only single AS events rather than on the transcript levels. Because each transcript may be the result of many sequential AS events, it is difficult to correlate the tissue specificity of an individual AS event with that of the transcript. In this study, we provide a methodology and results for reconstructing and classifying transcript variants from a splice graph that contains only valid AS sets according to binomial statistics (i.e. an observed frequency of >1% with a 95% confidence level). Moreover, we provide global information on the tissue specificity of transcript isoforms and the distinct AS events according to the organizational information from human and mouse ESTs.
| 2 Materials and Methods |
|---|
|
|
|---|
2.1 Data preparation
Human and mouse mRNAs deposited in GenBank (2006 June version) were retrieved using the following search term: human[Organism] AND biomol mRNA [Properties] NOT gbdiv est[Properties]. More than 7.5 million human ESTs and
4.6 million mouse ESTs were collected from dbEST (2006 June version). Repeat-masked genomic contig sequences of human and mouse were obtained from NCBI (Build 36 for human and mouse). The cDNA library information for human and mouse was downloaded from the Cancer Genome Anatomy Project (CGAP) website (http://cgap.nci.nih.gov/). Using CGAP library information, we classified all ESTs from 8616 human cDNA libraries into 46 tissues, except for embryonic tissues, head and neck, pooled tissue, uncharacterized tissue and whole body. We also separated the histological status of the ESTs into normal, cancer, or unknown according to the CGAP histology classification. We used only ESTs from the normal library for analysis of tissue specificity. Gene annotation including gene ontology (GO) assignments was performed using GeneKeyDB.30
2.2 Pre-processing and genomic mapping
We generated a non-redundant set of contigs with known chromosomal locations and orientations and used it as a template for genomic alignment. PolyA stretches of mRNA/ESTs were removed, and the EST sequences were trimmed by 20 nt at both ends as previously described.4
The presence of a polyA tail or polyA addition signal(s) for each query was recorded and used later for the detection of the gene terminus. We prepared a non-redundant set of mRNAs by running the nrdb program. Expressed sequences longer than 100 nt were positioned in the genome by homology-based comparison using BLAT31
and sim432
. After fast alignment using BLAT, top five best matches for a query were refined by sim4 alignment using the genomic region including 10kb flanking regions at both ends of the original BLAT-mapped position. We saved only sequences with >60% coverage and >92% identity with ESTs or >96% identity with mRNAs (Fig. 1). If several genomic loci satisfied the criteria for a single query, the best match was regarded as the proper alignment.
|
2.3 Genomic clustering
By scanning each chromosome from the start, we compared the genomic alignment structures of full-length mRNAs whose alignment region overlapped with clustered mRNAs that share at least one common splice site. If the alignment showed a distinct structure, we created a new cluster. From the first cluster, we constructed a preliminary reconstructed structure (RS) that corresponds to a gene unit composed of exon/intron sets and their linkage relationships. During the clustering process, we extended the gene structure whenever new exons or introns were recognized. After clustering the full-length mRNAs, partial mRNAs and ESTs aligned within the preliminary gene boundary were retrieved and compared with the RS structure. We ignored clustering in the genomic region without full-length mRNA mapping, even though many ESTs or partial mRNAs were aligned in that region. We allowed all possible splice sequences as well as canonical splice sequences (CSSs; i.e. GT-AT, GC-AG and AT-AC); however, in the case of non-CSSs, we approved them only if they were confirmed by at least five ESTs. The query orientation along the genome was re-evaluated based on the splice site sequence of the query. For example, if a splice sequence set of a given query was mainly composed of GT-AG, it is obvious that the query orientation along the genome was forward. If a splice sequence set was mainly composed of CT-AC, the query was aligned in a reverse direction. The orientation of a gene was determined as the same as the query. Comparisons of the splice site position and its orientation were used as the main criteria for deciding whether a query should be included in a given cluster.
2.4 Establishment of exon and intron sets
To measure the confidence of RS exons and introns, we applied a custom scoring scheme. The reliability of RS exons depends on (i) whether the splice sequences of introns linked at both ends of the exon are CSS or non-CSS and (ii) how many queries support the given exon or intron structure. For example, the exons linked with CSS introns at both ends and supported by a number of mRNAs and ESTs have high scores. Because full-length mRNAs provide more reliable evidence than ESTs, we imposed a double score for the exons and introns derived from full-length mRNAs. The group of RS exons/introns whose hard-edge boundaries are within ±6 nt were merged into a single exon or intron and rescored. In the process of RS exon/intron merging, the exon or intron with the best score was selected as a representative. Finally, RS exons with scores higher than the cutoff were saved and used for linkage analysis. The minimal requirements for saving an exon were that it should be supported by at least one EST whose boundaries are linked to CSS introns at both ends. For example, if an internal exon is supported by two ESTs and is linked with non-CSS introns at both ends, its exon score will be 0.4 and 0.4 for the score of non-CSS linked exon boundary is given as 0.2. Because the score of this exon is lower than the cutoff value (for first exon, [, 1.0]; internal exon, [1.0, 1.0]; last exon, [1.0, ]), it was discarded from the RS exon set. The same rule was applied to the intron scoring and selection (cutoff score = 1.0). We discarded internal exons shorter than 10 bp and introns shorter than 20 bp.
2.5 Evaluation and selection of alternative linkage relationships
We evaluated each mutually exclusive alternative linkage relationship using the exact binomial probability as previously reported.33
Briefly, k is the number of ESTs supporting a particular alternative splice Y, and n is the total number of EST supporting all of mutually exclusive splice patterns including Y. The observed frequency of Y (= k/n) is tested to see if it is higher than a given threshold frequency f with statistical significance. The binomial probability
indicates the probability that alternative splice Y occurs at least k times in n trials with a given frequency f. This can be a good estimation of whether the observed frequency of Y exceeds a given threshold frequency relative to all of the alternative splices in mutual relationships. If
is <0.05, then the observed frequency of the alternative splice is expected to exceed the given threshold with high (>95%) confidence. The formula used for the calculation is as follows:
![]() | (1) |
We applied a 1% frequency threshold (f = 0.01) and a 95% CI as cutoff criteria. We allowed only the alternative linkage relationships satisfying the above condition in the splice graph.
2.6 Transcript reconstruction and classification
After establishing the non-redundant exon/intron sets statistically verified as described above, we found all possible paths by traveling the nodes (i.e. exons) through edges (i.e. introns) in a splice graph starting from the first exon (eTag = 1) and finishing at the last exon (eTag = 3). From each path, we reconstructed a transcript structure in a given gene. For each transcript, we compared its genomic structure one by one with all of query structures clustered in a gene and arranged the list of queries whose structure do not collide with the transcript structure (Fig. 2C). Then, we checked the coverage of each transcript structure supported by non-colliding queries and subdivided the transcripts into three classes. In Class I, all of the exons and introns in a transcript were supported by query structure(s) or its combinations. As for Class II, at least one of exons or introns in a transcript was not validated. In Class III, there is no supporting evidence for a transcript. Class I transcripts represent most reliable set of transcripts generated from the splice graph. Class III transcripts are least reliable in which all the exonexon connections are only theoretically possible in a graph.
|
2.7 Statistical analyses for tissue specificity
If a particular gene has a large number of ESTs derived from a specific tissue, we assumed that the gene is likely to be expressed at a higher level in that tissue. We applied the statistical analysis designed by Audic and Claverie34
2.7.1 Audic's test
In Audic's test, the number of successfully clustered ESTs from all genes is N, the number of ESTs from a specific tissue is N2, the number of ESTs from all other tissues is N1, the total number of ESTs clustered on a given gene is n, the number of ESTs from a specific tissue in a given gene is y, and the number of ESTs from all other tissues in a given gene is x. Thus, N1 + N2 = N and x + y = n. We set the null hypothesis, Ho, i.e. there is no difference between the ratio of the specific tissue count for a gene and the number for ESTs in that tissue (y/N2) and the ratio of all other tissue counts in a gene and the number of ESTs in all other tissues (x/N1). Given that x was derived from N1, the conditional probability that y would be derived from N2 is as follows:34
![]() | (2) |
![]() | (3) |
If the P-value is <0.01, we deny the null hypothesis with 99% confidence, and the EST count in that tissue for a given gene is significantly different than the EST count in all other tissues, indicating that the gene is expressed at a higher level in that tissue.
2.7.2 BenjaminiYekutieli method for multiple test correction
To correct significance estimates for multiple hypothesis testing, we used the false discovery rate (FDR) method of Benjamini and Yekutieli,35
which calculates a new test statistic that estimates the portion of Type I errors within a group of hypotheses meeting a significance cutoff. The FDR is the expected proportion of erroneously rejected null hypotheses among the rejected ones. The BY procedure is performed as follows. The individual P-values are sorted from smallest to largest: P(1)
P(2)
P(m). Starting from the largest P-value (P(m)), P(i) is compared with a i/m/(1/1+1/2+
+1/m), where
is a desired level of FDR. This is continued as long as P(i) >
i/m/(1/1+1/2+
+1/m). Next, k is set as the first time P(i)
i/m/(1/1+1/2+
+1/m). The hypotheses corresponding to the smallest k P-values are then declared to be significant. We set 0.01 as the desired value of
and rejected the k hypotheses with P-values less than P(k) in each evaluation of tissue specificity. The BenjaminiYekutieli method considers the random dependence between the tests. We reasoned that the tissue specificity of the gene (or transcripts) in a particular tissue can be affected by the tissue specificities in other tissues for a given gene.
2.8 GO analysis
The full list of GO terms was downloaded from the Gene Ontology Consortium, and the directed acyclic graph structure of GO was rebuilt as similar with AmiGO (http://godatabase.org/). For each GO term, all associated child GO terms in the downstream branch nodes were gathered recursively by searching the whole GO tree structure for related entries with either an is a or a part of relationship. A total of 4134 GO terms were found to be associated with 9612 alternatively spliced human genes. Fisher's exact test using a 2 x 2 contingency table was applied to investigate whether there was any bias for a certain AS pattern amongst the nine AS patterns for a given GO term.36
In the table, the two columns are genes with the AS pattern and genes without the AS pattern, and the two rows are having the GO term and not having the GO term. Similarly, we carried out tests for the enrichment of specific GO terms in the group of tissue-specific genes or transcripts for each tissue. In these cases, the two columns are genes (or transcripts), tissue-specific in a certain tissue and genes (or transcripts), tissue-specific in all other tissues, and the two rows are having the GO term and not having the GO term.
| 3 Results and Discussion |
|---|
|
|
|---|
The overall process and the results of each step for human are summarized in Fig. 1. The process can be divided into three main steps: (i) construction of a splice graph based on genomic alignment of mRNA/EST; (ii) reconstruction of transcript variants and AS pattern analysis from the splice graph; and (iii) statistical analysis of the tissue specificity of the genes, transcripts and AS events using the tissue categories of the supporting ESTs.
3.1 Splice graph construction from the genomic alignments of mRNAs/ESTs
Using genome-based mRNA/EST mapping, we successfully localized
90% of mRNA/ESTs in the genome. We developed our own clustering program using genomically aligned full-length mRNA as the standard. Because our study focuses on AS analysis and forecasting of tissue-specific AS rather than on gene prediction, we conducted clustering in the genomic region with full-length mRNA mapping. Many partial mRNAs and ESTs aligned with the genome were discarded because a lack of full-length mRNA data. As a result, 178 807 (74%) mRNAs and 5 605 025 (72%) ESTs were finally incorporated into the gene clusters of the human genome. A total of 89 814 full-length mRNAs corresponded to 26 143 genes (Fig. 1). Multi-exon ESTs accounted for 58% (3 261 926) of all clustered ESTs.
RS is composed of exon/intron sets and exonexon linkage relationships. Based on the genomic structure of standard mRNA in a cluster, newly added mRNAs or ESTs were included into the RS if their alignment structures shared at least one splice site with the RS. We scored each exon and intron using our scoring rule. The exon score is an index of the confidence of the exon. We controlled the score so that the exon has a higher splice junction score when it has more supporting queries (mRNA/ESTs) and when the introns connected to the end of the exon have CSSs (GT-AG, GC-AG and AT-AC) that are more reliable. We gave differential weights to full-length mRNAs and ESTs, considering the qualitative difference in their sequences. This rule was also applied to intron scoring. Our algorithm also allowed non-CSS introns if they were supported by at least five ESTs. Exons or introns whose boundaries were within ±6 bp at both ends were merged into a single exon or intron, and its score was recalculated. The exon or intron with the best score was chosen as representative. The merging process corrects erroneous splicing that might occur during BLAT mapping or as a result of poor sequence quality of ESTs. Consequently, we were able to obtain reliable non-redundant exon/intron sets by exon/intron merging, rescoring and selection by using a cutoff score. Based on these steps, we constructed the splice graph and conducted the AS analysis.
3.2 Splice graph analysis
Our splice graph algorithm for detecting AS is essentially the same as previously reported.24
,25
The largest differences are in (i) the application of the selection step for statistically reliable alternative linkage relationships during splice graph construction and (ii) the classification of transcript variants.
3.2.1 Selection of statistically reliable alternative splices
If all of the alternative splices that can be obtained using erroneous EST data are allowed, one or more alternatively spliced forms could be detected for nearly all genes examined as long as they have sufficient EST coverage. Therefore, a method is needed to discriminate the true isoforms from erroneous splicing resulting from aberrant splicing or misalignment. Kan et al.33
suggested a statistical method to solve this problem, wherein alternative splices are selected according to the frequency at which they occur with a high confidence level. One example is shown in Fig. 2. There are 45 supporting queries (13 mRNAs and 32 ESTs) for the human SCARF1 (scavenger receptor Class F, member 1) gene cluster. Among 13 mRNAs queries, 12 are full-length. Seven AS events can be defined from the comparisons of their alignment structures: two alternative first exon events (E1-E3 versus E2-E3, E4-E6 versus E5-E6), complex splice site variation (E7-E13 versus E12-E13), two 3'-splice variations (E8-E10 versus E8-E11, E15-E16 versus E15-E17), intron retention (E8-E10 versus E9) and alternative last exon (E13-E15 versus E14). The seven AS events and the number of sequences supporting each isoform are summarized in Table 2. Four of them (3, 4, 5 and 7) have sufficient sequence evidence for the minor isoform relative to the major isoform. AS events 1, 2 and 6, however, have only one supporting query for each, whereas the query count for the each counterpart major isoform is much larger, yielding a low observed frequency for the minor isoform. To determine whether the significance of the observed frequency (for example, 1/33 for AS event 1) of the minor isoform is higher than the threshold (frequency of 1%) or not, we calculated P-value adopting binomial statistics. The binomial probability of each AS event was calculated based on the number of supporting queries for the isoform. As shown in Table 2, the P-value for the minor isoform of AS event 1 exceeding the given threshold (1%) was 0.282. As similar, P-values for AS events 2 and 6 show the value higher than the threshold, whereas the P-values of the other four AS events were under 0.05 (i.e. 95% confidence). Thus, these three events were excluded from the splice graph (Fig. 2B).
3.2.2 Reconstruction of transcript isoforms and their classification
After the selection of a reliable alternative splice set, we built the adjacency list for the directed acyclic graph and constructed all possible combinations of paths through graph traversal as described previously.24
,25
Each path represents a transcript variant. By comparing the alignment structures of each transcript and its supporting queries, we subdivided the transcript isoforms as follows: Class I, all of the exons and introns in the transcript were supported by query structures; Class II, some of exons or introns were not validated; and Class III, none of the exons or introns had supporting evidence (Fig. 2C). We also calculated the total score of each transcript based on the number of supporting queries, considering the exon/intron coverage and sorted transcripts according to their scores. In the example of Fig. 2, a total of eight transcripts were constructed by graph traversal: isoforms 15 were Class I; 68 were Class II.
Table 1 shows the results of the analysis of the AS and gene modelling from the graph algorithm based on genomic clustering data. In the human genome, 13 560 (66%) genes were found to have >1 transcript, and in the mouse genome, 11 752 (56%) genes were estimated to have at least one transcript. A total of 402 995 transcripts were formed, of which 97 286 (24%) in human and 33% (64 255 of 197 361) in mouse were Class I. On average, 6.5 and 5.0 isoforms were generated from one AS gene in human and mouse respectively. The AS patterns and tissue specificities of the transcripts were examined only for Class I transcripts.
|
|
Given that a transcript is an output of several possible combinations of individual AS events in an individual gene, the prediction and reconstruction of transcript isoforms is another challenging problem. The simplest method is to generate all possible combinations from the graph traversal. As noted above, however, this may produce many false transcripts. Thus, we evaluated its reliability according to EST evidence, and we classified each set of transcripts into three groups. As other good measures to determine the transcript reliability, Kim et al. 25
How many genes generate multiple transcripts and how many transcripts exist in humans are very important questions that remain unanswered. Exon-junction microarray analysis shows that
75% of human multi-exon genes undergo AS.16
In our case, 52% of all human genes had multiple transcripts, although the percentage of AS genes increases to 66% if we consider only multi-exon genes with EST support. The percentage of AS genes is very similar to the results of other studies based on graph algorithms;13
,23
however, the number of transcript variants from our study is smaller than other reported values. The low outcome is largely due to the application of strict selection criteria for the AS relationships, namely, a frequency cutoff of 0.01 (1%) with 95% statistical confidence. Without adopting the statistical selection for AS events results in >200 000 Class I transcripts and a total of
1 000 000 transcripts for all three classes, which is comparable to the findings of other studies in the ESTGenes, ECgene, ASG and ASD databases (Table 4). In these cases, however, the splice graph will contain many AS events with very low frequency (<0.001) and many transcripts would be generated from the splice graph. We could not guarantee the existence of these AS events in real tissues or organs. Moreover, as mentioned previously,33
if we allow all of the splice variations based on ESTs, we found that >95% of the multi-exon genes with >300 EST supports undergo AS events. The expectation that almost all genes undergo AS if there are sufficient ESTs does not seem to be reasonable. Thus, we applied a two-step selection strategy as described above. Using our strategy, we found that the percent of AS genes converged to
75% even though EST coverage in a gene cluster increased to >1000 (data not shown). Therefore, we suppose the Class I transcripts with small number and AS events defined in our study are the reliable sets confidently selected from all possible combinations.
3.3 Detection of AS events
For each gene, we defined and examined nine types of AS patterns by exonexon and exonintron linkage analysis of Class I transcripts: cassette exon, 5'-splice site variation, 3'-splice site variation, mutually exclusive exon, multiple cassette exon, overlapping exon, composite splice site variation, intron retention and multiple intron retention (Fig. 3). In human, cassette exon was the most frequent AS event, followed by intron retention and multiple cassette exon. Table 3 summarizes the results of the nine AS patterns. Alternative exons in the defined AS patterns accounted for
41% of all alternative exons in our study. Characterization and investigation of more complex AS patterns would allow discovery of additional AS events.
|
|
|
Because the method depends on mRNA/EST evidence, it cannot identify novel AS events if there is no sequence evidence. Usually, ESTs are generated by single-pass sequencing of cDNA clones at both ends, resulting in relatively low coverage in the genes' internal region. Although >7.5 million human ESTs are in the public database, we could not determine whether that is sufficient for the AS detection. To overcome this problem, several approaches have been employed, including splice microarrays to detect AS,16
3.4 Statistical analysis of the tissue specificity of genes and transcript variants
Using tissue information for ESTs, we have constructed digital expression profiles for each gene and transcript. From 8616 cDNA libraries, there were 7 544 537 ESTs classified into 46 human tissues. Individual ESTs were classified as normal- or cancer-derived according to their histological status. We used only normal ESTs. If the proportion of ESTs from a specific tissue was significantly higher than that from other tissues, the gene would have a higher probability of tissue-specific expression in that tissue. In the example shown in Fig. 4, there are two possible isoforms in human TFAP2B (transcription factor AP-2 beta) gene. Among 36 supporting ESTs, 24 are common to both isoforms and 4 are singletons. For the tissue specificity of the gene in retina tissue, a 2 x 2 contingency table was constructed (Fig. 4C) and the P-value was calculated as described in the Materials and Methods. In this case, the sample gene is probably retina-specific because its P-value was much <0.01 (99% confidence level). For the other tissues, the same evaluations were performed sequentially. The contingency table of the AS event was constructed in the same way (Fig. 4D) as for retina tissue (i.e. cassette exon event in E2 and the P-value calculated using Fisher's exact test). In this case, the cassette exon event is not predicted to be retina-specific in 99% confidence level (however it is significantly retina-specific in 95% confidence level).
|
The tissue specificity of the transcript level is determined using essentially the same approach as the tissue specificity of the genes, although one EST could be included in both transcripts. To handle this multiple inclusion problem, we tried two different tissue-EST counting methods. If the multiple inclusion of an EST is ignored, it is possible to obtain a tissue-EST table like that shown for Case 1 in Fig. 4E. The default method is to calculate the tissue specificity as for the genes from the table (Fig. 4F, Case 1). To improve this method, we introduced a weighting factor that provides a different score for EST tissue counting (Fig. 4E, Case 2). If an EST is derived from a certain tissue and is included in n transcript isoforms, its tissue contribution is given as 1/n. The tissue count was rounded off to the nearest integer prior to the P-value calculation. In our study, we chose the Case 2 counting method for tissue-specific analysis of transcripts.
By applying Audic's test and the BenjaminiYekutieli method, we predicted 12 711 tissue-specific genes and 16 016 tissue-specific transcripts (Table 5) for the condition of FDR <0.01 and a tissue-EST count
3. In mouse, 15 895 genes and 19861 transcripts were predicted to be tissue-specific in 44 tissues. It should be noted that a gene or a transcript can be predicted to be tissue-specific in more than one tissue. Among tissue-specific genes and transcripts in human, 7522 genes (59%) and 11 592 transcripts (72%) were specific only to a single tissue. We have checked the conservation of tissue-specificity between orthologous human and mouse gene pair. Orthologous groups were assigned to each gene according to HomoloGene (NCBI Build48.1). Of 4965 orthologous pairs specific to a single tissue, 11.7% (584 genes) were conserved in its tissue-specificity. In particular, testis-specific genes occupied a major portion (58%) among these tissue-conserved genes (Fig. 5).
|
|
3.5 GO analysis
In human, 14 570 genes were assigned to 4666 distinct GO terms across the three functional categories: biological process, molecular function and cellular components. Among these, 9612 alternatively spliced genes were assigned to 4134 GO terms and 4958 non-alternatively spliced genes were assigned to 2615 GO terms. In mouse, 16 271 genes were assigned to 4613 GO terms and 8519 alternatively spliced genes to 3770 GO terms. GO analysis was carried out to determine whether there was any functional bias of the gene groups for a certain AS pattern. For example, we examined whether the genes with a cassette exon event showed significant enrichment in any of the GO terms. By applying Fisher's exact test, we discovered the differences between the groups of the genes in the different AS types. The selection criteria for the AS pattern and GO term were as follows: GO depth in the tree structure
4, number of genes in a given AS pattern for a given GO term
10 and P-value
0.01. The results are listed at Table 6. Interestingly, genes involved in protein modification including phosphorylation and dephosphorylation had relatively more cassette exon events than other AS events. Also, multiple cassette exon events were substantially enriched with genes involved in apoptosis and mitosis. Mutually exclusive exon events were substantially enriched with ion transport and actin cytoskeleton genes. The genes for chromosome structure formation and modification were coupled with 5'-splice site variation. Genes involved in nucleotide metabolism were the most common in the 3'-splice site variation. For the other AS patterns, there were no significant couplings with any gene group. A previous study also reported the statistical differences in the AS types commonly arising in different tissues.38
|
Each tissue has its own characteristics, resulting in a unique biological function, morphology and physiology. These characteristics are due to the differential expressions of genes and/or transcripts. Similar to the analysis of GO terms for different AS patterns, we searched for GO terms significantly enriched within the group of tissue-specific transcripts. Significantly enriched GO terms for the group of tissue-specific transcripts in each tissue are in accordance with our previous knowledge. For example, GO terms closely related to brain functions (i.e. central nervous system development, synaptic vesicle, ion channel activity and neurotransmitter transport) were found to be enriched in brain-specific transcripts (See Supplemental Tables 1 and 2). Similar profiles were obtained for the tissue-specific genes. These results provide indirect evidence of the reliability of our approach for identifying tissue-specific genes and transcripts.
In conclusion, we have presented a detailed algorithm for genome-based clustering and development of a splice graph to identify AS events and variable transcripts derived from individual genes. Our algorithm reduced the possibility of false transcript formation because we applied a scoring rule and a statistical method for the selection of valid AS sets so that only exon/intron sets and alternative linkages with sufficient evidence were selected and used to build the splice graph. We also carried out additional classification of the final transcript based on the mRNA/EST query alignment information supporting the transcript structure. We then subdivided the transcript into one of the three classes according to its reliability. We also provided statistical applications for identifying tissue-specific genes, transcripts and AS events based on EST tissue information. By adapting very conservative criteria for genomic mapping and allowing statistically significant alternative splice relationships, we identified AS information and transcript variants with fewer false positives. The GO analyses support our methodology and tissue specificity results, suggesting that we correctly identified the set of tissue-specific transcripts.
Our method constitutes an integrated system and can be easily applied to other organisms for which full genome sequences are available and the number of complete mRNAs is sufficient. In this study, we applied our system to human and mouse sequences. The results obtained from this study might help clarify the complex biological nature of AS and the differential expression of transcript variants in many tissues. This can also provide information about the reliable exon sets, including potential AS exons, which can be employed in other AS detection methods, including the design of a splice microarray and a training set for the machine learning approach.
| Acknowledgements |
|---|
|
|
|---|
This work was supported by grants (nos. FGM0300411 and FGM0300512) from the Korean Science and Engineering Foundation to the international joint research project (M6-0401-00-0178) between the Korean Research Institute of Bioscience and Biotechnology and Weizmann Institute of Science in Israel.
Supplementary Data: Supplementary material is available online at http://www.dnaresearch.oxfordjournals.org.
| Footnotes |
|---|
*To whom correspondence should be addressed. Tel: +82-42-879-8560, Fax: +82-42-879-8569, E-mail: hurlee{at}kribb.re.kr
| REFERENCES |
|---|
|
|
|---|
- Modrek, B. and Lee, C. 2002, A genomic view of AS, Nature Genet., 30, 1319.[CrossRef][ISI][Medline]
- Kan, Z., Rouchka, E.C., Gish, W., and States, D. 2001, Gene structure prediction and AS analysis using genomically aligned ESTs, Genome Res., 11, 889900.
[Abstract/Free Full Text] - Modrek, B., Resch, A., Grasso, C., and Lee, C. 2001, Genome-wide detection of alternative slicing in expressed sequences of human genes, Nucleic Acids Res., 29, 28502859.
[Abstract/Free Full Text] - Eyras, E., Caccamo, M., Curwen, V., and Clamp, M. 2004, ESTGenes: AS from ESTs in Ensembl, Genome Res., 14, 976987.
[Abstract/Free Full Text] - Maniatis, T. and Tasic, B. 2002, Alternative pre-mRNA splicing and proteome expansion in metazoans, Nature, 418, 236243.[CrossRef][Medline]
- Stamm, S. 2002, Signals and their transduction pathways regulating AS: a new dimension of the human genome, Hum. Mol. Genet., 11, 24092416.
[Abstract/Free Full Text] - Kornblihtt, A. R., De La Mata, M., Fededa, J.P., et al. 2004, Multiple links between transcription and splicing, RNA, 10, 14891498.
[Abstract/Free Full Text] - Lewis, B. P., Green, R.E., and Brenner, S. E. 2003, Evidence for the widespread coupling of AS and nonsense-mediated mRNA decay in humans, Proc. Natl. Acad. Sci.USA, 100, 189192.
[Abstract/Free Full Text] - Cartegni, L., Chew, S., and Krainer, A. 2002, Listening to silence and understanding nonsense: exonic mutations that affect splicing, Nature Rev., 3, 285298.
- Bracco, L. and Kearsey, J. 2003, The relevance of alternative RNA splicing to pharmacogenomics, Trends Biotechnol., 21, 346353.[CrossRef][ISI][Medline]
- Gracia-Blanco, M. A., Baraniak, A. P., and Lasda, E. L. 2004, AS in disease and therapy, Nature Biotech., 22, 535546.[CrossRef][ISI][Medline]
- Brett, D., Hanke, J., Lehmann, G., et al. 2000, EST comparison indicates 38% of human mRNAs contain possible alternative splice forms, FEBS Lett., 474, 8386.[CrossRef][ISI][Medline]
- Thanaraj, T. A., Stamm, S., Clark, F., et al. 2004, ASD: the AS Database, Nucleic Acids Res., 32, D64D69.
[Abstract/Free Full Text] - Gupta, S., Zink, D., Korn, B., et al. 2004, Genome wide identification and classification of AS based on EST data, Bioinformatics, 20, 25792585.
[Abstract/Free Full Text] - Huang, H. D., Horng, J. T., Lee, C. C., Liu, B.J. 2003, ProSplicer: a database of putative AS information derived from protein, mRNA and expressed sequence tag sequence data, Genome Biol., 4, R29.[CrossRef][Medline]
- Johnson, J. M., Castle, J., Garrett-Engele, P., et al. 2003, Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays, Science, 302, 21412144.
[Abstract/Free Full Text] - Pan, Q., Shai, O., Misquitta, C., et al. 2004, Revealing global regulatory features of mammalian AS using a quantitative microarray platform, Mol. Cell, 16, 929941.[CrossRef][ISI][Medline]
- Fehlbaum, P., Guihal, C., Bracco, L., and Cochet, O. 2005, A microarray configuration to quantify expression levels and relative abundance of splice variants, Nucleic Acids Res., 33, e47.
[Abstract/Free Full Text] - Cawley, S. and Pachter, L. 2003, HMM sampling and applications to gene finding and AS, Bioinformatics, 19, ii36ii41.[Abstract]
- Dror, G., Soreck, R., and Shamir, R. 2004, Accurate identification of alternatively spliced exons using support vector machine, Bioinformatics, 21, 897901.
- Ratsch, G., Sonnenburg, S., and Scholkopf, B. 2005, RASE: recognition of alternatively spliced exons in C. elegans, Bioinformatics, 21, i369i377.[Abstract]
- Lee, C. 2003, Generating consensus sequences from partial order multiple sequence alignment graphs, Bioinformatics, 19, 9991008.
[Abstract/Free Full Text] - Leipzig, J., Pevzner, P., and Heber, S. 2004, The AS Gallery (ASG): bridging the gap between genome and transcriptome, Nucleic Acids Res., 32, 39773983.
[Abstract/Free Full Text] - Lee, B. T. K., Tan, T. W., Ranganathan, S. 2004, DEDB: a database of Drosophila melanogaster exons in splicing graph form, BMC Bioinformatics, 5, 189.[CrossRef][Medline]
- Kim, N., Shin, S., and Lee, S. 2005, ECgene: Genome-based EST clustering and gene modeling for AS, Genome Res., 15, 566576.
[Abstract/Free Full Text] - Xing, Y., Resch, A., and Lee, C. 2004, The multiassembly problem: Reconstructing multiple transcript isoforms from EST fragment mixtures, Genome Res., 14, 426441.
[Abstract/Free Full Text] - Hertel, K. J., Lynch, K. W., Maniatis, T. 1997, Common themes in the function of transcription and splicing enhancers, Curr. Opin. Cell Biol., 9, 350357.[CrossRef][ISI][Medline]
- Ladd, A.N. and Cooper, T.A. 2002, Finding signals that regulate AS in the post-genomic era, Genome Biol., 3, 0008.
- Xu, Q., Modrek, B., Lee, C. 2002, Genome-wide detection of tissue-specific AS in the human transcriptome, Nucleic Acids Res., 30, 37543766.
[Abstract/Free Full Text] - Kirov, S. A., Peng, X., Baker, E., et al. 2005, GeneKeyDB: A lightweight, gene-centric, relational database to support data mining environments, BMC Bioinformatics, 6, 72.[CrossRef][Medline]
- Kent, W.J. 2002, BLAT-The BLAST-Like Alignment Tool, Genome Res., 12, 565664.
- Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., Miller, W. 1998, Computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res, 8, 967974.
- Kan, Z., Startes, D., Gish, W. 2002, Selecting for functional alternative splices in ESTs, Genome Res., 12, 18371845.
[Abstract/Free Full Text] - Audic, S. and Claverie, J. M. 1997, The significance of digital gene expression profiles, Genome Res., 7, 986995.
[Abstract/Free Full Text] - Benjamini, Y. and Yekutieli, D. 2001, The control of the false discovery rate in multiple hypotheses testing under dependency, Annal. Stat., 29, 11651188.[CrossRef]
- Zeeberg, B. R., Feng, W., Wang, G., et al. 2003, GoMiner: a resource for biological interpretation of genomic and proteomic data, Genome Biol., 4, R28.[CrossRef][Medline]
- Sorek, R., Shemesh, R., Cohen, Y., et al. 2004, A non-EST-based method for exon-skipping prediction, Genome Res., 14, 16171623.
[Abstract/Free Full Text] - Yeo, G., Holste, D., Kreiman, G., and Burge, C. B. 2004, Variation in AS across human tissues, Genome Biol., 5, R74.[CrossRef][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







