DNA Research Advance Access originally published online on February 7, 2008
DNA Research 2008 15(1):3-11; doi:10.1093/dnares/dsm034
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Markov Chain-based Promoter Structure Modeling for Tissue-specific Expression Pattern Prediction
1 Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
2 Department of Life Science, University of Hyogo, 3-2-1 Kouto, Kamigori, Ako-gun, Hyogo 678-1297, Japan
3 Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
Received 17 October 2007; accepted 11 December 2007.
| Abstract |
|---|
|
|
|---|
Transcriptional regulation is the first level of regulation of gene expression and is therefore a major topic in computational biology. Genes with similar expression patterns can be assumed to be co-regulated at the transcriptional level by promoter sequences with a similar structure. Current approaches for modeling shared regulatory features tend to focus mainly on clustering of cis-regulatory sites. Here we introduce a Markov chain-based promoter structure model that uses both shared motifs and shared features from an input set of promoter sequences to predict candidate genes with similar expression. The model uses positional preference, order, and orientation of motifs. The trained model is used to score a genomic set of promoter sequences: high-scoring promoters are assumed to have a structure similar to the input sequences and are thus expected to drive similar expression patterns. We applied our model on two datasets in Caenorhabditis elegans and in Ciona intestinalis. Both computational and experimental verifications indicate that this model is capable of predicting candidate promoters driving similar expression patterns as the input-regulatory sequences. This model can be useful for finding promising candidate genes for wet-lab experiments and for increasing our understanding of transcriptional regulation.
Key words: regulation of transcription; Markov chain; promoter modeling; in situ hybridization; transcription factor binding site
| 1. Introduction |
|---|
|
|
|---|
Gene expression in metazoans is regulated at many levels. Regulation of transcription is the first step in the cascade of regulation and is thus of great importance for our understanding of gene expression. Regulation of transcription is determined by the binding of transcription factors (TFs) to their corresponding TF binding sites (TFBSs), and regulatory sequences containing similar sets of TFBSs are expected to be under the control of similar sets of TFs and drive similar expression patterns. Hence, the identification of TFBSs has become a key factor in unraveling the transcriptional regulation mystery. Unfortunately, the identification of these cis-regulatory elements by wet-lab experiments is time-consuming and labor-intensive. Computational methods have come to the rescue, but both their sensitivity and selectivity are severely hampered by the nature of the target motifs: TFBSs tend to be short (typically six to 15 bp) and degenerate, while the sequence in which they are located can be over 10 kb in length. Looking only at the oligonucleotide sequence recognized by a TF, we can expect a number of biologically meaningless occurrences in almost every promoter sequence. The most popular way of modeling TFBSs is by position weight matrices.1
There is growing evidence that TFs do not work alone, but rather cooperate to confer a specific spatio-temporal expression pattern; for instance, TFs bind to sites located in close proximity to each other, the so-called cis-regulatory modules (CRMs). It is, therefore, not surprising that many approaches to improve the accuracy of tissue expression prediction have focused on clustered groups of predicted binding sites.7
–14
Zhao et al., for example, predicted regulatory modules in Caenorhabditis elegans based on the clustering of a set of motifs correlating with muscle-specific gene expression. Clusters are defined simply as sites of the motifs positioned within a certain distance from each other. Blanchette et al. described a more complex approach where large numbers of PWMs are used to find statistically significant clusters of phylogenetically conserved sites in windows of 100 to 2000 bp. However, focusing only on clustered groups of predicted binding sites might be too simplistic an approach to the problem of TFBS detection and regulatory region architecture modeling. First, most of these approaches do not take into account solitary sites at all, even though some of them are likely to be functional. Secondly, in many CRM-modeling approaches, additional features of TFBSs, such as orientation, positional bias with respect to the transcription or translation start site, and order are ignored, although a number of studies have illustrated the importance of these features for some TFBSs.15
–17
In a genome-wide analysis of TFBSs in the mouse genome, Sharov et al. found that a considerable number of TFBSs showed a significant bias in their orientation. Berendzen et al. studied the importance of position and orientation of cis-regulatory elements and promoter motifs in a number of species. Their results show that several known functional elements appear to be relatively enriched at defined sites in the promoter region. Terai and Takagi showed that it is possible to find motif combinations in yeast that are significantly associated with a certain expression profile if their order is restricted, whereas they are not associated if their order is not taken into account. Methods that have tried to use such features are few in number. The Dragon Promoter Mapper uses a number of motif features such as the orientation, the order, and distances between adjacent motifs.18
However, the distance to the transcription or translation start site is not taken into account, and the model might have difficulties with motifs that lack a conserved distance between their sites showing a general preference for a certain region within the promoter region. Methods as the one described by Ohler et al. use more diverse physical properties such as DNA bendability, GC content, or stacking energy in addition to predicted TATA-boxes and initiator sites.19
These methods, however, need hundreds of training sequences and focus only on the core promoter region, where the distances between functional elements are strongly conserved. In addition, the final goal of these programs is fundamentally different from ours. While we predict the promoters with a similar architecture as an input set of promoters, they merely predict the presence or absence of a promoter.
We introduce here a simple Markov chain-based promoter architecture model as an alternative to the existing CRM models. Our model is trained using an input set of promoter sequences and captures information about the orientation, the positional bias, and the order of predicted occurrences of motifs that are over-represented in the input sequences. Subsequently, the trained model is used to predict genes having similar expression patterns.
We applied our model to two promoter sequence datasets: a set of promoter sequences driving expression in pharyngeal muscle cells in C. elegans and a set of muscle-specific promoters in Ciona intestinalis. The muscle system of C. elegans has been extensively studied, and the regulatory regions and expression patterns of a number of genes are relatively well known. C. intestinalis is a chordate model organism that has shown to be very useful for the study of developmental and evolutionary biology, and recently a number of studies have focused on the transcriptional regulation of muscle-specific genes in this organism.20
–23
The availability of relatively well-annotated expression information for C. elegans and the recent interest in the Ciona muscle regulatory system have determined the choice of our datasets. For both sets we trained the model and used it to predict new candidate promoters with similar expression patterns as the input promoter sequences. Finally, our predictions were verified for their accuracy, using both available annotation data and new wet-lab experiments.
| 2. Methods |
|---|
|
|
|---|
2.1. Selection of input sequence datasets
The genomic set of C. elegans promoter sequences was obtained from WormMart (http://www.wormbase.org/, WormBase Release WS170). For each transcript, the 3000 bp upstream of the translation start site were downloaded, and overlapping upstream open reading frames (ORFs) were removed. Finally, repeats were masked using RepeatMasker (version 3.0; http://www.repeatmasker.org).24
For C. intestinalis, the genomic set of promoters was obtained from BioMart (Release JGI2).25
For each transcript, the 3000 bp upstream of the translation start site were downloaded, overlapping upstream ORFs removed, and repeats masked. The promoter sequences of 19 genes previously shown to be expressed in muscle were used to construct the input promoter data set (see Supplementary Material Section 1).
2.2. Identification of useful over-represented motifs
Over-represented motifs were predicted in each input dataset using the motif-finding programs, MEME,26
Weeder,27
and AlignACE.28
In order not to overlook any significant motifs, different runs on different regions of the input promoter sequences were done, for both strands as well as for single strands (see Supplementary Material Section 2). Finally, only motifs having a length between 6 and 15 bp and having more than five predicted occurrences in their input dataset were considered. Motifs that were obvious repeats (AT-repeats, AT-rich stretches, etc.) were removed. We further removed redundant motifs and selected up to 10 motifs that showed to be the most significantly over-represented in the input promoter sequences. This final set of motifs for each set of input promoter sequences was then used for further analysis and to construct the promoter architecture model.
To model the binding site preferences of each motif, we used a variable-order Bayesian network (VOBN), based on a method described previously.29
Basically, a VOBN model can be regarded as a PWM, with the only difference that a first-order dependency between positions in the motif is allowed. In this study, we have set the conditions for introducing a dependency very strict. Because of this, the majority of positions in the VOBN are of zero-th order, which implies that many motifs are basically modeled as PWMs. Only in cases where a strong dependency was found, a first-order dependency between the positions was introduced (see Supplementary Material Section 3). To model the background nucleotide frequencies of the promoter sequences, both for C. elegans and for C. intestinalis, an interpolated Markov model (IMM) was trained from the genomic set of promoter sequences of each organism. This was done using the procedure described by Salzberg et al.30
The IMM we used here is at most of eighth order. However, in cases where training data is scarce, an interpolation is made with a lower order. Thus the IMM takes advantage of the greater accuracy of higher-order models and, at the same time, avoids the problem of over-fitting caused by insufficient training data.
2.3. Positional bias and promoter sequence partitioning
Using the final set of motifs, for each dataset we scanned the input sequences for occurrences of each motif and represented them visually. As is discussed in the Results and discussion section, in many cases the occurrences of a motif show a certain tendency to appear in a certain region of the promoter sequences. Often, this is the region proximal to the transcription or translational start site; in other cases, it is a region further upstream. Keeping this heterogeneity in mind, it would be unwise to make one single model for the entire promoter structure, not taking into consideration the positional bias of some motifs.
Hughes et al. have introduced a measure for the degree of positional bias P for a motif,28
as calculated by Equation (1).
![]() | (1) |
2.4. Model training
First-order Markov chains were trained for both the proximal and distal region of the input sequences, and similarly for a set of negative control sequences. Since in practice for many organisms there is little to no information available on tissue-specific expression, the entire set of genomic promoter sequences with their predicted sites was used as negative control set. The conditional probabilities of the Markov chains will be denoted as
|
| (2) |
|
| (3) |
2.5. Scoring promoter sequences
For each promoter sequence to score, we took the regions with their motif sequence (from 3' to 5') and the LLR values corresponding to the particular region. The final score Scoretotal of a promoter sequence is then given by the following equation:
![]() | (4) |
|
As indicated above, the transitions between motifs that are more characteristic for input promoter sequences will have a high positive score, whereas transitions between motifs that are more characteristic for non-input promoter sequences will have a high negative score. Thus, promoters having similar sets of motifs as the input sequences, with similar orientations and orders as those of the input sequences, in the same region get a high positive score. High-scoring promoters have a structure similar to the input promoters and are, hence, assumed to drive similar expressions. As can be seen from Equation 4, the individual strength of predicted motif sites does not contribute to the score of a sequence.
2.6. In situ hybridization experiments
Mature adults of C. intestinalis were collected from harbors in Murotsu, Hyogo, Japan, and maintained in indoor tanks of artificial seawater at 18°C. Larvae were obtained as described previously20
and fixed overnight in 4% paraformaldehyde in 0.5 M NaCl, 0.1 M, pH 7.5, 3-(N-morpholino) propanesulfonic acid (MOPS) buffer prior to storage in 80% ethanol at –30°C. As the template to synthesize digoxigenin-labeled antisense RNA probes, cDNA clones were obtained from C. intestinalis Gene Collection Release 1.31
The RNA probes were synthesized using a DIG RNA labeling kit (Roche Diagnostics, Indianapolis, IN). In situ hybridization of whole-mount specimens was carried out as described previously.32
The larvae were mounted in 50% glycerol containing 2% 1,4-diazabicyclo-2,2,2-octane (DABCO) and observed under a confocal microscope LSM 510 (Zeiss).
| 3. Results and discussion |
|---|
|
|
|---|
3.1. C. elegans pharyngeal muscle promoter architecture model
In a set of 20 promoter sequences reported to drive expression in pharyngeal muscle cells of C. elegans, we predicted over-represented motifs (see Section 2). Table 1 shows the positional bias score of the found motifs, together with the 300 bp window in the input promoter sequences where the occurrences of each motif are the most abundant. None of these motifs showed a significant similarity to motifs in the JASPAR database.33
|
In a next step, the genome-wide set of promoters of C. elegans was scored by the trained model (see Methods). Of the 20 input promoters, 11 were in the 100 top-scoring sequences (see Supplementary Material Section 5). To verify the validity of this prediction, we used the expression annotation that can be found in WormBase. For the 100 highest scoring non-input genes having a tissue expression annotation (the first one being ranked 30th, the last one 606th out of 24 446 promoters), we determined which tissues were statistically over-represented. We found that the 100 top-scoring annotated non-input genes are enriched for genes expressed in pharyngeal muscles (10 genes, 4.1 expected by chance, P-value = 0.0025) and muscle tissue in general (42 genes, 30.9 expected by chance, P-value = 0.0072). There was also a slight enrichment for genes expressed in motor neurons (7 genes, 2.4 expected by chance, P-value = 0.0110), which are involved in the regulation of muscle contractions. It is not surprising to find the top-scoring genes to be enriched for not only pharyngeal muscle cells, but also muscle tissues in general, as the input genes tissue expression patterns were not restricted to only pharyngeal muscle cells but also included other muscle tissues. Table 2 shows the 10 top-scoring annotated non-input genes and their expression annotation as reported on WormBase. Five of these 10 genes are reported to be expressed in muscle tissue, one of them in pharyngeal muscles. In addition, some genes are reported to be expressed in neurons and motor neurons. It is known that muscle genes and neuronal genes share some regulatory elements, and other studies have reported similar observations.13
|
3.2. C. intestinalis muscle promoter architecture model
We predicted over-represented motifs in a set of 19 C. intestinalis promoter sequences known to drive expression in C. intestinalis muscle tissue. Table 3 shows the positional bias scores of the detected over-represented motifs. Motif Cin_M1 and motif Cin_M3 have consensus sequences that are highly similar to those of motifs that have been reported before as playing a crucial role in transcriptional regulation of muscle genes in C. intestinalis.20
|
For this model, predicted TATA-boxes were used as reference points instead of the translational start sites. Sequences in which we could not find a TATA-box within a reasonable distance of the coding region were anchored at the position 100 bp upstream of the translation start site. Given the preference of some motifs for the proximal region, the promoter architecture model was partitioned into a proximal part (from the translation start site to 250 bp upstream of the predicted TATA-box) and a distal part (from 250 bp upstream of the predicted TATA-box until the 5' end of the promoter sequence). Although our model does not include a direct way to model the clustering of sites, Table 2 illustrates that the proximal part of the promoter model was denser in motif occurrences than the distal part.
For both regions, a first-order Markov chain was trained (see Methods and Supplementary Material Section 4). The genomic set of promoter sequences of C. intestinalis was then scored using the trained model. The promoters were ranked by their score and the top-scoring genes selected for further analysis. Of the 19 input promoters, 16 are in the top 100 scoring sequences (see Supplementary Material Section 5). As a verification of the predictions, expression patterns of non-input genes from the top 50 scoring sequences were analyzed experimentally by in situ hybridization. Among the 29 non-input sequences in the top 50 list, three sequences, all of which encode muscle actin, were excluded from the analysis because their muscle-specific nature was obvious. Other two genes, whose expression patterns had been already known, were also excluded. Among the 24 sequences remaining, cDNA clones were available for four predicted sequences in C. intestinalis Gene Collection Release 1,31
and they were used to synthesize RNA probes for the in situ hybridization analysis. The results of these experiments are shown in Fig. 2. For three of the four tested genes, expression was observed in muscle cells in the tail of the C. intestinalis larva. A fourth gene showed expression in the central nervous system and mesenchyme, but not in muscle tissue.
|
3.3. Conclusion
We have introduced a simple promoter architecture model that uses the positional bias, the orientation bias, and the order of predicted sites of a set of motifs to predict promoter sequences that drive similar expression patterns as the input promoter sequences. As this model does not directly model the clustering of motifs, it can be considered as an alternative to the existing CRM-based models. During the training of the model, only one parameter needs to be set (i.e. the position of the boundary between the regions). We did not use any tissue-specific or organism-specific information in the construction of the model, so we can expect it to be applicable in other tissues and organisms as well. The fact that we could successfully predict expression profiles in two organisms illustrates the general applicability of our approach. Moreover, the motifs we used in the two datasets and described here were based solely on computational predictions, illustrating that this method does not require prior knowledge of the regulatory factors involved and their binding sites. Apart from a set of promoter sequences of co-regulated genes no other input data are needed. However, the structure of promoters driving expression in other tissues, such as the photoreceptor in C. intestinalis, has shown to be more challenging. Improvements to the model, such as the incorporation of additional information (e.g. the clustering of sites, the distance between pairs of sites, or evolutionary conservation) are likely to improve its prediction performance and are now being studied.
| Supplementary Data |
|---|
|
|
|---|
Supplementary data are available online at www.dnaresearch.oxfordjournals.org.
| Funding |
|---|
|
|
|---|
This work was supported by a Grant-in-Aid for Scientific Research from JSPS (17310114) and was supported in part by another grant (18370089) and by BIRD of Japan Science and Technology Agency (JST). A.V. is also supported by the Japanese Government Scholarship (Monbukagakusho; MEXT).
| Footnotes |
|---|
* To whom correspondence should be addressed. Tel. +81 3-5449-5131. Fax. +81 3-5449-5133. E-mail: knakai{at}ims.u-tokyo.ac.jp
| References |
|---|
|
|
|---|
- Stormo G. D., Schneider T. D., Gold L. M. Characterization of translational initiation sites in E. coli. Nucleic Acids Res (1982) 10:2971–2996.
[Abstract/Free Full Text] - Barash Y., Elidan G., Friedman N., et al. Modeling dependencies in protein-DNA binding sites. RECOMB 03 (2003) 28–37.
- Benos P. V., Bulyk M. L., Stormo G. D. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res (2002) 30:4442–4451.
[Abstract/Free Full Text] - Bulyk M. L., Johnson P. L., Church G. M. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res (2002) 30:1255–1261.
[Abstract/Free Full Text] - Tomovic A., Oakeley E. J. Position dependencies in transcription factor binding sites. Bioinformatics (2007) 23:933–941.
[Abstract/Free Full Text] - Zhou Q., Liu J. S. Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics (2004) 20:909–916.
[Abstract/Free Full Text] - Aerts S., Van Loo P., Thijs G., et al. Computational detection of cis-regulatory modules. Bioinformatics (2003) 19(Suppl 2):ii5–14.[Abstract]
- Bailey T. L., Noble W. S. Searching for statistically significant regulatory modules. Bioinformatics (2003) 19(Suppl 2):ii16–25.[Abstract]
- Blanchette M., Bataille A. R., Chen X., et al. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res (2006) 16:656–668.
[Abstract/Free Full Text] - Frith M. C., Li M. C., Weng Z. Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res (2003) 31:3666–3668.
[Abstract/Free Full Text] - Philippakis A. A., He F. S., Bulyk M. L. Modulefinder: a tool for computational discovery of cis-regulatory modules. Pac. Symp. Biocomput (2005) 519–530.
- Sinha S., van Nimwegen E., Siggia E. D. A probabilistic method to detect regulatory modules. Bioinformatics (2003) 19(Suppl 1):i292–301.[Abstract]
- Zhao G., Schriefer L. A., Stormo G. D. Identification of muscle-specific regulatory modules in Caenorhabditis elegans. Genome Res (2007) 17:348–357.
[Abstract/Free Full Text] - Zhou Q., Wong W. H. CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl. Acad. Sci. USA (2004) 101:12114–12119.
[Abstract/Free Full Text] - Berendzen K. W., Stuber K., Harter K., et al. Cis-motifs upstream of the transcription and translation initiation sites are effectively revealed by their positional disequilibrium in eukaryote genomes using frequency distribution curves. BMC Bioinformatics (2006) 7:522.[CrossRef][Medline]
- Sharov A. A., Dudekula D. B., Ko M. S. CisView: a browser and database of cis-regulatory modules predicted in the mouse genome. DNA Res (2006) 13:123–134.
[Abstract/Free Full Text] - Terai G., Takagi T. Predicting rules on organization of cis-regulatory elements, taking the order of elements into account. Bioinformatics (2004) 20:1119–1128.
[Abstract/Free Full Text] - Chowdhary R., Tan S. L., Ali R. A., et al. Dragon Promoter Mapper (DPM): a Bayesian framework for modelling promoter structures. Bioinformatics (2006) 22:2310–2312.
[Abstract/Free Full Text] - Ohler U., Niemann H., Liao G., et al. Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics (2001) 17(Suppl 1):S199–206.[Abstract]
- Kusakabe T., Yoshida R., Ikeda Y., et al. Computational discovery of DNA motifs associated with cell type-specific gene expression in Ciona. Dev. Biol (2004) 276:563–580.[CrossRef][ISI][Medline]
- Johnson D. S., Zhou Q., Yagi K., et al. De novo discovery of a tissue-specific gene regulatory module in a chordate. Genome Res (2005) 15:1315–1324.
[Abstract/Free Full Text] - Yagi K., Takatori N., Satou Y., et al. Ci-Tbx6b and Ci-Tbx6c are key mediators of the maternal effect gene Ci-macho1 in muscle cell differentiation in Ciona intestinalis embryos. Dev. Biol (2005) 282:535–549.[CrossRef][ISI][Medline]
- Meedel T. H., Chang P., Yasuo H. Muscle development in Ciona intestinalis requires the b-HLH myogenic regulatory factor gene Ci-MRF. Dev. Biol (2007) 302:333–344.[CrossRef][ISI][Medline]
- Smith F. A., Hubley R., Green P. 1996–2004, Repeatmasker Open-3.0.
- Dehal P., Satou Y., Campbell R. K., et al. The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science (2002) 298:2157–2167.
[Abstract/Free Full Text] - Bailey T. L., Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol (1994) 2:28–36.[Medline]
- Pavesi G., Mauri G., Pesole G. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics (2001) 17(Suppl 1):S207–214.[Abstract]
- Hughes J. D., Estep P. W., Tavazoie S., et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol (2000) 296:1205–1214.[CrossRef][ISI][Medline]
- Ben-Gal I., Shani A., Gohr A., et al. Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics (2005) 21:2657–2666.
[Abstract/Free Full Text] - Salzberg S. L., Delcher A. L., Kasif S., et al. Microbial gene identification using interpolated Markov models. Nucleic Acids Res (1998) 26:544–548.
[Abstract/Free Full Text] - Satou Y., Yamada L., Mochizuki Y., et al. A cDNA resource from the basal chordate Ciona intestinalis. Genesis (2002) 33:153–154.[CrossRef][ISI][Medline]
- Takimoto N., Kusakabe T., Horie T., et al. Distinct distribution of RPE65 and beta-carotene 15,15'-monooxygenase homologues in Ciona intestinalis. Photochem. Photobiol (2006) 82. Origin of the vertebrate visual cycle: III. 1468–1474.[Medline]
- Sandelin A., Alkema W., Engstrom P., et al. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res (2004) 32:D91–94.
[Abstract/Free Full Text] - Wasserman W. W., Fickett J. W. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol (1998) 278:167–181.[CrossRef][ISI][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



