DNA Research Advance Access originally published online on October 21, 2008
DNA Research 2008 15(6):387-396; doi:10.1093/dnares/dsn027
MetaGeneAnnotator: Detecting Species-Specific Patterns of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and Phage Genomes
Advanced Science and Technology Research Group, Mitsubishi Research Institute, Inc., 2-3-6 Otemachi, Chiyoda-ku, Tokyo 100-8141, Japan
Received 19 August 2008; accepted 24 September 2008.
Abstract
Recent advances in DNA sequencers are accelerating genome sequencing, especially in microbes, and complete and draft genomes from various species have been sequenced in rapid succession. Here, we present a comprehensive gene prediction tool, the MetaGeneAnnotator (MGA), which precisely predicts all kinds of prokaryotic genes from a single anonymous genomic sequence or a set of such sequences of various lengths. The MGA integrates statistical models of prophage genes, in addition to those of bacterial and archaeal genes, and also uses a self-training model built from the input sequences. As a result, the MGA sensitively detects not only typical genes but also atypical genes, such as horizontally transferred and prophage genes in a prokaryotic genome. In this paper, we also propose a novel approach for analyzing the ribosomal binding site (RBS), which enables us to detect species-specific patterns of the RBSs. The MGA has an RBS model based on this approach and precisely predicts the translation starts of genes. The MGA also improves prediction accuracy for short sequences by using adapted RBS models (96% sensitivity and 93% specificity for 700 bp fragments). These features of the MGA expedite a wide range of microbial genome studies, such as genome annotation and metagenome analyses.
Key words: bioinformatics; gene-finding; prokaryote; phage; ribosomal binding site
1. Introduction
Identification of genes on genomic sequences is the indispensable first step in every genome analysis, including the analysis of a single organism's genome and metagenomic analyses. Sequence similarity-based methods of gene prediction detect genes reliably if their DNA or amino acid sequences are strongly similar to those of known genes. However, a significant portion of genes have no sequence similarity to known genes, and ab initio gene-finding methods are necessary for identifying all genes on newly sequenced microbial genomes, particularly those of uncharacterized or poorly characterized species. Computational gene finding from genomic sequences has a long history [1]
Although conventional gene-finding tools have achieved extremely high prediction performance, they have some critical limitations. Most conventional tools require either predetermined statistical models of the known genes of a target species [4–11] or an input sequence long enough for the statistical models to be self-trained [12–16]. This is because the tools are designed to predict genes on complete genomes of several million base pairs. However, a target genomic sequence is not always long enough. For example, second-generation DNA sequencers, which have put high-throughput sequencing into practice, especially for microbial genomes [17,18], produce vast amounts of very short sequence reads. The short reads are assembled into longer contig sequences, but the contigs are usually still short (far shorter than 1 megabase, Mb) [19–22]. A fosmid clone, which has an insert length of ~40 kb, is another example of a short genomic sequence. Moreover, metagenomic analyses produce large amounts of short sequences derived from the genomes of multiple species. Most conventional gene-finding tools cannot be applied to such sequences. MetaGene [23] (MG) is one of the new tools that are applicable to gene prediction on such short anonymous sequences.
MG is a gene-finding program originally developed for metagenomic sequence data, which is a mixture of (short) sequences derived from various prokaryotic genomes. MG assumes correlations between the GC content and the di-codon frequencies of an input sequence, and enables us to predict genes accurately on short anonymous sequences without any training. MG can be successfully applied to a wide variety of prokaryotic genomic sequences [24–27], but two major limitations exist: one is the lack of a ribosomal binding site (RBS) model, and the other is lower sensitivity to atypical genes, whose codon usages differ from those of typical genes. When MG is applied to very short sequences containing one or two partial genes, these limitations are not significant. However, they are undesirable when MG is applied to longer genomic sequences for precise annotation. To overcome these limitations and to improve the usability of the program, we developed a new version of the MG, the MetaGeneAnnotator (MGA). The MGA has statistical models of prophage genes and can automatically detect them in addition to chromosome backbone genes, even when input genomic sequences have mosaic structures attributed to lateral gene transfers and/or phage infections. The MGA also has an adaptable RBS model based on sequences complementary to the 3' tail of 16S ribosomal RNA, and precisely predicts the translation starts of genes even when input genomic sequences are short and anonymous. These features of the MGA remarkably improve gene prediction accuracy on a wide range of prokaryotic genomes. Here, we report the results of a performance test of the MGA applied to various types of genomic sequences, such as complete genomes, plasmids and their subsequences of various lengths, under conditions of anonymity.
2. Materials and methods
2.1. Construction of prophage gene model
In addition to the bacterial and archaeal gene models of MG [23]
2.2. Procedures for predicting typical and atypical genes
The self-training model for typical genes is constructed as follows. Initially, genes are predicted using an optimal set of the di-codon regression models (bacterial, archaeal or prophage models). Then, these predicted genes are used for the self-training of the di-codon statistics of typical genes. The self-training model is defined as the weighted averages of the di-codon frequencies derived from the predicted genes, and from the regression models used for the initial prediction. A di-codon frequency of the self-training model, $f_{\mathrm{self}}$, is defined by the frequency of the predicted genes, $f_{\mathrm{pred}}$, and of the regression models, $f_{\mathrm{reg}}$, as follows:
[Equation (1)]
After training, four sets (self, bacteria, archaea and prophage) of di-codon frequencies are applied for scoring candidate genes. Unlike the original MG algorithm, each open-reading frame (ORF) is individually scored according to its own GC content in this step to detect atypical genes. Typical genes are expected to score highest with the self-training model, and atypical genes to score highest with one of the other models. Then, a maximal scoring combination of genes is calculated as the definitive prediction. While this ORF-by-ORF procedure is sensitive to atypical genes, it also admits more false positives into the prediction. Therefore, the ORF-by-ORF procedure is applied only to sequences longer than 5000 bp (containing multiple genes). For shorter sequences, the conventional procedure is applied, in which all ORFs are scored by one of the four sets of the di-codon models according to the GC content of the input sequence.
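The two rules in Section 2.2 can be illustrated with a minimal Python sketch. It is not the MGA implementation: the function names are hypothetical, and the mixing `weight` is a placeholder, because the weights of Equation (1) are not given in this text. Only the 5000 bp threshold and the choice between per-ORF and per-sequence GC content come from the description above.

```python
from typing import Dict

ORF_BY_ORF_MIN_LENGTH = 5000  # bp; threshold stated in Section 2.2


def self_training_frequencies(f_pred: Dict[str, float],
                              f_reg: Dict[str, float],
                              weight: float) -> Dict[str, float]:
    """Weighted average of two di-codon frequency tables.

    `weight` is a hypothetical mixing weight in [0, 1] applied to the
    predicted-gene frequencies; the weights used by the MGA are not
    specified here.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    keys = set(f_pred) | set(f_reg)
    return {k: weight * f_pred.get(k, 0.0) + (1.0 - weight) * f_reg.get(k, 0.0)
            for k in keys}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def scoring_gc(input_seq: str, orf_seq: str) -> float:
    """GC content used to select di-codon models for one candidate ORF.

    Inputs longer than 5000 bp use the ORF's own GC content (ORF-by-ORF);
    shorter inputs use the GC content of the whole input sequence.
    """
    if len(input_seq) > ORF_BY_ORF_MIN_LENGTH:
        return gc_content(orf_seq)
    return gc_content(input_seq)
```

With `weight = 1.0` the blended table reduces to the frequencies of the predicted genes, and with `weight = 0.0` to those of the regression model.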
2.3. The RBS map analysis and the RBS model construction
We defined nine hexamers derived from the following sequence, which is complementary to the tail of 16S rRNA, as the potential RBS motifs: G(A/T)(A/T)AGGAGGT(G/A)ATC. Starting from the left, the motifs were named Motif-1, ..., Motif-9 [e.g. Motif-3 is (A/T)AGGAG]. An exact match or a one-base mismatch to each motif was sought in the region upstream of a start codon, and the best-matching motif and its location were determined. In the RBS map analysis (see below), the upstream sequences of the annotated start codons, ranging from positions –2 to –21, were used for analysis. In the RBS prediction model, the upstream sequences of the start codons predicted in the previous step, ranging from positions –3 to –19, were used for model construction and prediction. The detected sequences were considered to be representative RBSs of the species, and the proportion of genes having representative RBSs (the RBS ratio, $w_{\mathrm{RBS}}$) was stored for use in scoring RBSs. Then, a two-dimensional frequency distribution of the representative RBSs was calculated to construct the RBS map. For the analysis, distances between the constructed RBS maps were defined as Euclidean distances, and the neighbor-joining method [29] was applied to cluster the RBS maps. This RBS map analysis was performed using 591 annotated microbial genomes obtained from the RefSeq database (Supplementary Table S1). As the RBS prediction model, a position weight matrix (PWM) for each motif was constructed using the representative RBS sequences detected earlier. In the prediction process, the RBS scores for all candidate genes were calculated using the constructed PWMs and the frequency distributions of the positions. Here, the RBS score, $S_{\mathrm{RBS}}$, was heuristically weighted using the frequency of a motif $m$, $w_m$, and the RBS ratio ($w_{\mathrm{RBS}}$) to reduce noise in less frequently used motifs.
[Equation (2)]
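The nine candidate motifs of Section 2.3 are the hexamer windows of the 14-position degenerate sequence G(A/T)(A/T)AGGAGGT(G/A)ATC. The following Python sketch enumerates them and searches an upstream region for the best match with at most one mismatch. The function names are hypothetical, and the tie-breaking rule (the lowest motif number and the leftmost position win among equally good matches) is an assumption, since the text does not state how ties are resolved.

```python
from typing import List, Optional, Tuple

# Degenerate sequence from Section 2.3; each entry lists the bases allowed
# at that position: G(A/T)(A/T)AGGAGGT(G/A)ATC
CONSENSUS: List[str] = ["G", "AT", "AT", "A", "G", "G", "A",
                        "G", "G", "T", "GA", "A", "T", "C"]
MOTIF_LEN = 6

# Motif-1 ... Motif-9 are the sliding hexamer windows over the sequence.
MOTIFS: List[List[str]] = [CONSENSUS[i:i + MOTIF_LEN]
                           for i in range(len(CONSENSUS) - MOTIF_LEN + 1)]


def mismatches(motif: List[str], hexamer: str) -> int:
    """Number of positions where the hexamer base is not allowed by the motif."""
    return sum(base not in allowed for allowed, base in zip(motif, hexamer))


def best_rbs(upstream: str, max_mismatch: int = 1) -> Optional[Tuple[int, int, int]]:
    """Return (motif number, offset in `upstream`, mismatches), or None.

    Ties are resolved in favour of the first hit found (lowest motif number,
    then leftmost offset); this ordering is an assumption of the sketch.
    """
    upstream = upstream.upper()
    best: Optional[Tuple[int, int, int]] = None
    for number, motif in enumerate(MOTIFS, start=1):
        for offset in range(len(upstream) - MOTIF_LEN + 1):
            mm = mismatches(motif, upstream[offset:offset + MOTIF_LEN])
            if mm <= max_mismatch and (best is None or mm < best[2]):
                best = (number, offset, mm)
    return best
```

The offset is measured from the left end of the supplied upstream string; converting it to a spacer length depends on the coordinate convention chosen for the region upstream of the start codon.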
2.4. Performance evaluation
The prediction performance of gene-finders was evaluated using several datasets, including the MetaGene dataset [23]. The MetaGene dataset consists of nine bacterial and three archaeal genomic sequences (Supplementary Table S2). In addition to these complete sequences, their subsequences (1 Mb; 500, 100, 40, 10, 5, 3 and 1 kb; and 700 and 100 bp) having 1× genome coverage (i.e. the total length of the subsequences is equal to the complete genome size) were also used for the evaluation. These sequences were not used for constructing the statistical models of the MGA. The ratios of true positives, including partially matching predictions with correct reading frames, to all annotated genes (sensitivity) and to all predicted genes (specificity) were used as indices for the evaluation. In addition, sensitivity to the start codons, in which only exactly matching predictions were counted as true positives, was also used.
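The three indices of Section 2.4 can be computed as in the following Python sketch. It makes one simplifying assumption that is not stated in the text: a prediction counts as a true positive when it shares the strand and stop position of an annotated gene, which implies the same reading frame; partial genes on sequence fragments would need an overlap-and-frame test instead. The `Gene` type and the `evaluate` function are hypothetical names.

```python
from typing import Dict, Iterable, NamedTuple


class Gene(NamedTuple):
    start: int   # position of the first base of the start codon
    stop: int    # position of the last base of the stop codon
    strand: str  # "+" or "-"


def evaluate(annotated: Iterable[Gene], predicted: Iterable[Gene]) -> Dict[str, float]:
    """Sensitivity, specificity and start-codon sensitivity as defined in 2.4.

    A gene-level true positive shares strand and stop position with an
    annotated gene (assumption of this sketch). A start-codon true positive
    must match the annotated gene exactly.
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    ann_keys = {(g.strand, g.stop) for g in annotated_set}
    pred_keys = {(g.strand, g.stop) for g in predicted_set}
    true_pos = len(ann_keys & pred_keys)
    exact = len(annotated_set & predicted_set)
    return {
        "sensitivity": true_pos / len(ann_keys) if ann_keys else 0.0,
        "specificity": true_pos / len(pred_keys) if pred_keys else 0.0,
        "start_sensitivity": exact / len(annotated_set) if annotated_set else 0.0,
    }
```

Note that specificity as defined here (true positives over all predicted genes) is the quantity often called precision elsewhere.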
3. Results and discussion
3.1. Predicting prophage genes
The MGA is based on the algorithm of the MG and utilizes logistic regression models of the GC content and the di-codon frequencies [23]
Because most prophage genes have codon frequencies similar to those of bacteria and archaea, MG (and probably other prokaryotic gene finders as well) can predict prophage genes with relatively high accuracy (Supplementary Table S3). However, Fig. 2A and B show that certain other (non-codon) properties of prophage genes differ from those of prokaryotic genes: prophage genes are generally shorter (~660 bp on average) than bacterial and archaeal genes (~940 bp on average), and most genes are organized in tandem (>90%). This means that gene densities are higher in prophage genomes than in prokaryotic genomes, and most genes are packed into a few operons. These observations and statistics, in addition to the prophage di-codon models, are used to predict prophage genes. As a result, the sensitivity of the MGA to prophage genes is remarkably improved (from 88 to 93%) without any decrease in specificity (90%) (Supplementary Table S3).
3.2. Predicting atypical genes
MG predicts genes using the di-codon frequencies (and other parameters) estimated by the GC content of an input genomic sequence. That is to say, all genes in the same genomic sequence are predicted by the same set of di-codon frequencies. In this procedure, typical genes can be accurately and specifically predicted, but atypical genes, such as horizontally transferred and prophage genes, cannot be detected because their di-codon frequencies are different from those of typical genes. To overcome this limitation, we employ an ORF-by-ORF procedure, in which each candidate ORF is treated as an individual anonymous sequence (Fig. 1B). This procedure assumes that every ORF has a potentially different origin and contributes to improving the sensitivities of the MGA to atypical genes.
To predict typical genes properly under the ORF-by-ORF procedure, we added a self-training model of di-codon frequencies to the logistic regression models (Fig. 1A). In the self-training model, di-codon frequencies are calculated from the genes initially predicted using the conventional scoring procedure of the MG, and then the weighted averages of the di-codon frequencies derived from the predicted genes and from the regression models are calculated as the di-codon frequencies of typical genes. The self-training model fits typical genes better than the regression models do, and improves both the sensitivity and the specificity of the MGA for typical genes.
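As an illustration of the first self-training step, the Python sketch below tabulates di-codon (adjacent in-frame codon pair) frequencies from a set of predicted gene sequences. Counting overlapping di-codons with a step of one codon and normalising over all counted di-codons are assumptions of this sketch, and `dicodon_frequencies` is a hypothetical name rather than part of the MGA.

```python
from collections import Counter
from typing import Dict, Iterable


def dicodon_frequencies(genes: Iterable[str]) -> Dict[str, float]:
    """Relative frequencies of in-frame di-codons across gene sequences.

    Each gene is read in frame from its first base; trailing bases that do
    not complete a codon are ignored.
    """
    counts: Counter = Counter()
    for gene in genes:
        gene = gene.upper()
        usable = len(gene) - len(gene) % 3  # whole codons only
        # Step by one codon so that consecutive di-codons overlap.
        for i in range(0, usable - 3, 3):
            counts[gene[i:i + 6]] += 1
    total = sum(counts.values())
    return {dicodon: n / total for dicodon, n in counts.items()} if total else {}
```

A table produced this way would play the role of the predicted-gene frequencies that are then averaged with the regression-model frequencies.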
To evaluate the effectiveness of these procedures, prediction performance was tested on the chromosome and plasmid of enterohemorrhagic Escherichia coli O157:H7 strain Sakai [30,31] (Supplementary Table S4). The sensitivities of the MGA are markedly higher than those of the MG, especially in S-loops, which are O157:H7 strain-specific regions identified from comparisons with the E. coli K12 genome and which contain many horizontally acquired virulence-related genes. Higher sensitivities are also observed for a large virulence plasmid (pO157). The specificities of the MGA are slightly lower than those of MG, but are still higher than those of GeneMarkS [16] and GeneMark.heuristics [32]. These results indicate that our ORF-by-ORF procedure works well for predicting atypical genes and can be applied to genomes having mosaic structures with high specificity.
3.3. Analyzing species-specific patterns of the RBS
The other notable feature of the MGA is an adaptable model of the RBS. An RBS, which is also known as the Shine-Dalgarno (SD) sequence [33], is located in the 5' flanking region of the start codon, and interacts with a part of the 3' end of 16S ribosomal RNA (rRNA) to control translation initiation of the gene. Although RBSs are complementary to the 3' tail of the 16S rRNA in every organism, their sequences (motifs) and preferred locations relative to start codons (or spacer lengths) differ slightly from organism to organism. In gene-finding programs, the Gibbs sampling algorithm is widely used for training the motifs and the spacer length distribution of the target species' RBSs [16,17], although this algorithm does not take into account the observation that the RBSs are complementary to the tail of the 16S rRNA. This approach basically assumes one motif and one frequency distribution of the spacer lengths in each species. However, our analysis suggests that this assumption is not appropriate for most species.
We examined the upstream sequences of annotated genes from 229 prokaryotic genomes and constructed RBS maps that show a two-dimensional frequency distribution of the best match motif (out of the nine candidate motifs we suggested) and the spacer lengths of the RBSs for each species. The average RBS map (Fig. 3) shows that Motif-3 is most frequently used, but all nine motifs are potential RBSs. The higher the motif number, the shorter the spacer lengths. This is reasonable because it means that the position of the main body of the 16S rRNA is fixed even if the hybridization position of 16S rRNA tail is moved.
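An RBS map of this kind, and the Euclidean distance used to compare maps between species, can be sketched in Python as follows. Each detected RBS is represented as a (motif number, spacer length) pair. Normalising by the number of detected RBSs, rather than by the number of genes, is an assumption of this sketch, and both function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]  # (motif number 1..9, spacer length in bases)


def rbs_map(hits: Iterable[Cell]) -> Dict[Cell, float]:
    """Two-dimensional relative frequency distribution of best-match RBSs."""
    counts = Counter(hits)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()} if total else {}


def map_distance(a: Dict[Cell, float], b: Dict[Cell, float]) -> float:
    """Euclidean distance between two RBS maps; absent cells count as 0."""
    cells = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in cells))
```

A matrix of such pairwise distances is the kind of input that a neighbor-joining implementation would take to cluster the maps.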
The observed patterns of the RBS maps vary from organism to organism, while phylogenetically related species show similar patterns (Figs 4 and 5). Although some species, such as Helicobacter pylori (Fig. 5A) and Buchnera aphidicola, predominantly use Motif-2 and -3 and are therefore consistent with the one-motif assumption described earlier, many other species show broader distributions. For example, some Firmicutes, including Clostridium (Fig. 5B), and Thermotogae show broad and clear patterns in their RBS maps. Some archaea, including methanogens (Fig. 5C), also show broad patterns, but the preferred motifs differ between these bacteria and archaea (e.g. Clostridium acetobutylicum prefers Motif-3 and -4, but Methanobrevibacter smithii prefers Motif-8). Overall, bacterial species tend to prefer motifs on the 3' side of the 16S rRNA tail, while archaeal species tend to prefer motifs on the 5' side of the tail. Only very weak signals of the RBS motifs are found in some species belonging to Bacteroidetes and Cyanobacteria (Fig. 5D). In these species, no other significant motif is found. These results suggest that our RBS map with nine fixed motifs is effective for capturing the species-specific pattern of the RBSs. Hence, we used this two-dimensional frequency distribution and the PWMs of the nine RBS motifs as the RBS model of the MGA. The parameters of the RBS model are estimated from the upstream sequences of predicted genes. To predict the RBSs on very short input sequences (which provide no training data), a general model of the RBS was manually constructed from the average RBS map and integrated into the MGA (Fig. 1A).
3.4. Prediction performances on long genomic sequences
The prediction performances of the MGA and conventional gene-finding tools based on unsupervised learning, such as GeneMarkS [16]
For complete genomes and 1 Mb subsequences, all prediction tools show almost identical sensitivities (~97%), while specificity is significantly higher in the MGA (93%) than in the others (90% in GeneMarkS and 86–87% in Glimmer3). In other words, the sensitivity of the MGA is potentially higher than that of the others at the same specificity level. Sensitivities to start codons are also nearly identical in the MGA (78%) and GeneMarkS (77%), but Glimmer3 shows lower values (72–75%), although both GeneMarkS and Glimmer3 use the Gibbs sampling procedure to train their RBS models. In contrast, the mean sensitivity to start codons in Glimmer3 is better than that in GeneMarkS on the other dataset (Table 1), which consists of six complete genomes (one archaeon and five bacteria) having relatively broad distributions in their RBS maps. The performance of the MGA is stable and exceeds that of the others on this dataset as well, especially for Clostridium acetobutylicum. In comparison with the original MG (Fig. 6B), the MGA remarkably improves sensitivities to both genes and start codons without reducing specificities. These results indicate that our simple RBS model works well for detecting various types of RBS.
3.5. Prediction performances on short genomic sequences
For sequences shorter than 1 Mb, the MGA retains high accuracy in every index (Fig. 6A). Both the sensitivities and the specificities of Glimmer3 are relatively high when input sequences are longer than 40 kb, but its start codon prediction degrades rapidly as the input sequences become shorter. This is because the Gibbs sampling algorithm requires a significant number of positive (RBS) sequences to detect the correct motif. A 40 kb sequence has <40 genes (or RBSs) on average, and the sensitivity to start codons declines to 57% in Glimmer3. GeneMarkS does not accept input sequences shorter than 1 Mb, probably because it has the same weak point as Glimmer3. Unlike the RBS models of these tools, the MGA assumes only nine hexamers as candidate RBSs, and relatively few sequences are needed to estimate the parameters of the RBS model. As a result, the MGA requires just 500 kb (or ~500 genes) to adapt the RBS model fully to the input sequence, and its sensitivity to start codons is sufficiently high (75%) even in 40 kb sequences. Furthermore, Fig. 6B and Table 2 show that the general RBS model of the MGA also works well for predicting the start codons of genes on very short sequences. Although most genes on 700 bp subsequences lack their 5' sequences (including the start codon and RBS), the results also indicate that the RBS model contributes to improving prediction specificity by deselecting false-positive translation starts.
3.6. Advantage of self-training using a set of genomic sequences
If multiple (short) input sequences can be assumed to be genomic sequences of the same species, prediction accuracies on these sequences are improved by self-training of the models, just as on a long genomic sequence (the MGA-s in Fig. 6B and C). Fig. 6C shows the relationship between prediction accuracy and the number of 40 kb sequences treated as a unit. Fig. 6C also suggests that a total of about 500 kb (10–20 × 40 kb) is needed for full adaptation of the RBS model, but the prediction accuracies improve steadily as the number of input sequences increases. When a sufficient number of sequences is available, the MGA provides prediction performance comparable to that of complete genome analyses, even if each sequence is very short (Table 2). So, if multiple contig sequences are obtained by sequencing a single species genome, or if metagenomic sequences are phylogenetically classified into groups using some classification methods [34]
3.7. Conclusion
As mentioned, the MGA successfully overcomes the limitations of the MG and achieves high prediction accuracy, especially in start codon prediction. Although some gene-finding tools that emphasize high sensitivity to start codons, such as GeneMarkS and Glimmer3, tend to sacrifice specificity to improve sensitivity, the RBS model of the MGA enables the sensitive detection of start codons without reducing specificity. Our RBS model is based on previous knowledge about the RBS and 16S rRNA, and requires little training data for estimating its parameters. As a result, the MGA can precisely predict genes even on short genomic sequences, unlike the other tools. Both typical and atypical genes can be sensitively and precisely detected while keeping high specificity. The MGA can detect not only chromosome backbone genes but also prophage genes, and provides a complete set of genes on a genomic sequence. The MGA also reports which di-codon model (bacteria, archaea, prophage or self) was selected for predicting each gene, and this information is helpful for further analyses of genes because it reflects statistical differences among the genes.
In addition to the precise prediction ability of the MGA, the RBS map analysis proposed here is helpful for genome annotation and is useful for analyzing translation initiation mechanisms and their evolution. It is important for annotators to understand the specific RBS pattern of a target species and its related species. The MGA can automatically extract the pattern, and outputs information on RBSs in addition to location information on genes. We believe that the MGA accelerates not only metagenomic analyses but also the annotation of all kinds of prokaryotic and phage genomes.
Availability
MetaGeneAnnotator is freely available at http://metagene.cb.k.u-tokyo.ac.jp.
Supplementary Data
Supplementary Data is available online at www.dnaresearch.oxfordjournals.org.
Acknowledgements
The original MetaGene was developed at Toshihisa Takagi laboratory (University of Tokyo). We thank Prof Tetsuya Hayashi (Miyazaki University) and Prof Ken Kurokawa (Tokyo Institute of Technology) for stimulating discussions.
Footnotes
* To whom correspondence should be addressed. Tel. +81 3-3277-0556. Fax. +81 3-3277-0568. E-mail: nog{at}mri.co.jp
References
1. Fickett J. W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. (1981) 10:5303–5318.
2. Gribskov M., Devereux J., Burgess R. R. The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res. (1984) 12:539–549.
3. Staden R. Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucleic Acids Res. (1984) 12:551–567.
4. Borodovsky M. Y., Sprizhitskii Y. A., Golovanov E. I., Aleksandrov A. A. Statistical patterns in primary structures of functional regions in the E. coli genome: III. Computer recognition of coding regions. Mol. Biol. (1986) 20:1145–1150.
5. Borodovsky M. Y., McIninch J. D. GeneMark: parallel gene recognition for both DNA strands. Comput. Chem. (1993) 17:123–153.
6. Krogh A., Mian I. S., Haussler D. A hidden Markov model that finds genes in E.coli DNA. Nucleic Acids Res. (1994) 22:4768–4778.
7. Salzberg S. L., Delcher A. L., Kasif S., White O. Microbial gene identification using interpolated Markov model. Nucleic Acids Res. (1998) 26:544–548.
8. Lukashin A. V., Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. (1998) 26:1107–1115.
9. Delcher A. L., Harmon D., Kasif S., White O., Salzberg S. L. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. (1999) 27:4636–4641.
10. Yada T., Nakao M., Totoki Y., Nakai K. Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov model. Bioinformatics (1999) 15:987–993.
11. Yada T., Totoki Y., Takagi T., Nakai K. A novel bacterial gene-finding system with improved accuracy in locating start codons. DNA Res. (2001) 8:97–106.
12. Hayes W. S., Borodovsky M. How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res. (1998) 8:1154–1171.
13. Audic S., Claverie J. M. Self-identification of protein-coding regions in microbial genomes. Proc. Natl. Acad. Sci. U. S. A. (1998) 95:10026–10031.
14. Baldi P. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. Bioinformatics (2000) 16:367–371.
15. Besemer J., Lomsadze A., Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Nucleic Acids Res. (2001) 29:2607–2618.
16. Delcher A. L., Bratke K. A., Powers E. C., Salzberg S. L. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics (2007) 23:673–679.
17. Schuster S. C. Next-generation sequencing transforms today's biology. Nat. Methods (2008) 5:16–18.
18. Hall N. Advanced sequencing technologies and their wider impact in microbiology. J. Exp. Biol. (2007) 210:1518–1525.
19. Dohm J. C., Lottaz C., Borodina T., Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. (2007) 17:1697.
20. Hernandez D., François P., Farinelli P., Østerås M., Schrenzel J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. (2008) 18:802.
21. Butler J., MacCallum I., Kleber M., et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. (2008) 18:810.
22. Zerbino B. R., Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graph. Genome Res. (2008) 18:821.
23. Noguchi H., Park J., Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. (2006) 34:5623–5630.
24. Kurokawa K., Itoh T., Kuwahara T., et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. (2007) 14:169–181.
25. Raes J., Foerstner K. U., Bork P. Get the most out of your metagenome: computational analysis of environmental sequence data. Curr. Opin. Microbiol. (2007) 10:490–498.
26. Schmeisser C., Steele H., Streit W. R. Metagenomics, biotechnology with non-culturable microbes. Appl. Microbiol. Biotechnol. (2007) 75:955–962.
27. Pop M., Salzberg S. L. Bioinformatics challenges of new sequencing technology. Trends Genet. (2008) 24:142–149.
28. Pruitt K. D., Tatusova T., Maglott D. R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. (2007) 35:D61–65.
29. Saitou N., Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. (1987) 4:406–425.
30. Hayashi T., Makino K., Ohnishi M., et al. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. (2001) 8:11–22.
31. Ohnishi M., Kurokawa K., Hayashi T. Diversification of Escherichia coli genomes: are bacteriophages the major contributors? Trends Microbiol. (2001) 9:481–485.
32. Besemer J., Borodovsky M. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. (1999) 27:3911–3920.
33. Shine J., Dalgarno L. The 3'-terminal sequence of Escherichia coli 16S ribosomal RNA: Complementary to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. U. S. A. (1974) 71:1342–1346.
34. McHardy A. C., Martin H. G., Tsirigos A., Hugenholtz P., Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods (2007) 4:63–72.
35. Krause L., Diaz N. N., Goesmann A., et al. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. (2008) 36:2230–2239.