DNA Research Advance Access originally published online on January 10, 2006
DNA Research 2005 12(5):281-290; doi:10.1093/dnares/dsi015
Novel Phylogenetic Studies of Genomic Sequence Fragments Derived from Uncultured Microbe Mixtures in Environmental and Clinical Samples
1Center for Information Biology, National Institute of Genetics, The Graduate University for Advanced Studies (Sokendai) Mishima, Shizuoka 411-8540, Japan
2Department of Bio-System Engineering, Faculty of Engineering, Yamagata University Yonezawa, Yamagata 992-8510, Japan
3Department of Bioinformatics and Genomes, Graduate School of Information Science, Nara Institute of Science and Technology Takayama, Ikoma, Nara 630-0101, Japan
4The Graduate University for Advanced Studies (Sokendai), Hayama Center for Advanced Research Hayama-cho, Kanagawa 240-0193, Japan
Received 11 July 2005; revised 2 September 2005
| Abstract |
|---|
A self-organizing map (SOM) was developed as a novel bioinformatics strategy for phylogenetic classification of sequence fragments obtained from pooled genome samples of uncultured microbes in environmental and clinical samples. This phylogenetic classification was possible without either orthologous sequence sets or sequence alignments. We first constructed SOMs for tetranucleotide frequencies in 210 000 5 kb sequence fragments obtained from 1502 prokaryotes for which at least 10 kb of genomic sequence has been deposited in public DNA databases. The sequences could be classified primarily according to phylogenetic groups without information regarding the species. We used the SOM method to classify sequence fragments derived from environmental samples of the Sargasso Sea and of an acidophilic biofilm growing in acid mine drainage. Phylogenetic diversity of the environmental sequences was effectively visualized on a single map. Sequences that were derived from a single genome but cloned independently could be reassociated in silico. G + C% has been used for a long period as a fundamental parameter for phylogenetic classification of microbes, but the G + C% is apparently too simple a parameter to differentiate a wide variety of known species. Oligonucleotide frequency can be used to distinguish the species because oligonucleotide frequencies vary significantly among their genomes.
Key words: self-organizing map; environmental samples; metagenome; phylogenetic classification; SOM
| 1. Introduction |
|---|
Most environmental microorganisms cannot be cultured easily under laboratory conditions. Genomes of uncultured organisms have remained mostly uncharacterized and are thought to contain a wide range of novel genes of scientific and industrial interest.1
| 2. Materials and Methods |
|---|
2.1. SOM methods
We previously modified the conventional Kohonen's SOM7 [...] $(\sigma_2/\sigma_1) \times I$, where $\sigma_1$ and $\sigma_2$ were the standard deviations of the first and second principal components, respectively. Weight vectors ($w_{ij}$) were set and updated as described previously.9
2.2. Nucleotide sequences
Nucleotide sequences were obtained from http://www.ddbj.nig.ac.jp/anoftp-e.html. When the number of undetermined nucleotides (Ns) in a sequence exceeded 10% of the window size, the sequence was omitted from the analysis. When the number of Ns was <10%, the oligonucleotide frequencies were normalized to the length without Ns and included in the analysis.
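The handling of undetermined nucleotides described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tetra_frequencies`, the treatment of every non-ACGT character as an N, and the choice to normalize by the number of N-free tetranucleotide words are all assumptions made for the example.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq, max_n_fraction=0.10):
    """Return tetranucleotide frequencies for one window.

    Returns None when undetermined bases exceed `max_n_fraction` of the
    window, mirroring the omission rule in the text (assumed threshold
    handling: a window at exactly 10% is kept).
    """
    seq = seq.upper()
    n_count = sum(1 for base in seq if base not in "ACGT")
    if n_count > max_n_fraction * len(seq):
        return None  # window omitted from the analysis

    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:  # words that contain an N are skipped
            counts[word] += 1

    total = sum(counts.values())
    if total == 0:
        return None
    # Normalizing by the number of N-free words stands in for
    # "normalized to the length without Ns" (an interpretation).
    return {word: count / total for word, count in counts.items()}
```

Because the result is a frequency vector rather than raw counts, windows of different lengths become comparable.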
| 3. Results |
|---|
3.1. SOMs for species-known prokaryotes
To investigate the clustering power of the SOM for a wide range of prokaryotic sequences, we first analyzed 81 prokaryotic genomes for which the complete sequence is available (a total of 226 Mb). To avoid overrepresentation of specific genomes, only one genome was used from among different strains of one species or from closely related species. SOMs were constructed with tri- and tetranucleotide frequencies for 226 000 non-overlapping 1 kb sequences and overlapping 5 kb sequences generated with a 1 kb sliding step. Such short fragments were tested because they can be cloned even from small quantities of DNA that are difficult to obtain, e.g. samples from extreme environments. To define the initial weight vectors, tri- and tetranucleotide frequencies in the 226 000 sequences were analyzed by PCA as described previously.9
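The two fragmentation schemes mentioned above (non-overlapping 1 kb windows and 5 kb windows advanced in 1 kb steps) can be illustrated with a small generator. The name `windows` and the decision to drop a trailing partial window are assumptions for this sketch.

```python
def windows(seq, size, step):
    """Yield (start, fragment) for every full-length window of `size`.

    A trailing fragment shorter than `size` is dropped (an assumption;
    the text does not say how genome ends were handled).
    """
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


# Non-overlapping 1 kb fragments: size=1000, step=1000.
# Overlapping 5 kb fragments with a 1 kb sliding step: size=5000, step=1000.
```

With a 1 kb step, both schemes produce roughly one fragment per kilobase of genome, which is consistent with a single fragment count being quoted for 226 Mb of sequence.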
SOM recognized in sequence fragments the key combinations of short oligonucleotide frequencies that are the signature feature of each genome12
3.2. Diagnostic oligonucleotides for species separations
Analysis of the weight vectors of individual lattice points in Fig. 1 revealed that vectors with strongly biased frequencies were located on the edge of the map. The G + C% calculated from each lattice vector was reflected in the horizontal axis and increased from left to right in the 5 kb DegeTetra-SOM (G + C% in Fig. 2). In other words, sequences with high G + C% (red), which were derived primarily from GC-rich genomes, were clustered on the right side of the map. To visualize the species-specific patterns of oligonucleotide frequencies recognized by the SOM, the frequency of each pair of complementary tetranucleotides at each lattice point was calculated, divided into five levels containing equal numbers of lattice points, and displayed in shades of red or blue; seven examples of the diagnostic patterns of the species separations are presented in Fig. 2. Transitions between red (overrepresentation) and blue (underrepresentation) for various tetranucleotides often coincided with borders between species. To clarify the biological implications of the diagnostic tetranucleotides, we examined the association between the levels of palindromic tetranucleotides and restriction enzyme systems. According to the REBASE restriction enzyme database (http://rebase.neb.com/rebase/rebase.html), 24 four-base cutter enzymes could be identified for the 81 prokaryotes analyzed. The restriction site tetranucleotides were underrepresented for 23 of the 24 enzymes (see the blue zones marked with light blue arrows in the palindromic tetranucleotide panels in Fig. 2). Underrepresentation of a restriction site tetranucleotide in a genome carrying the enzyme gene may be related to the potential danger of self-damage even in the presence of methylation systems. This result showed that the SOM effectively classified sequences according to biological categories, and that a complex combination of many oligonucleotides should contribute to species separation.
For example, a wide variety of oligonucleotide sequences are known to function as genetic signals (e.g. regulatory signals for gene expression), and some of the important signal sequences may be biased significantly toward non-random occurrence and presumably be diagnostic for species separation.
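The pairing of complementary tetranucleotides used for these panels can be sketched as below. The example assumes that a "pair" means a word and its reverse complement, so that palindromic words such as GATC are their own partner; the helper names are hypothetical and not taken from the paper.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def canonical(word):
    """Represent a word and its reverse complement by a single key."""
    return min(word, reverse_complement(word))


def merge_complementary(freqs):
    """Sum the frequencies of each complementary pair of words."""
    merged = {}
    for word, value in freqs.items():
        key = canonical(word)
        merged[key] = merged.get(key, 0.0) + value
    return merged


ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
# 16 palindromes plus 120 complementary pairs give 136 merged classes.
DEGENERATE_KEYS = sorted({canonical(w) for w in ALL_TETRAMERS})
PALINDROMES = [w for w in ALL_TETRAMERS if w == reverse_complement(w)]
```

Merging in this way makes the frequency vector independent of which DNA strand a fragment was read from. The palindromic subset contains the four-base restriction sites discussed above.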
3.3. Phylogenetic classification of species-known sequences
In the phylogenetic classification of unculturable and poorly characterized species, classification into phylotypes rather than into individual species is important. Therefore, focusing on the DegeTetra-SOMs in Fig. 1A, we examined the classification of sequences from the 81 prokaryotes into 13 major phylotypes by referring to the DNA Data Bank of Japan (DDBJ) Taxonomy Database (http://sakura.ddbj.nig.ac.jp/uniTax.html). Lattice points that contained sequences from one phylotype are indicated in color, and those that included sequences from more than one phylotype are shown in black (Fig. 1B). The number of black points was lower than in the species classification. On the 5 kb DegeTetra-SOM, 88% of the sequences were classified into the correct phylotype territory (Table 1), which was defined as a continuous territory represented by a single color for the phylotype. Species that showed close phylogenetic relations in one phylotype tended to have neighboring zones and produced a continuous territory in the phylotype classification. It should also be noted that different species belonging to one phylotype were often separated from each other. One cause of this separation was related to differences in genome G + C% between species of a single phylotype. For example, we observed a separation between GC-rich and AT-rich proteobacteria.
It is worth noting that paired-end reads in DNA sequencing, such as two 500 nt sequences from one cloned fragment, can be used as a single 1 kb sequence in the calculation of oligonucleotide frequencies. When oligonucleotide frequencies were calculated and normalized for sequence length, 1 kb sequences could be mapped on the 5 kb SOM (1 kb on 5 kb SOM in Table 1). The proportions of 1 kb sequences classified into the correct phylotype were 69 and 85% on the 1 and 5 kb DegeTetra-SOMs, respectively, indicating that for phylogenetic classification of species-unknown sequences, even short sequences (e.g. 1 kb) must be mapped on an SOM constructed with longer species-known sequences (e.g. 5 kb). The higher hit rate on the 5 kb SOM is presumably because the SOM can extract species-specific characteristics of oligonucleotide frequencies more accurately from longer sequences and because the statistical fluctuation of oligonucleotide frequencies decreases as the analyzed sequences become longer.
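Mapping a short fragment onto an SOM trained with longer fragments amounts to finding the lattice point whose weight vector is closest to the fragment's length-normalized frequency vector. The sketch below assumes Euclidean distance and a lattice stored as a dictionary keyed by (i, j); neither detail is stated in the text, and `best_matching_unit` is a hypothetical name.

```python
import math


def best_matching_unit(vector, lattice):
    """Return the (i, j) index of the weight vector closest to `vector`.

    `lattice` maps (i, j) -> weight vector (a sequence of floats with the
    same length and ordering as `vector`). Euclidean distance is assumed.
    """
    best_index, best_dist = None, math.inf
    for index, weight in lattice.items():
        dist = math.dist(vector, weight)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index
```

Because both the 1 kb query and the 5 kb training fragments are expressed as frequencies rather than counts, they can be compared directly, although the shorter query carries more statistical noise.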
For phylogenetic classification of novel sequences from environmental samples, SOMs should be constructed in advance with all available sequences of known species rather than only those from completely sequenced genomes. We thus constructed a DegeTetra-SOM with 210 000 non-overlapping 5 kb sequences (a total of 1.05 Gb) from 1502 species-known prokaryotes for which at least 10 kb of sequence has been deposited in DDBJ/EMBL/GenBank. These 1502 prokaryotes were classified into 25 phylotypes with reference to the NCBI Taxonomy Database. The classification according to phylotype was apparent (Species-known Seq. in Fig. 3A).
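The coloring rule used for the phylotype maps (one color when a lattice point holds sequences from a single phylotype, black when it holds several) can be expressed compactly. This is an illustrative sketch: `label_lattice`, `correct_fraction`, and the `"mixed"` label are hypothetical, and the accuracy function ignores the additional requirement that a territory be continuous.

```python
from collections import defaultdict


def label_lattice(assignments):
    """Label each occupied lattice point.

    `assignments` is an iterable of ((i, j), phylotype) pairs, one per
    mapped sequence. A point gets its phylotype when all sequences agree,
    otherwise "mixed" (shown in black on the maps).
    """
    seen = defaultdict(set)
    for point, phylotype in assignments:
        seen[point].add(phylotype)
    return {
        point: next(iter(types)) if len(types) == 1 else "mixed"
        for point, types in seen.items()
    }


def correct_fraction(assignments):
    """Fraction of sequences that sit on a point labeled with their own phylotype."""
    assignments = list(assignments)
    if not assignments:
        return 0.0
    labels = label_lattice(assignments)
    hits = sum(1 for point, phylotype in assignments if labels[point] == phylotype)
    return hits / len(assignments)
```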
3.4. Phylogenetic classification of environmental sequences
One goal of metagenomic studies is to characterize and hopefully reconstruct multiple genomes, at least for dominant species in an environment, by sequencing a large number of DNA fragments obtained from an environmental sample. Such an approach should allow extensive surveys of sequences useful in scientific and industrial applications and assist in developing accurate views of the ecology of uncultured microbes. The traditional homology-based phylogenetic classification methods, however, have inevitably focused on well-characterized genes such as rDNA, for which orthologs from a wide range of phylotypes are available, but most well-characterized genes, including rDNA, are not industrially attractive. It would be best if microbial diversity could be assessed during the process of screening for novel genes with industrial and scientific significance. We developed such a method.
Venter et al.11 applied shotgun sequencing to mixed genomes collected from the Sargasso Sea near Bermuda and deposited 811 000 sequence fragments (a total of 1 Gb) in DDBJ/EMBL/GenBank. These environmental sequences were analyzed with the newly developed SOM method. A majority of Sargasso sequence entries in the database corresponded to single-pass read sequences; however, a limited number represented contigs assembled from multiple sequences derived from dominant genomes in this environment. We first selected 4258 contig entries longer than 5 kb and mapped 34 000 fragments of 1 kb derived from these contigs on the aforementioned SOM constructed for the 1502 known prokaryotes (Sargasso Seq. >5 kb in Fig. 3A). Distinct clusters of Sargasso sequences were observed, and all dominant species reported previously11 were included in these clusters. We then mapped 218 400 fragments of 1 kb derived from 134 600 entries longer than 1 kb and all 811 000 sequences (Sargasso Seq. >1 kb and All Sargasso Seq. 3D, respectively). Although the sequences were not clustered tightly, skewed distributions were observed. This is more easily visualized with a 3D representation in which the number of Sargasso sequences classified into each lattice point is indicated by the height of a bar. Zones with abundant Sargasso sequences were located at the bottom left of the SOM, which is an area that contains AT-rich sequences. In this zone, the individual territories of known phylotypes were relatively small and complex (Species-known Seq. in Fig. 3A), indicating that sequences from various poorly characterized species were crowded there. The A + T-rich Sargasso sequences derived from novel species might have been mapped there because of the high heterogeneity of the species-known sequences in this zone.
To investigate this possibility, we next constructed a DegeTetra-SOM with the species-known sequences plus Sargasso sequences longer than 1 kb. In addition, to confirm in silico reassociation of sequences derived from a single genome but cloned independently in a metagenomic library, we included sequence fragments derived from microbe mixtures in an acidophilic biofilm growing in acid mine drainage,16 which is a worldwide environmental problem. Because of the low complexity of the mixed genomes, Tyson et al.16 focused on this biofilm in order to reconstruct dominant genomes with shotgun sequencing of the metagenomic library. In the Species-known and environmental Seq. panel in Fig. 3B, lattice points that contained only Sargasso sequences and those that contained only biofilm sequences are indicated by separate symbols; those that contained sequences from a single known phylotype are colored as described in Fig. 3A; and those that included sequences from more than one phylotype or both environmental and species-known sequences are shown in black. Most of the biofilm sequences derived from the low-complexity metagenome library were located in a few distinct territories (Biofilm Seq. in Fig. 3B), confirming that most sequence fragments derived from a single genome but cloned independently can be reassociated in silico. This provides the rationale for phylogenetic classification of sequences derived from a high-complexity library such as the Sargasso sequences. The territories of species-known sequences in Fig. 3B shrank appreciably when compared with those in Fig. 3A, indicating that a large portion of Sargasso sequences had oligonucleotide frequencies distinct from those of species-known prokaryotes. The number of Sargasso sequences classified into lattice points containing no species-known sequences is indicated by the height of the bar (Sargasso Seq. unclassified in Fig. 3B); 79% of the Sargasso sequences belonged to this phylotype-unclassified category. These Sargasso sequences should correspond to sequences derived from genomes that are poorly characterized at present. The remaining 21% of Sargasso sequences were associated with species-known sequences (black lattices in Species-known and environmental Seq. in Fig. 3B). The number of Sargasso sequences classified into lattice points that contained species-known sequences from only a single phylotype is indicated by the height of the bar colored to indicate the phylotype (Sargasso Seq. classified). On our Web page (http://lavender.genes.nig.ac.jp/takaabe/DNAres/SPT1.htm), 92 genera whose sequences were associated with the Sargasso sequences are listed together with the numbers of associated Sargasso sequences. It should be noted that most of the Sargasso sequences have not been characterized phylogenetically because there were no orthologs for these novel sequences. Detailed phylotype predictions for individual Sargasso sequences with the SOM method are available on our Web page (http://lavender.genes.nig.ac.jp/takaabe/DNAres/SPT2.htm).
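The phylotype-unclassified share reported above is the fraction of environmental sequences that land on lattice points holding no species-known sequence. A minimal sketch of that bookkeeping, with the hypothetical name `unclassified_fraction` and lattice points given as (i, j) tuples:

```python
def unclassified_fraction(env_points, known_points):
    """Fraction of environmental sequences on points with no known sequence.

    `env_points` lists the lattice point of each environmental sequence
    (one entry per sequence); `known_points` lists the lattice points
    occupied by at least one species-known sequence.
    """
    env_points = list(env_points)
    if not env_points:
        return 0.0
    known = set(known_points)
    outside = sum(1 for point in env_points if point not in known)
    return outside / len(env_points)
```

For example, four environmental sequences of which three fall on points without any species-known sequence give a value of 0.75.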
| 4. Discussion |
|---|
G + C% has been used for a long period as a fundamental parameter for phylogenetic classification of microbes, but the G + C% is apparently too simple a parameter to differentiate a wide variety of known species. Oligonucleotide frequency can be used to distinguish genomes because oligonucleotide frequencies vary significantly among genomes.12
When we consider phylogenetic classification of sequences from uncultured microbes in a very complex ecosystem, it will become important to construct SOMs in advance with all available prokaryotic and eukaryotic sequences. For example, a certain portion of Sargasso sequences may be derived from eukaryotic genomes such as those of fungi, protozoa and fishes. Furthermore, when microorganisms that have a symbiotic or parasitic relationship with a macroorganism are studied, sequences from the macroorganism may be included in the sequence collection. Specimens in medical studies may also contain various eukaryotic DNAs. To test the SOM separation of prokaryotic and eukaryotic sequences with special reference to Sargasso sequences, we constructed a DegeTetra-SOM with 5 kb sequences from the 1502 prokaryotes plus those from 11 unicellular eukaryotes (6 fungi and 5 protozoa) and zebrafish (Fig. 4A). The power of the SOM to separate prokaryotic from eukaryotic sequences was very high; only 0.1% of prokaryotic sequences were classified into eukaryotic territories. Furthermore, separation among eukaryotic species was apparent (Eukaryote Seq. separately colored). Next we mapped Sargasso sequences longer than 1 kb on this SOM (Sargasso Seq. mapped). Although a major portion of the Sargasso sequences were classified into specific prokaryote territories, 9.9% were classified into eukaryotic territories. These corresponded mainly to territories of unicellular eukaryotes, where the Sargasso sequences were mapped 13 times more densely than in the zebrafish territory.
Genome segments introduced through horizontal gene transfer from a phylogenetically distant genome tend to retain the sequence characteristics of the donor genome and can be distinguished from those of the host genome. Even in the 5 kb DegeTetra-SOM in Fig. 1B, there are lattice points marked in black which should contain sequences with tetranucleotide frequencies distinct from the major portion of the host genome. Such sequences may correspond, at least in part, to segments transferred horizontally from a distant phylotype. When we investigated Bacillus subtilis sequences located outside the Firmicutes territory, we found that many were derived from pathogenic islands where prophage- and foreign-type sequences are clustered.19
When we focused on 16S rDNA sequences from species-unidentified prokaryotes in GenBank, we found 16 652 sequences longer than 1 kb. Because the main purpose of 16S rDNA sequencing of the unidentified prokaryotes in environments is phylogenetic assignment, phylotypes assigned with the homology-based method are often annotated in GenBank. We selected 6855 sequences for which classification into major phylotypes is annotated and mapped these sequences on the species-known 16S rDNA Tetra-SOM (Species-unknown 16S rDNA in Fig. 4B). Lattice points that contained sequences of a single phylotype annotated are indicated in color representing the phylotype, and those that included sequences from more than one phylotype are indicated in black. The major zones were marked in color, and the color pattern was almost identical to that of known species. Although the SOM does not require sequence alignments, assignments were practically identical to those obtained by the homology-based method for this case where a large number of orthologous sequences are available. When extensive shotgun sequencing of an environmental sample is done, substantial amounts of both non-rDNA and rDNA sequences become available. Combined SOM analyses of rDNA and non-rDNA sequences may provide detailed information regarding microbial diversity on the basis of detailed knowledge of rDNA phylogeny. This may also solve, at least in part, complications caused by horizontal transfer of sequences. For example, when no rDNA sequences are found in a certain phylotype territory in the rDNA SOM, but a statistically significant number of non-rDNA sequences is found in the respective phylotype territory in the SOM constructed with usual genomic sequences, these non-rDNA sequences may be candidates for horizontally transferred sequences. 
When a statistically significant number of rDNA sequences map to a certain phylotype territory in the rDNA SOM and are assigned to one species with the homology-based method on the basis of the detailed rDNA phylogeny, a major portion of the non-rDNA sequences mapped to this phylotype territory on the standard SOM can be predicted to be derived from that species. Collectively, the combination of SOMs for rDNA and non-rDNA sequences should make it possible to reassociate in silico the rDNA and non-rDNA sequences derived from a single genome but cloned independently.
| Supplementary material |
|---|
Supplementary material is available online at http://dnaresearch.oxfordjournals.org
| Acknowledgements |
|---|
This work was supported by grants from ACT-Japan Science and Technology Corporation and the Advanced and Innovational Research Program in Life Sciences and a Grant-in-Aid for Scientific Research on Priority Areas Applied Genome from the Ministry of Education, Culture, Sports, Science and Technology of Japan. A part of the calculation was done with the Earth Simulator of Japan Agency of Marine-Earth Science and Technology.
| Footnotes |
|---|
*To whom correspondence should be addressed. Tel. +81-55-981-6835, Fax. +81-55-981-6896, Email: takaabe{at}genes.nig.ac.jp
| References |
|---|
1. Amann, R. I., Ludwig, W., Schleifer, K. H. 1995, Phylogenetic identification and in situ detection of individual microbial cells without cultivation, Microbiol. Rev., 59, 143–169.
2. Hugenholtz, P. and Pace, N. R. 1996, Identifying microbial diversity in the natural environment: a molecular phylogenetic approach, Trends Biotechnol., 14, 190–197.
3. Rondon, M. R., August, P. R., Bettermann, A. D., et al. 2000, Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms, Appl. Environ. Microbiol., 66, 2541–2547.
4. Lorenz, P., Liebeton, K., Niehaus, F., Eck, J. 2002, Screening for novel enzymes for biocatalytic processes: accessing the metagenome as a resource of novel functional sequence space, Curr. Opin. Biotechnol., 13, 572–577.
5. DeLong, E. F. 2002, Microbial population genomics and ecology, Curr. Opin. Microbiol., 5, 520–524.
6. Schloss, P. D. and Handelsman, J. 2003, Biotechnological prospects from metagenomics, Curr. Opin. Biotechnol., 14, 303–310.
7. Kohonen, T. 1990, The self-organizing map, Proc. IEEE, 78, 1464–1480.
8. Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J. 1996, Engineering applications of the self-organizing map, Proc. IEEE, 84, 1358–1384.
9. Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T., Ikemura, T. 2003, Informatics for unveiling hidden genome signatures, Genome Res., 13, 693–702.
10. Kanaya, S., Kinouchi, M., Abe, T., et al. 2001, Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome, Gene, 276, 89–99.
11. Venter, J. C., Remington, K., Heidelberg, J. F., et al. 2004, Environmental genome shotgun sequencing of the Sargasso Sea, Science, 304, 66–74.
12. Karlin, S., Campbell, A. M., Mrazek, J. 1998, Comparative DNA analysis across diverse genomes, Annu. Rev. Genet., 32, 185–225.
13. Nussinov, R. 1984, Doublet frequencies in evolutionary distinct groups, Nucleic Acids Res., 12, 1749–1763.
14. Gentles, A. J. and Karlin, S. 2001, Genome-scale compositional comparisons in eukaryotes, Genome Res., 11, 540–546.
15. Teeling, H., Meyerdierks, A., Bauer, M., Amann, R., Glockner, F. O. 2004, Application of tetranucleotide frequencies for the assignment of genomic fragments, Environ. Microbiol., 6, 938–947.
16. Tyson, G. W., Chapman, J., Hugenholtz, P., et al. 2004, Community structure and metabolism through reconstruction of microbial genomes from the environment, Nature, 428, 37–43.
17. Uchiyama, T., Abe, T., Ikemura, T., Watanabe, K. 2005, Substrate-induced gene-expression screening of environmental metagenome libraries for isolation of catabolic genes, Nat. Biotechnol., 23, 88–93.
18. Hayashi, H., Abe, T., Sakamoto, M., et al. 2005, Direct cloning of genes encoding novel xylanases from human gut, Can. J. Microbiol., 51, 251–259.
19. Kunst, F., Ogasawara, N., Moszer, I., et al. 1997, The complete genome sequence of the gram-positive bacterium Bacillus subtilis, Nature, 390, 249–256.