

DNA Research Advance Access originally published online on January 10, 2006
DNA Research 2005 12(5):281-290; doi:10.1093/dnares/dsi015

© The Author 2006. Kazusa DNA Research Institute
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Novel Phylogenetic Studies of Genomic Sequence Fragments Derived from Uncultured Microbe Mixtures in Environmental and Clinical Samples

Takashi Abe1,*, Hideaki Sugawara1, Makoto Kinouchi2, Shigehiko Kanaya3 and Toshimichi Ikemura4

1Center for Information Biology, National Institute of Genetics, The Graduate University for Advanced Studies (Sokendai) Mishima, Shizuoka 411-8540, Japan
2Department of Bio-System Engineering, Faculty of Engineering, Yamagata University Yonezawa, Yamagata 992-8510, Japan
3Department of Bioinformatics and Genomes, Graduate School of Information Science, Nara Institute of Science and Technology Takayama, Ikoma, Nara 630-0101, Japan
4The Graduate University for Advanced Studies (Sokendai), Hayama Center for Advanced Research Hayama-cho, Kanagawa 240-0193, Japan

Received 11 July 2005; revised 2 September 2005


    Abstract
 
A self-organizing map (SOM) was developed as a novel bioinformatics strategy for phylogenetic classification of sequence fragments obtained from pooled genome samples of uncultured microbes in environmental and clinical samples. This phylogenetic classification was possible without either orthologous sequence sets or sequence alignments. We first constructed SOMs for tetranucleotide frequencies in 210 000 5 kb sequence fragments obtained from 1502 prokaryotes for which at least 10 kb of genomic sequence has been deposited in public DNA databases. The sequences could be classified primarily according to phylogenetic groups without information regarding the species. We used the SOM method to classify sequence fragments derived from environmental samples of the Sargasso Sea and of an acidophilic biofilm growing in acid mine drainage. Phylogenetic diversity of the environmental sequences was effectively visualized on a single map. Sequences that were derived from a single genome but cloned independently could be reassociated in silico. G + C% has been used for a long period as a fundamental parameter for phylogenetic classification of microbes, but the G + C% is apparently too simple a parameter to differentiate a wide variety of known species. Oligonucleotide frequency can be used to distinguish the species because oligonucleotide frequencies vary significantly among their genomes.

Key words: self-organizing map; environmental samples; metagenome; phylogenetic classification; SOM


    1. Introduction
 
Most environmental microorganisms cannot be cultured easily under laboratory conditions. Genomes of uncultured organisms have remained mostly uncharacterized and are thought to contain a wide range of novel genes of scientific and industrial interest [1–6]. Metagenomic approaches, which analyze mixed populations of uncultured microbes, have been developed to identify novel and industrially useful genes and to study microbial diversity in a wide variety of environments. With the metagenomic approach, genomic DNAs are extracted directly from an environmental sample containing multiple organisms, and the DNA fragments are cloned and sequenced. This is a powerful strategy for comprehensive analysis of biodiversity in an ecosystem [2,4–6]. However, with a simple collection of many sequence fragments, it is difficult to predict from which phylotypes individual sequences are derived or to assess the phylogenetic novelty of the individual sequences. This is because the conventional phylogenetic classification of sequences is based on sequence homology searches, which require orthologous sequence sets; therefore, this strategy cannot be applied to poorly characterized or novel sequences. In the present study, a new phylogenetic classification method that does not require orthologs was developed on the basis of an unsupervised neural network algorithm, Kohonen's self-organizing map (SOM) [7,8]. The conventional SOM was previously modified for genome informatics to make the learning process and resulting map independent of the order of data input [9,10], and the modified SOM method was developed here for phylogenetic classification of genomic sequences from environmental and clinical samples. Sequence fragments derived from environmental samples of the Sargasso Sea near Bermuda [11] could be classified into phylotypes.


    2. Materials and Methods
 
2.1. SOM methods
We previously modified the conventional Kohonen's SOM [7,8] for genome informatics to make the learning process and resulting map independent of the order of data input [9,10]. When we constructed the SOMs for di-, tri- and tetranucleotide frequencies in 10 kb genomic sequences from a wide range of prokaryotes and eukaryotes, the sequences were clustered primarily according to species without information regarding the species, and increasing the lengths of the oligonucleotides from di- to tetranucleotides increased the clustering power [9]. In the present study, therefore, tri- and tetranucleotide SOMs were used according to the procedures described previously [9]. The initial weight vectors were defined by principal component analysis (PCA) instead of random values. Weight vectors (wij) were arranged in the 2D lattice denoted by i (= 0, 1, ..., I – 1) and j (= 0, 1, ..., J – 1). I was set as 350 and 550 in Figs 1 and 3, respectively, and J was defined by the nearest integer greater than (σ2/σ1) × I, where σ1 and σ2 were the standard deviations of the first and second principal components, respectively. Weight vectors (wij) were set and updated as described previously [9]. The batch-learning SOM program can be obtained from Xanagen, Inc. (http://www.xanagen.com; Email: info@xanagen.com) and G-inforBIO (http://wdcm.nig.ac.jp/inforbio/).
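The lattice sizing rule and the order independence described above can be illustrated with a minimal sketch. This is a generic batch-learning cycle, not the published program: the function names, the square neighbourhood, and the parameters `radius` and `alpha` are assumptions made for illustration, the PCA initialization is not shown, and the exact update rule is the one given in ref. 9.

```python
import math

def lattice_height(I, sigma1, sigma2):
    """J for a lattice of width I: the nearest integer greater than
    (sigma2 / sigma1) * I. Rounding at exact integers is an assumption."""
    return math.floor(sigma2 / sigma1 * I) + 1

def best_matching_point(x, weights):
    """Lattice index (i, j) whose weight vector is closest to x (squared
    Euclidean distance); ties go to the first lattice point."""
    return min(weights,
               key=lambda ij: sum((a - b) ** 2 for a, b in zip(x, weights[ij])))

def batch_cycle(data, weights, radius, alpha):
    """One batch-learning cycle: every input is assigned first, and only then
    are all weight vectors moved toward the mean of the inputs assigned to
    lattice points within `radius`. `radius` and `alpha` are hypothetical."""
    hits = {ij: [] for ij in weights}
    for x in data:
        hits[best_matching_point(x, weights)].append(x)
    updated = {}
    for (i, j), w in weights.items():
        near = [x for (p, q), xs in hits.items()
                if abs(p - i) <= radius and abs(q - j) <= radius
                for x in xs]
        if not near:
            updated[(i, j)] = list(w)
            continue
        # math.fsum is exactly rounded, so the mean does not depend on input order.
        mean = [math.fsum(column) / len(near) for column in zip(*near)]
        updated[(i, j)] = [wk + alpha * (mk - wk) for wk, mk in zip(w, mean)]
    return updated
```

Because no weight vector changes until all inputs have been assigned, presenting the same data in a different order gives the same map after each cycle. With I = 350 or 550 as in Figs 1 and 3, `lattice_height` would give J once σ1 and σ2 are known from the PCA.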


Figure 1
SOMs for non-overlapping 1 kb and overlapping 5 kb sequences of 81 prokaryotic genomes. (A) Tri- and tetranucleotide SOMs. Lattice points that include sequences from more than one species are indicated in black, and those containing sequences from a single species are indicated in a color specific to one of the following species: Aquifex aeolicus, Archaeoglobus fulgidus, Aeropyrum pernix, Agrobacterium tumefaciens, Borrelia burgdorferi, Bacillus halodurans, Brucella melitensis, Bacillus subtilis, Buchnera sp., Clostridium acetobutylicum, Caulobacter crescentus, Campylobacter jejuni, Chlamydia muridarum, Clostridium perfringens, Chlamydophila pneumoniae, Chlamydia trachomatis, Deinococcus radiodurans, Escherichia coli, Fusobacterium nucleatum, Halobacterium sp., Haemophilus influenzae, Helicobacter pylori, Lactococcus lactis, Listeria monocytogenes and innocua, Methanosarcina acetivorans, Mycoplasma genitalium, Methanococcus jannaschii, Methanopyrus kandleri, Mycobacterium leprae, Mesorhizobium loti, Mycoplasma pneumoniae, Mycoplasma pulmonis, Methanobacterium thermoautotrophicum, Mycobacterium tuberculosis, Neisseria meningitidis, Pyrococcus abyssi, Pseudomonas aeruginosa, Pyrobaculum aerophilum, Pyrococcus furiosus, Pyrococcus horikoshii, Pasteurella multocida, Rickettsia conorii, Rickettsia prowazekii, Ralstonia solanacearum, Streptococcus agalactiae, Staphylococcus aureus, Streptomyces coelicolor, Sinorhizobium meliloti, Streptococcus pneumoniae, Sulfolobus solfataricus, Sulfolobus tokodaii, Salmonella typhimurium, Synechocystis sp., Thermoplasma acidophilum, Thermotoga maritima, Treponema pallidum, Thermoanaerobacter tengcongensis, Thermoplasma volcanium, Ureaplasma urealyticum, Vibrio cholerae, Xanthomonas campestris and axonopodis, Xylella fastidiosa, and Yersinia pestis. For details of species-specific territories, see Supplementary Figure, on which numbers indicating species names are added. (B) Phylogenetic classification into 13 prokaryotic groups on DegeTetra-SOMs. Lattice points that include sequences from more than one phylotype are indicated in black, and those containing sequences from one phylotype are indicated in a color specific to one of the following phylotypes: α-proteobacteria, β-proteobacteria, γ-proteobacteria, δ-proteobacteria, Actinobacteria, Archaea, Chlamydia, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Fusobacteria, Spirochaetales and Thermotogales.

 
2.2. Nucleotide sequences
Nucleotide sequences were obtained from http://www.ddbj.nig.ac.jp/anoftp-e.html. When the number of undetermined nucleotides (Ns) in a sequence exceeded 10% of the window size, the sequence was omitted from the analysis. When the number of Ns was <10%, the oligonucleotide frequencies were normalized to the length without Ns and included in the analysis.
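The rule for undetermined nucleotides can be sketched as follows. This is an illustrative reading of the paragraph above, not the authors' code: the function name is hypothetical, skipping every oligonucleotide that contains an N is an assumption, and a window with exactly 10% Ns (a case the text does not specify) is kept here.

```python
from collections import Counter

def oligo_frequencies(seq, k=4, max_n_fraction=0.10):
    """Return oligonucleotide frequencies for one window, or None if the
    window has too many undetermined bases (Ns) and should be omitted."""
    seq = seq.upper()
    n_count = seq.count("N")
    if n_count > max_n_fraction * len(seq):
        return None  # more than 10% Ns: omit the window from the analysis
    # Assumption: oligonucleotides overlapping an N are not counted.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if "N" not in seq[i:i + k]
    )
    # Normalize to the window length without Ns, as stated in the text.
    effective_length = len(seq) - n_count
    return {oligo: c / effective_length for oligo, c in counts.items()}
```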


    3. Results
 
3.1. SOMs for species-known prokaryotes
To investigate the clustering power of SOM for a wide range of prokaryotic sequences, we first analyzed 81 prokaryotic genomes for which the complete sequence is available (a total of 226 Mb). To avoid overrepresentation of specific genomes, one genome among different strains of one species or of closely related species was used. SOMs were constructed with tri- and tetranucleotide frequencies for 226 000 non-overlapping 1 kb and overlapping 5 kb sequences with a 1 kb sliding step. These short sequence fragments were tested because such short fragments can be cloned even from small quantities of DNA that are difficult to obtain, e.g. those from extreme environments. To define the initial weight vectors, tri- and tetranucleotide frequencies in the 226 000 sequences were analyzed by PCA as described previously [9,10]. After 100 learning cycles, the tri- and tetranucleotide frequencies in the sequences were represented by the weight vectors on the tri- and tetranucleotide SOMs (Tri- and Tetra-SOMs in Fig. 1, see also Supplementary Figure). Lattice points that contained sequences from a single species are indicated in color, and those that included sequences from more than one species are indicated in black. The clustering power of the 5 kb SOM was much higher than that of the 1 kb SOM for both the Tri- and Tetra-SOMs, and the clustering power of the Tetra-SOM was higher than that of the Tri-SOM for the same sequence length (Table 1).
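The two fragmentation schemes differ only in window size, since both advance by 1 kb. A minimal sketch, assuming that a trailing segment shorter than the window is dropped (the function name and this handling of the residue are assumptions for illustration):

```python
def windows(length, size, step):
    """Yield (start, end) coordinates of full-size windows along a sequence
    of the given length; a trailing partial window is dropped (assumption)."""
    for start in range(0, length - size + 1, step):
        yield start, start + size

# Non-overlapping 1 kb fragments: size == step.
one_kb = list(windows(10_000, size=1_000, step=1_000))
# Overlapping 5 kb fragments with a 1 kb sliding step.
five_kb = list(windows(10_000, size=5_000, step=1_000))
```

Because the step is 1 kb in both cases, a long genome yields nearly the same number of 1 kb and 5 kb windows, which is consistent with the figure of 226 000 sequences quoted for both sets.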


Table 1
The proportion (%) of colored lattice points and sequences belonging to the colored points.

 
SOM recognized in sequence fragments the key combinations of short oligonucleotide frequencies that are the signature feature of each genome [12–15] and separated the sequences into species-specific territories without information regarding the species. In DNA databases, only one strand of a pair of complementary sequences is registered. Our previous analyses revealed that sequence fragments from a single prokaryotic genome are often split into two territories that reflect the transcriptional polarities of the genes present in the fragment [9]. For phylogenetic classification of sequences from uncultured microbes, it is not necessary to know the transcriptional polarity of the sequence, and the split into two territories complicates assignment to species. Therefore, we tested a new type of SOM in which frequencies of a pair of complementary oligonucleotides (e.g. AACC and GGTT) were summed, and the SOMs for the degenerate sets of tri- and tetranucleotides were designated DegeTri- and DegeTetra-SOM, respectively. This nearly halved the computation time without loss of clustering power. The highest clustering power was observed for the 5 kb DegeTetra-SOM (Table 1), and therefore, this SOM is used for the later analysis of phylogenetic classification of novel sequences from environmental samples.
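Summing complementary oligonucleotides can be sketched as below. The helper names are hypothetical, and representing each complementary pair by its alphabetically smaller member is a choice made here for illustration only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(oligo):
    """Sequence of the complementary strand, read 5' to 3'."""
    return oligo.translate(COMPLEMENT)[::-1]

def canonical(oligo):
    """One representative for an oligonucleotide and its complement
    (the alphabetically smaller of the two; an arbitrary convention)."""
    return min(oligo, reverse_complement(oligo))

def degenerate_frequencies(freqs):
    """Sum the frequencies of each pair of complementary oligonucleotides."""
    merged = {}
    for oligo, f in freqs.items():
        key = canonical(oligo)
        merged[key] = merged.get(key, 0.0) + f
    return merged

def degenerate_classes(k):
    """All degenerate classes for oligonucleotides of length k."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
```

The merge reduces the 64 trinucleotides to 32 classes and the 256 tetranucleotides to 136 (16 palindromic tetranucleotides are their own complements), so the vectors handled by the SOM are roughly half as long.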

3.2. Diagnostic oligonucleotides for species separations
Analysis of the weight vectors of individual lattice points in Fig. 1 revealed that vectors with strongly biased frequencies were located on the edge of the map. The G + C% calculated from each lattice vector was reflected in the horizontal axis and increased from left to right in the 5 kb DegeTetra-SOM (G + C% in Fig. 2). In other words, sequences with high G + C% (red), which were derived primarily from GC-rich genomes, were clustered on the right side of the map. To visualize the species-specific patterns of oligonucleotide frequencies recognized by the SOM, the frequency of each pair of complementary tetranucleotides at each lattice point was calculated, divided into five level categories with an equal number of lattice points, and visualized with the level of red or blue; seven examples of the diagnostic patterns of the species separations are presented in Fig. 2. Transitions between red (overrepresentation) and blue (underrepresentation) for various tetranucleotides often coincided with borders for species separations. To clarify the biological implications of diagnostic tetranucleotides for the species separations, we examined the association between the levels of palindromic tetranucleotides and restriction enzyme systems. By reference to the REBASE restriction enzyme database (http://rebase.neb.com/rebase/rebase.html), we could identify 24 four-base cutter enzymes for the 81 prokaryotes analyzed. The restriction site tetranucleotides were underrepresented for 23 of 24 enzymes (refer to blue zones noted with light blue arrows in palindromic tetranucleotide panels in Fig. 2). Underrepresentation of the restriction site tetranucleotides in the genome with the enzyme gene may be related to the potential danger of self-damage even in the presence of methylation systems. This result showed that the SOM effectively classified sequences according to biological categories, and a complex combination of many oligonucleotides should contribute to species separation. For example, wide varieties of oligonucleotide sequences are known to function as genetic signals (e.g. regulatory signals for gene expression), and some of the important signal sequences may be biased significantly toward non-random occurrence and presumably be diagnostic for species separation.
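The two quantities visualized in Fig. 2 can be sketched in a few lines: the G + C% implied by a lattice point's weight vector, and the division of lattice points into five equal-sized level categories. Function names and the dictionary representation of a weight vector are assumptions for illustration, and ties between equal values are broken by input position, which the text does not specify.

```python
def gc_percent(weight_vector):
    """G + C% implied by a vector of oligonucleotide frequencies,
    given as {oligonucleotide: frequency}."""
    total = sum(weight_vector.values())
    gc = sum(
        f * (oligo.count("G") + oligo.count("C")) / len(oligo)
        for oligo, f in weight_vector.items()
    )
    return 100.0 * gc / total

def five_level_categories(values):
    """Assign each value a level from 0 (lowest fifth) to 4 (highest fifth)
    so that every level holds an equal number of lattice points."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    levels = [0] * len(values)
    for rank, idx in enumerate(order):
        levels[idx] = rank * 5 // len(values)
    return levels
```

In the figure, levels 4 to 0 would correspond to dark red, light red, white, light blue and dark blue, respectively.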


Figure 2
Oligonucleotides related to species separation. The 5 kb DegeTetra-SOM presented in Fig. 1B is shown. G + C%: the G + C% values calculated from the weight vectors of individual lattice points in the 5 kb DegeTetra-SOM were divided into five level categories with an equal number of lattice points, and lattice points belonging to the categories of the highest, second-highest, middle, second-lowest and lowest G + C% are shown in dark red, light red, white, light blue and dark blue, respectively. Seven examples of diagnostic tetranucleotides for species separation are presented; five represent palindromic cases. Levels of each pair of complementary tetranucleotides calculated from individual lattice vectors were divided into five categories containing an equal number of lattice points, and the categories are shown as described above. Zones for species that have genes encoding a restriction enzyme that recognizes the respective tetranucleotide are noted by light blue arrows with the following numbers to show the species name: 1, H. pylori; 2, M. jannaschii; 3, F. nucleatum; 4, S. aureus; 5, L. lactis; 6, S. pneumoniae; 7, N. meningitidis.

 
3.3. Phylogenetic classification of species-known sequences
In the phylogenetic classification of unculturable and poorly characterized species, classification into phylotypes rather than into individual species is important. Therefore, focusing on the DegeTetra-SOMs in Fig. 1A, we examined the classification of sequences from the 81 prokaryotes into 13 major phylotypes by referring to the DNA Data Bank of Japan (DDBJ) Taxonomy Database (http://sakura.ddbj.nig.ac.jp/uniTax.html). Lattice points that contained sequences from one phylotype are indicated in color, and those that included sequences from more than one phylotype are shown in black (Fig. 1B). The number of black points was lower than in the species classification. On the 5 kb DegeTetra-SOM, 88% of the sequences were classified into the correct phylotype territory (Table 1), which was defined as a continuous territory represented by a single color for the phylotype. Species that showed close phylogenetic relations in one phylotype tended to have neighboring zones and produced a continuous territory in the phylotype classification. It should also be noted that different species belonging to one phylotype were often separated from each other. One cause of this separation was related to differences in genome G + C% between species of a single phylotype. For example, we observed a separation between GC- and AT-rich γ-proteobacteria.
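The coloring rule for lattice points can be sketched as follows. This covers only the simpler statistic of Table 1 (sequences falling on single-label lattice points); it does not implement the continuous-territory criterion used for the 88% figure. Function names and the `"mixed"` marker are assumptions for illustration.

```python
from collections import defaultdict

def label_lattice_points(assignments):
    """assignments: iterable of (lattice_point, label) pairs, one per sequence.
    A point carrying one label keeps it (colored); a point carrying several
    labels is marked 'mixed' (black)."""
    seen = defaultdict(set)
    for point, label in assignments:
        seen[point].add(label)
    return {
        point: next(iter(labels)) if len(labels) == 1 else "mixed"
        for point, labels in seen.items()
    }

def fraction_on_colored_points(assignments):
    """Proportion of sequences that fall on single-label lattice points."""
    assignments = list(assignments)
    colors = label_lattice_points(assignments)
    colored = sum(1 for point, _ in assignments if colors[point] != "mixed")
    return colored / len(assignments)
```

The same functions apply to species labels (Fig. 1A) or phylotype labels (Fig. 1B); grouping species into phylotypes can only reduce the number of mixed points, in line with the lower number of black points reported above.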

It is worth noting that paired-end reads in DNA sequencing, such as two 500 nt sequences from one cloned fragment, can be used as a single 1 kb sequence in the calculation of oligonucleotide frequencies. When oligonucleotide frequencies were calculated and normalized for sequence length, 1 kb sequences could be mapped on the 5 kb SOM (1 kb on 5 kb SOM in Table 1). The proportions of 1 kb sequences classified into the correct phylotype were 69 and 85% on the 1 and 5 kb DegeTetra-SOMs, respectively, indicating that for phylogenetic classification of species-unknown sequences, even short sequences (e.g. 1 kb) must be mapped on an SOM constructed with longer species-known sequences (e.g. 5 kb). The increased hit level in mapping on the 5 kb SOM is presumably because the SOM can extract species-specific characteristics of oligonucleotide frequencies more accurately and because the statistical fluctuation of oligonucleotide frequencies decreases as the analyzed sequences become longer.
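Mapping a short or paired-end sequence onto an existing SOM can be sketched as below. The choice of Euclidean distance to pick the lattice point is an assumption (it is the usual SOM criterion, but the text does not state it), as are the function names and the dictionary representation of frequency vectors.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Counts of all overlapping oligonucleotides of length k in one read."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def paired_end_frequencies(read1, read2, k=4):
    """Treat two reads from one clone as a single sequence: counts are taken
    within each read (no oligonucleotide spans the gap), summed, and
    normalized to the combined length."""
    counts = kmer_counts(read1, k) + kmer_counts(read2, k)
    total_length = len(read1) + len(read2)
    return {oligo: c / total_length for oligo, c in counts.items()}

def nearest_lattice_point(freqs, weights):
    """Lattice point whose weight vector is closest to freqs
    (squared Euclidean distance; missing oligonucleotides count as 0)."""
    def sq_dist(w):
        keys = set(freqs) | set(w)
        return sum((freqs.get(o, 0.0) - w.get(o, 0.0)) ** 2 for o in keys)
    return min(weights, key=lambda point: sq_dist(weights[point]))
```

Because the frequencies are normalized for length, a 1 kb fragment and the 5 kb fragments used to build the map are directly comparable, which is what allows the "1 kb on 5 kb SOM" mapping in Table 1.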

For phylogenetic classification of novel sequences from environmental samples, SOMs should be constructed in advance with all available sequences of known species rather than only those from completely sequenced genomes. We thus constructed a DegeTetra-SOM with 210 000 non-overlapping 5 kb sequences (a total of 1.05 Gb) from 1502 species-known prokaryotes for which at least 10 kb of sequence has been deposited in DDBJ/EMBL/GenBank. These 1502 prokaryotes were classified into 25 phylotypes with reference to the NCBI Taxonomy Database. The classification according to phylotype was apparent (Species-known Seq. in Fig. 3A).


Figure 3
Phylogenetic classification of environmental sequences. (A) DegeTetra-SOM of non-overlapping 5 kb sequences of species-known prokaryotes. The genomic sequences from 1502 prokaryotes were used. Species-known Seq.: prokaryotic sequences were classified into 25 phylotypes. Lattice points that include sequences from more than one phylotype are indicated in black, and those that contain sequences from a single phylotype are indicated in a color specific to one of the following phylotypes: α-proteobacteria, β-proteobacteria, γ-proteobacteria, δ-proteobacteria, ε-proteobacteria, Actinobacteria, Aquificae, Bacteroidetes, Chlamydiae, Chlorobi, Chloroflexi, Crenarchaeota, Cyanobacteria, Deinococcus-Thermus, Dictyoglomi, Euryarchaeota, Fibrobacteres, Firmicutes, Fusobacteria, Nitrospirae, Planctomycetes, Spirochaetales, Thermodesulfobacteriales, Thermotogales and Verrucomicrobiae. Sargasso Seq. >5 kb and >1 kb: 1 kb sequence fragments derived from contigs longer than 5 kb and 1 kb, respectively, were mapped on the 5 kb DegeTetra-SOM. The residual sequence segment <1 kb derived from each contig was omitted from the analysis. Lines were drawn to show phylotype borders. The dominant genomes noted by Venter et al. [11] are indicated by arrows with the following numbers to show the genus name: 1, Synechococcus; 2, Prochlorococcus; 3, 4, Shewanella; 5, 6, Burkholderia. All Sargasso Seq. 3D: all 811 000 Sargasso sequences were mapped after normalization of the sequence length, and the square root of the number of Sargasso sequences mapped on each lattice point is indicated by the height of the bar. (B) DegeTetra-SOM of environmental and species-known sequences. SOM was constructed with 211 000 5 kb prokaryote sequences plus 218 000 Sargasso and 5000 biofilm 1 kb sequences after normalization of the sequence length. Sargasso and biofilm sequence entries longer than 1 kb were selected and divided into 1 kb fragments, and the residual segment <1 kb was omitted from the analysis. Species-known plus environmental Seq.: lattice points that contain sequences from a single phylotype are indicated in color as described in Fig. 3A, those that contain only Sargasso or only biofilm sequences are each indicated in a separate color, and those that include environmental and species-known sequences or those from more than one known phylotype are indicated in black. Biofilm Seq.: the square root of the number of biofilm sequences classified into each lattice point is indicated by the height of the bar, distinctively colored to show the dominant species reported by Tyson et al. [16]: Ferroplasma, Leptospirillum and Thermoplasmatales. Sargasso Seq. unclassified: the square root of the number of Sargasso sequences classified into each lattice point containing no species-known sequences is indicated by the height of the colored bar. Sargasso Seq. classified: the square root of the number of Sargasso sequences that were classified into lattice points containing species-known sequences from a single phylotype is indicated by the height of the bar, distinctively colored to show the phylotype.

 
3.4. Phylogenetic classification of environmental sequences
One goal of metagenomic studies is to characterize and hopefully reconstruct multiple genomes, at least for dominant species in an environment, by sequencing a large number of DNA fragments obtained from an environmental sample. Such an approach should allow extensive surveys of sequences useful in scientific and industrial applications and assist in developing accurate views of the ecology of uncultured microbes. The traditional homology-based phylogenetic classification methods, however, have inevitably focused on well-characterized genes such as rDNA, for which orthologs from a wide range of phylotypes are available, but most well-characterized genes, including rDNA, are not industrially attractive. It would be best if microbial diversity could be assessed during the process of screening for novel genes with industrial and scientific significance. We developed such a method.

Venter et al. [11] applied shotgun sequencing to mixed genomes collected from the Sargasso Sea near Bermuda and deposited ~811 000 sequence fragments (a total of 1 Gb) in DDBJ/EMBL/GenBank. These environmental sequences were analyzed with the newly developed SOM method. A majority of Sargasso sequence entries in the database corresponded to one-path read sequences; however, a limited number represented contigs assembled from multiple sequences derived from dominant genomes in this environment. We first selected 4258 entries of the contigs longer than 5 kb and mapped 34 000 fragments of 1 kb derived from these contigs on the aforementioned SOM constructed for the 1502 known prokaryotes (Sargasso Seq. >5 kb in Fig. 3A). Distinct clusters of Sargasso sequences were observed, and all dominant species reported previously [11] were included in these clusters. We then mapped 218 400 fragments of 1 kb derived from 134 600 entries longer than 1 kb and all 811 000 sequences (Sargasso Seq. >1 kb and All Sargasso Seq. 3D, respectively). Although the sequences were not clustered tightly, skewed distributions were observed. This is more easily visualized with a 3D representation in which the number of Sargasso sequences classified into each lattice point is indicated by the height of a bar. Zones with abundant Sargasso sequences were located at the bottom left of the SOM, which is an area that contains AT-rich sequences. In the respective zone, the sizes of individual territories of known phylotypes were relatively small, and the territories were complex (Species-known Seq. in Fig. 3A), indicating that sequences from various poorly characterized species were crowded there. The A + T-rich Sargasso sequences derived from novel species might be mapped there because of the high heterogeneity for the species-known sequences in the respective zones.

To investigate this possibility, we next constructed a DegeTetra-SOM with the species-known sequences plus Sargasso sequences longer than 1 kb. In addition, to confirm in silico reassociation of sequences derived from a single genome but cloned independently in a metagenomic library, we included sequence fragments derived from microbe mixtures in an acidophilic biofilm growing in acid mine drainage [16], which is a worldwide environmental problem. Because of the low complexity of the mixed genomes, Tyson et al. [16] focused on this biofilm in order to reconstruct dominant genomes with shotgun sequencing of the metagenomic library. In the ‘Species-known and environmental Seq.’ panel in Fig. 3B, lattice points that contained only Sargasso or only biofilm sequences are each indicated in a separate color; those that contained sequences from a single known phylotype are colored as described in Fig. 3A; and those that included sequences from more than one phylotype or both environmental and species-known sequences are shown in black. Most of the biofilm sequences derived from the low-complexity metagenome library were located in a few distinct territories (Biofilm Seq. in Fig. 3B), confirming that most sequence fragments derived from a single genome but cloned independently can be reassociated in silico. This can provide the rationale for phylogenetic classification of sequences derived from a high-complexity library such as the Sargasso sequences. The territories of species-known sequences in Fig. 3B shrank appreciably when compared with those in Fig. 3A, indicating that a large portion of Sargasso sequences had oligonucleotide frequencies distinct from those of species-known prokaryotes. The number of Sargasso sequences classified into lattice points containing no species-known sequences is indicated by the height of the bar (Sargasso Seq. unclassified in Fig. 3B); 79% of the Sargasso sequences belonged to this phylotype-unclassified category. These Sargasso sequences should correspond to sequences derived from genomes that are poorly characterized at the present moment. The remaining 21% of Sargasso sequences were associated with species-known sequences (black lattices in Species-known and environmental Seq. in Fig. 3B). The number of Sargasso sequences classified into lattice points containing sequences only from a single phylotype regarding species-known sequences is indicated by the height of the bar colored to indicate the phylotype (Sargasso Seq. classified). On our Web page (http://lavender.genes.nig.ac.jp/takaabe/DNAres/SPT1.htm), 92 genera, whose sequences were associated with the Sargasso sequences, are listed together with numbers of the associated Sargasso sequences. It should be noted that most of the Sargasso sequences have not been characterized phylogenetically because there were no orthologs for these novel sequences. Detailed phylotype predictions for individual Sargasso sequences with the SOM method are available on our Web page (http://lavender.genes.nig.ac.jp/takaabe/DNAres/SPT2.htm).
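The split of environmental sequences into phylotype-unclassified and classified categories, and the square-root bar heights used in the 3D panels, can be sketched as follows. The function names, the `"unclassified"` and `"ambiguous"` markers, and the input layout are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def classify_environmental(env_points, known_labels):
    """env_points: lattice point of each environmental sequence.
    known_labels: {lattice_point: set of phylotypes of the species-known
    sequences on that point}. Returns one category per sequence."""
    categories = []
    for point in env_points:
        labels = known_labels.get(point, set())
        if not labels:
            categories.append("unclassified")   # no species-known sequence here
        elif len(labels) == 1:
            categories.append(next(iter(labels)))  # single known phylotype
        else:
            categories.append("ambiguous")      # more than one known phylotype
    return categories

def bar_heights(env_points):
    """Square root of the number of sequences on each lattice point."""
    return {point: math.sqrt(n) for point, n in Counter(env_points).items()}
```

Plotting the square root rather than the raw count keeps sparsely populated lattice points visible next to the dominant clusters.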


    4. Discussion
 
G + C% has long been used as a fundamental parameter for phylogenetic classification of microbes, but the G + C% is apparently too simple a parameter to differentiate a wide variety of known species. Oligonucleotide frequency can be used to distinguish genomes because oligonucleotide frequencies vary significantly among genomes [12–15]. The present phylogenetic classification is designed as an extension of one parameter ‘G + C%’ to the multiple parameters ‘oligonucleotide frequencies’ utilizing SOM. In the case of the homology-based phylogenetic classification, a set of orthologous sequences from a wide range of species is an absolute requirement, and therefore, the conventional method is difficult to apply to the classification of novel sequences from poorly characterized species. Because the SOM does not require orthologs, the present strategy can enhance novel metagenomic studies of uncultured microbes [17,18]. For example, the method can be applied to analyses of microbial and viral sequences derived from clinical specimens (feces and phlegm), including sterilized ones. If SOMs are constructed in advance with a large number of sequences from a wide variety of clinical samples systematically collected from various sources plus sequences from all known species, this technique may be used to screen unidentified pathogenic microorganisms responsible for poorly characterized infectious diseases.

When phylogenetic classification of sequences from uncultured microbes in a very complex ecosystem is considered, it becomes important to construct SOMs in advance with all available prokaryotic and eukaryotic sequences. For example, a certain portion of Sargasso sequences may be derived from eukaryotic genomes such as those of fungi, protozoa and fishes. Furthermore, when microorganisms that have a symbiotic or parasitic relationship with a macroorganism are studied, sequences from the macroorganism may be included in the sequence collection. Specimens in medical studies may also contain various eukaryotic DNAs. To test the SOM separation of prokaryotic and eukaryotic sequences with special reference to Sargasso sequences, we constructed a DegeTetra-SOM with 5 kb sequences from the 1502 prokaryotes plus those from 11 unicellular eukaryotes (6 fungi and 5 protozoa) and zebrafish (Fig. 4A). The power of SOM to separate prokaryotic from eukaryotic sequences was very high; only 0.1% of prokaryotic sequences were classified into eukaryotic territories. Furthermore, separation among eukaryotic species was apparent (Eukaryote Seq. separately colored). Next we mapped Sargasso sequences longer than 1 kb on this SOM (Sargasso Seq. mapped). Although a major portion of the Sargasso sequences were classified into specific prokaryote territories, 9.9% were classified into eukaryotic territories, mainly those of unicellular eukaryotes; the sequences mapped 13 times more densely in unicellular eukaryote territories than in the zebrafish territory.
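The mapping step can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used here: it assumes an already trained SOM stored as a dictionary of lattice positions and weight vectors, assumes that a sequence is assigned to the lattice point with the smallest Euclidean distance, and uses invented two-dimensional toy vectors in place of real oligonucleotide frequencies. All function and variable names are hypothetical. The fragment helper follows the rule given in the Fig. 4 legend, where the residual segment shorter than 1 kb is omitted.

```python
import math


def split_into_fragments(contig: str, size: int = 1000) -> list[str]:
    """Cut a contig into consecutive fragments of `size` bases.

    The residual segment shorter than `size` is dropped.
    """
    n_full = len(contig) // size
    return [contig[i * size:(i + 1) * size] for i in range(n_full)]


def best_matching_unit(vector, lattice):
    """Return the lattice position whose weight vector is closest to `vector`.

    `lattice` maps (row, col) -> weight vector; distance is Euclidean
    (an assumption of this sketch).
    """
    return min(lattice, key=lambda pos: math.dist(vector, lattice[pos]))


def fraction_in_territory(vectors, lattice, territory, label):
    """Fraction of `vectors` mapped to lattice points labelled `label`.

    `territory` maps (row, col) -> label such as "prokaryote" or "eukaryote".
    """
    if not vectors:
        return 0.0
    hits = sum(
        territory[best_matching_unit(v, lattice)] == label for v in vectors
    )
    return hits / len(vectors)


# Toy 1 x 2 lattice with invented weight vectors and territory labels.
toy_lattice = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8]}
toy_territory = {(0, 0): "prokaryote", (0, 1): "eukaryote"}
toy_vectors = [[0.8, 0.2], [0.85, 0.15], [0.1, 0.9], [0.7, 0.3]]

eukaryote_share = fraction_in_territory(
    toy_vectors, toy_lattice, toy_territory, "eukaryote"
)
print(eukaryote_share)  # 0.25 for these toy vectors
print(len(split_into_fragments("A" * 2500)))  # 2 fragments, 500 bases dropped
```

A percentage such as the share of environmental sequences falling into eukaryotic territories corresponds to `fraction_in_territory` evaluated over all mapped fragments.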


Figure 4
 
SOMs for prokaryotic and eukaryotic sequences and for 16S rDNAs. (A) SOM was constructed with 211 000 and 210 600 5 kb sequences of 1502 prokaryotes and 12 eukaryotes, respectively. Prokaryote Seq. separately colored: lattice points that contain sequences from a single prokaryotic phylotype are indicated in colors as described in Fig. 3A; those that contain only eukaryotic sequences are indicated in color; those that include both prokaryotic and eukaryotic sequences or sequences from more than one prokaryotic phylotype are indicated in black. Eukaryote Seq. separately colored: lattice points that contain only prokaryotic sequences are indicated in color; those that include both prokaryotic and eukaryotic sequences or sequences from more than one eukaryotic species are indicated in black; those that contain sequences only from a single eukaryote are indicated in a color specific to each species: Aspergillus nidulans, Encephalitozoon cuniculi, Eremothecium gossypii, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Neurospora crassa, Cryptosporidium parvum, Dictyostelium discoideum, Plasmodium yoelii, Theileria annulata, Trypanosoma brucei and zebrafish Danio rerio. Sargasso Seq. mapped: 1 kb sequence fragments derived from contigs longer than 1 kb were mapped on the 5 kb DegeTetra-SOM after normalization of the sequence length and are indicated in color. The residual segment <1 kb derived from each contig was omitted from the analysis. Lines were drawn to show borders of eukaryotic species. Sargasso Seq. mapped, 3D: the square root of the number of Sargasso sequences mapped on each lattice point is indicated by the height of the colored bar.
(B) Species-known 16S rDNA: Tetra-SOMs were constructed for 38 660 16S rDNA sequences longer than 1 kb from 19 196 known prokaryotes after normalization for the sequence length. Lattice points are indicated by black or in colors as described in Fig. 3A. Species-unknown 16S rDNA: we selected 6855 sequences for which classification into 25 phylotypes is annotated in GenBank and mapped these sequences on the species-known 16S rDNA Tetra-SOM. Lattice points that contained sequences of a single phylotype annotated are indicated in color representing the phylotype, and those that included sequences from more than one phylotype are indicated in black.

 
Genome segments introduced through horizontal gene transfer from a phylogenetically distant genome tend to retain the sequence characteristics of the donor genome and can be distinguished from those of the host genome. Even in the 5 kb DegeTetra-SOM in Fig. 1B, there are lattice points marked in black, which should contain sequences with tetranucleotide frequencies distinct from those of the major portion of the host genome. Such sequences may correspond, at least in part, to segments transferred horizontally from a distant phylotype. When we investigated Bacillus subtilis sequences located outside the Firmicutes territory, we found that many were derived from pathogenic islands where prophage- and foreign-type sequences are clustered.19 The SOM, like the sequence homology-based method, may classify such horizontally transferred sequences into the donor genome. Although the information regarding horizontal transfer is interesting, it creates problems in phylogenetic classification of environmental sequences. Therefore, it is desirable to develop a strategy to address this problem. When we constructed Tetra- and DegeTetra-SOMs with 38 660 16S ribosomal RNA gene (rDNA) sequences longer than 1 kb from 19 196 known prokaryotes available in GenBank, clear clustering according to phylotype was found with reference to the NCBI Taxonomy Database; 97 and 95% of sequences were classified into the correct phylotype territory on the Tetra- and DegeTetra-SOMs, respectively, and the result of the Tetra-SOM is presented (Species-known 16S rDNA in Fig. 4B). The slightly higher hit level of the Tetra-SOM compared with the DegeTetra-SOM may be attributable to the fact that rDNA sequences are registered in GenBank with the same polarity for all species. This is in contrast with the situation for usual genomic sequences. The SOM may detect details of sequence characteristics of 16S rDNAs that can be identified only in sequences with the same polarity.
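The role of polarity can be made concrete with a short sketch. It assumes that the degenerate tetranucleotide set pools each tetranucleotide with its reverse complement, so that a fragment and its opposite strand yield the same counts; the names `reverse_complement`, `canonical` and `degenerate_counts` and the toy fragment are hypothetical and are not taken from the software used in this study.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(oligo: str) -> str:
    """Pick one representative for an oligonucleotide and its complement."""
    return min(oligo, reverse_complement(oligo))


def degenerate_counts(seq: str, k: int = 4) -> dict[str, int]:
    """Count k-mers after pooling each k-mer with its reverse complement.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[canonical(word)] += 1
    return counts


# Toy fragment (invented): both strands give identical pooled counts.
fragment = "ATGGCGTACGTTAGCCGATAAGC"
forward = degenerate_counts(fragment)
reverse = degenerate_counts(reverse_complement(fragment))

print(len(forward))        # 136 pooled categories instead of 256
print(forward == reverse)  # True: strand orientation no longer matters
```

Pooling reduces the 256 tetranucleotides to 136 categories (120 complementary pairs plus 16 self-complementary words). That is useful when fragment orientation is arbitrary, whereas unpooled counts keep strand-specific information, which can be exploited when all sequences are registered with the same polarity.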

When we focused on 16S rDNA sequences from species-unidentified prokaryotes in GenBank, we found 16 652 sequences longer than 1 kb. Because the main purpose of 16S rDNA sequencing of unidentified prokaryotes in environments is phylogenetic assignment, phylotypes assigned with the homology-based method are often annotated in GenBank. We selected 6855 sequences for which classification into major phylotypes is annotated and mapped these sequences on the species-known 16S rDNA Tetra-SOM (Species-unknown 16S rDNA in Fig. 4B). Lattice points that contained sequences of a single annotated phylotype are indicated in the color representing the phylotype, and those that included sequences from more than one phylotype are indicated in black. The major zones were marked in color, and the color pattern was almost identical to that of known species. Although the SOM does not require sequence alignments, assignments were practically identical to those obtained by the homology-based method for this case, in which a large number of orthologous sequences are available. When extensive shotgun sequencing of an environmental sample is done, substantial amounts of both non-rDNA and rDNA sequences become available. Combined SOM analyses of rDNA and non-rDNA sequences may provide detailed information regarding microbial diversity on the basis of detailed knowledge of rDNA phylogeny. This may also solve, at least in part, complications caused by horizontal transfer of sequences. For example, when no rDNA sequences are found in a certain phylotype territory in the rDNA SOM, but a statistically significant number of non-rDNA sequences is found in the respective phylotype territory in the SOM constructed with usual genomic sequences, these non-rDNA sequences may be candidates for horizontally transferred sequences. When a statistically significant number of rDNA sequences map to a certain phylotype territory in the rDNA SOM and are assigned to one species with the homology-based method on the basis of the detailed rDNA phylogeny, a major portion of the non-rDNA sequences mapped to this phylotype territory on the standard SOM can be predicted to be derived from the respective species. Collectively, the combination of SOMs for rDNA and non-rDNA sequences will reconstruct in silico the rDNA and non-rDNA sequences derived from a single genome but cloned independently.
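The combined use of the two SOMs can be expressed as a simple decision rule, sketched below. This is only an illustration of the logic described above: the fixed count thresholds `min_genomic` and `min_rdna` are placeholders that stand in for a proper statistical test, the function names are hypothetical, and the per-phylotype counts are invented.

```python
def hgt_candidate_territories(rdna_counts, genomic_counts, min_genomic=10):
    """Phylotype territories with many non-rDNA hits but no rDNA hits.

    `min_genomic` is a placeholder threshold, not a significance test.
    """
    return sorted(
        phylotype
        for phylotype, n in genomic_counts.items()
        if n >= min_genomic and rdna_counts.get(phylotype, 0) == 0
    )


def supported_territories(rdna_counts, genomic_counts, min_rdna=5):
    """Phylotype territories where rDNA hits support the non-rDNA hits.

    `min_rdna` is likewise a placeholder threshold.
    """
    return sorted(
        phylotype
        for phylotype, n in rdna_counts.items()
        if n >= min_rdna and genomic_counts.get(phylotype, 0) > 0
    )


# Invented counts of sequences mapped to each phylotype territory.
toy_rdna = {"Alphaproteobacteria": 40, "Cyanobacteria": 12}
toy_genomic = {
    "Alphaproteobacteria": 900,
    "Cyanobacteria": 300,
    "Firmicutes": 25,
    "Chlamydiae": 2,
}

print(hgt_candidate_territories(toy_rdna, toy_genomic))
# ['Firmicutes']
print(supported_territories(toy_rdna, toy_genomic))
# ['Alphaproteobacteria', 'Cyanobacteria']
```

In this toy case the Firmicutes territory receives non-rDNA sequences without any rDNA support and would be flagged for inspection as possible horizontal transfer, while the Chlamydiae count falls below the placeholder threshold and is ignored.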


    Supplementary material
 
Supplementary material is available online at http://dnaresearch.oxfordjournals.org


    Acknowledgements
 
This work was supported by grants from ACT-Japan Science and Technology Corporation and the Advanced and Innovational Research Program in Life Sciences and a Grant-in-Aid for Scientific Research on Priority Areas ‘Applied Genome’ from the Ministry of Education, Culture, Sports, Science and Technology of Japan. Part of the computation was performed with the Earth Simulator of the Japan Agency for Marine-Earth Science and Technology.


    Footnotes
 
*To whom correspondence should be addressed. Tel. +81-55-981-6835, Fax. +81-55-981-6896, Email: takaabe{at}genes.nig.ac.jp

Communicated by Katsumi Isono


    References
 

  1. Amann, R. I., Ludwig, W., Schleifer, K. H. 1995, Phylogenetic identification and in situ detection of individual microbial cells without cultivation, Microbiol. Rev., 59, 143–169.
  2. Hugenholtz, P. and Pace, N. R. 1996, Identifying microbial diversity in the natural environment: a molecular phylogenetic approach, Trends Biotechnol., 14, 190–197.
  3. Rondon, M. R., August, P. R., Bettermann, A. D., et al. 2000, Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms, Appl. Environ. Microbiol., 66, 2541–2547.
  4. Lorenz, P., Liebeton, K., Niehaus, F., Eck, J. 2002, Screening for novel enzymes for biocatalytic processes: accessing the metagenome as a resource of novel functional sequence space, Curr. Opin. Biotechnol., 13, 572–577.
  5. DeLong, E. F. 2002, Microbial population genomics and ecology, Curr. Opin. Microbiol., 5, 520–524.
  6. Schloss, P. D. and Handelsman, J. 2003, Biotechnological prospects from metagenomics, Curr. Opin. Biotechnol., 14, 303–310.
  7. Kohonen, T. 1990, The self-organizing map, Proc. IEEE, 78, 1464–1480.
  8. Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J. 1996, Engineering applications of the self-organizing map, Proc. IEEE, 84, 1358–1384.
  9. Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T., Ikemura, T. 2003, Informatics for unveiling hidden genome signatures, Genome Res., 13, 693–702.
  10. Kanaya, S., Kinouchi, M., Abe, T., et al. 2001, Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome, Gene, 276, 89–99.
  11. Venter, J. C., Remington, K., Heidelberg, J. F., et al. 2004, Environmental genome shotgun sequencing of the Sargasso Sea, Science, 304, 66–74.
  12. Karlin, S., Campbell, A. M., Mrazek, J. 1998, Comparative DNA analysis across diverse genomes, Annu. Rev. Genet., 32, 185–225.
  13. Nussinov, R. 1984, Doublet frequencies in evolutionary distinct groups, Nucleic Acids Res., 12, 1749–1763.
  14. Gentles, A. J. and Karlin, S. 2001, Genome-scale compositional comparisons in eukaryotes, Genome Res., 11, 540–546.
  15. Teeling, H., Meyerdierks, A., Bauer, M., Amann, R., Glockner, F. O. 2004, Application of tetranucleotide frequencies for the assignment of genomic fragments, Environ. Microbiol., 6, 938–947.
  16. Tyson, G. W., Chapman, J., Hugenholtz, P., et al. 2004, Community structure and metabolism through reconstruction of microbial genomes from the environment, Nature, 428, 37–43.
  17. Uchiyama, T., Abe, T., Ikemura, T., Watanabe, K. 2005, Substrate-induced gene-expression screening of environmental metagenome libraries for isolation of catabolic genes, Nat. Biotechnol., 23, 88–93.
  18. Hayashi, H., Abe, T., Sakamoto, M., et al. 2005, Direct cloning of genes encoding novel xylanases from human gut, Can. J. Microbiol., 51, 251–259.
  19. Kunst, F., Ogasawara, N., Moszer, I., et al. 1997, The complete genome sequence of the gram-positive bacterium Bacillus subtilis, Nature, 390, 249–256.




