DNA Research Advance Access published online on June 15, 2007
DNA Research, doi:10.1093/dnares/dsm011
Proteome-Wide Prediction of Novel DNA/RNA-Binding Proteins Using Amino Acid Composition and Periodicity in the Hyperthermophilic Archaeon Pyrococcus furiosus
1 Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan
2 Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-8520, Japan
3 Faculty of Environment and Information Studies, Keio University, Fujisawa 252-8520, Japan
Received 13 March 2007; revised 30 April 2007
| Abstract |
|---|
Proteins play a critical role in complex biological systems, yet about half of the proteins in publicly available databases are annotated as functionally unknown. Proteome-wide functional classification using bioinformatics approaches is thus becoming an important method for revealing unknown protein functions. Using the hyperthermophilic archaeon Pyrococcus furiosus as a model species, we used the support vector machine (SVM) method to discriminate DNA/RNA-binding proteins from proteins with other functions, using amino acid composition and periodicities as feature vectors. We defined the resulting SVM discriminant values as the composition (CO) score and the periodicity (PD) score. The P. furiosus proteins were classified into three classes (I–III) on the basis of a two-dimensional correlation analysis of the CO score and PD score. Approximately 87% of the functionally known proteins categorized as class I (CO score + PD score > 0.6) were DNA/RNA-binding proteins. Applying the two-dimensional correlation analysis to the 994 hypothetical proteins in P. furiosus, we predicted a total of 151 proteins to be novel DNA/RNA-binding protein candidates. The DNA/RNA-binding activities of randomly chosen hypothetical proteins were then verified experimentally. Six out of seven candidate proteins in class I possessed DNA/RNA-binding activities, supporting the efficacy of our method.
Key words: DNA/RNA-binding protein; amino acid periodicity; support vector machine; archaea
| 1. Introduction |
|---|
The last decade has been a remarkable time in the field of genome science. DNA sequences from over 2400 species have been determined,1
For the past few years, we have been working on RNA metabolism in the hyperthermophilic archaeon Pyrococcus furiosus8–10 and have reported on our experimental system, in which an expression cloning method is used for extracting DNA/RNA-binding proteins at the proteome level. During this work, we observed that charged amino acids, such as aspartic acid, glutamic acid, arginine, and lysine, appeared in a periodic manner in the sequences of both the novel RNA-binding protein FAU-1 and Ribonuclease E.8
It is possible that certain acidic and basic amino acid periodicities affect the secondary or tertiary structure of a protein so that it gains DNA/RNA-binding activity. Amino acid periodicities are commonly observed in the sequences of various proteins, such as myosin and amyloids,11 and serine–threonine and tyrosine protein kinases,12 and are known to be strongly correlated with their secondary structures.
The purpose of the current study was to demonstrate that a bioinformatics approach focusing on the periodicity in a protein's primary structure could be a suitable method for identifying DNA/RNA-binding proteins. Previously, several support vector machine (SVM)-based methods were developed for predicting DNA-binding and RNA-binding proteins on the basis of various amino acid profiles (i.e. overall composition, pseudo-amino acid composition, surface composition, electrostatic potential, and hydrophobicity).13–16 SVM is one of the most powerful supervised learning algorithms and has recently been widely used in the field of bioinformatics. We describe here an SVM-based method for classifying known DNA/RNA-binding proteins from P. furiosus using amino acid composition and periodicity as feature vectors. The discriminant values (SVM output) derived from these profiles were defined as two new indices: the composition (CO) score and the periodicity (PD) score. Amino acid composition is known to be strongly correlated with protein secondary structure class17 and subcellular localization18,19 and is assumed to support protein function classification. Therefore, on the basis of a two-dimensional correlation analysis, we combined the CO score with the PD score to further improve the performance of DNA/RNA-binding protein prediction. The two-dimensional correlation analysis was then applied to hypothetical proteins of P. furiosus, and promising candidates for novel DNA/RNA-binding proteins were selected. The DNA/RNA-binding activities of these candidate proteins were examined experimentally, and many of them were confirmed to possess DNA/RNA-binding activities.
| 2. Materials and methods |
|---|
2.1. Protein data set and functional annotations
Automated annotations and amino acid sequences of proteins from the two archaeal species, P. furiosus (2057 proteins) and Sulfolobus solfataricus (2934 proteins), were taken from the EMBL database (http://www.ebi.ac.uk/embl/; Release 83, June 2005). Each protein entry has a UniProt Knowledgebase (UniProtKB) accession code corresponding to its entry in either UniProtKB/Swiss-Prot (http://www.ebi.ac.uk/Swiss-Prot/; Release 47, May 2005) or UniProtKB/TrEMBL (http://www.ebi.ac.uk/trembl/; Release 31, September 2005). Both databases contain information on the gene ontology annotation (GOA: a combination of electronic assignment and manual annotation), and protein data are from the domain databases InterPro20
We defined functionally known proteins as functionally annotated proteins in the Swiss-Prot or TrEMBL databases with additional GOA. TrEMBL protein entries with no additional annotation were categorized as putative functional proteins. Proteins annotated as hypothetical in the database were defined as hypothetical proteins. DNA/RNA-binding proteins were defined as those proteins whose Swiss-Prot, TrEMBL, or GOA annotations included any of the following keywords: DNA, RNA, ribosome(al), RNP, ribonucleo-, helicase, nuclease, or nucleic acid binding. To reduce the bias of functional variety in the protein data set, the functionally known proteins of the six model species were filtered to remove homologous proteins (E-value < 1 × 10⁻⁴) and short peptides (<20 amino acids) from further analyses. In total, we prepared 477 proteins of P. furiosus, 582 of S. solfataricus, 914 of Bacillus subtilis, 1436 of Escherichia coli, 865 of Arabidopsis thaliana, and 566 of Caenorhabditis elegans as a representative set for the analysis (Table 1).
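The keyword rule above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `is_dna_rna_binding`, the case handling, and the substring matching are assumptions of this sketch, not details given in the text.

```python
# Keywords are taken from the definition in the text. The split into
# case-sensitive and case-insensitive groups is an assumption of this sketch.
CASE_SENSITIVE = ["DNA", "RNA", "RNP"]
CASE_INSENSITIVE = ["ribosom", "ribonucleo", "helicase", "nuclease",
                    "nucleic acid binding"]


def is_dna_rna_binding(annotation: str) -> bool:
    """Return True if an annotation string contains a DNA/RNA-related keyword."""
    if any(keyword in annotation for keyword in CASE_SENSITIVE):
        return True
    # Lower-case and treat hyphens as spaces so that "nucleic acid-binding"
    # and "Ribosomal" are also caught.
    normalized = annotation.lower().replace("-", " ")
    return any(keyword in normalized for keyword in CASE_INSENSITIVE)


for text in ["DNA polymerase", "50S ribosomal protein L14e",
             "Glutamate dehydrogenase"]:
    print(text, "->", is_dna_rna_binding(text))
```

A real pipeline would apply such a test to the combined Swiss-Prot, TrEMBL, and GOA annotation text of each entry.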
2.2. Amino acid periodicity
To analyze amino acid periodicities, we used eight physico-chemical profiles (chemical, Sneath, Dayhoff, Stanfel, functional, charge, structural, and hydrophobicity)22
Amino acid periodicity was defined as the regular appearance of a certain amino acid group (X), Y (Y ≥ 3) times in a protein sequence with a period (the number of amino acids from one appearance to the next) of Z. Although a previous analysis in E. coli defined the range of periodicity as 2 to 50, to eliminate redundant periodicities (e.g. period 5 includes period 10), we used prime numbers and their multiples [2, 3, 5, 7, 8 (2 × 4), 9 (3 × 3), 11, 13, 15 (5 × 3), 17, 19]. To take into account the fluctuation of periodicities, we set the error range as ±1. For example, in Seq1 (XXXXAXXAXXXX), A appears only twice, so no periodicity can be defined. Seq2 (XXBXXXXBXXXXBX) contains three Bs with a period of five (B-5 periodicity). Seq3 (XCXXXCXXCXXXCXXCX) contains five Cs with multiple periodicities (two of length 3, two of length 4, and two of length 7). On the basis of the error range of ±1, length 4 is included in length 3; therefore, Seq3 is defined to have C periods of only 3 and 7.
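The definition above can be illustrated with a small Python sketch that looks for the longest chain of residues from one amino acid group whose spacing matches a given period within ±1. The function name `longest_periodic_chain` and the dynamic-programming search are assumptions made for illustration; the sketch does not implement the rule for merging overlapping periods (for example, folding length 4 into length 3) and is not the authors' program.

```python
def longest_periodic_chain(seq, group, period, tol=1, min_count=3):
    """Return 0-based positions of the longest chain of residues in `group`
    whose successive spacing is within `tol` of `period`.
    An empty list means fewer than `min_count` residues form such a chain."""
    positions = [i for i, aa in enumerate(seq) if aa in group]
    if not positions:
        return []
    best_len = [1] * len(positions)   # longest chain ending at each position
    prev = [-1] * len(positions)      # back-pointer for chain reconstruction
    for i in range(len(positions)):
        for j in range(i):
            gap = positions[i] - positions[j]
            if abs(gap - period) <= tol and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    end = max(range(len(positions)), key=best_len.__getitem__)
    if best_len[end] < min_count:
        return []
    chain = []
    while end != -1:
        chain.append(positions[end])
        end = prev[end]
    return chain[::-1]


# The three example sequences from the text.
print(longest_periodic_chain("XXXXAXXAXXXX", "A", 3))
print(longest_periodic_chain("XXBXXXXBXXXXBX", "B", 5))
print(longest_periodic_chain("XCXXXCXXCXXXCXXCX", "C", 3))
print(longest_periodic_chain("XCXXXCXXCXXXCXXCX", "C", 7))
```

Passing a multi-letter group such as "RK" works the same way, because membership is tested per residue.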
2.3. SVM classification of DNA/RNA-binding proteins based on amino acid periodicity and composition
SVM is a non-linear classifier that creates a maximum-margin hyperplane by applying a kernel trick to the feature vectors. We performed two different SVM analyses, one on the amino acid periodicity data set and one on the amino acid composition data set. For amino acid periodicity, we calculated the relative coverage of the periodic region (R) of each protein in the training set (i) for 253 patterns of amino acid periodicity (j), that is, 23 amino acid groups × 11 periods (2, 3, 5, 7, 8, 9, 11, 13, 15, 17, and 19):
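$$R_{ij} = \frac{\text{number of residues of protein } i \text{ in the region covered by periodicity } j}{\text{total number of residues of protein } i}$$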
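For amino acid composition, we calculated the relative composition (C) of each protein in the training set (i) for the 20 types of amino acids (k):
$$C_{ik} = \frac{\text{number of residues of amino acid } k \text{ in protein } i}{\text{total number of residues of protein } i}$$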
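As a rough illustration of the two feature types, the sketch below computes a 20-dimensional composition vector and a relative coverage value for one periodic chain. The names `composition_vector` and `relative_coverage` are hypothetical, and taking the periodic region to be the span from the first to the last residue of the chain is an assumption of this sketch; the text does not spell out how the region is delimited.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition_vector(seq):
    """Relative frequency of each of the 20 amino acids in `seq`."""
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def relative_coverage(chain_positions, seq_length):
    """Fraction of the sequence spanned by one periodic chain.
    Assumption: the periodic region runs from the first to the last
    residue of the chain, inclusive."""
    if not chain_positions:
        return 0.0
    span = chain_positions[-1] - chain_positions[0] + 1
    return span / seq_length


comp = composition_vector("AAKK")
print(comp[AMINO_ACIDS.index("A")], comp[AMINO_ACIDS.index("K")])
print(relative_coverage([2, 7, 12], 14))
```

One coverage value per amino acid group and period would give the 253-dimensional periodicity vector, and the composition vector gives the 20-dimensional one.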
These factors were used as feature vectors to classify proteins into two groups: DNA/RNA-binding proteins and proteins with other functions. For SVM training, the data label for DNA/RNA-binding proteins was denoted as 1 and that for proteins with other functions was denoted as −1. SVM analysis in this study was performed using the default parameters in Gist package version 2.3, which contains software tools for SVM classification.23 We tested two types of kernel function (linear and radial basis) and selected the one with the higher prediction performance. Consequently, a maximum-margin hyperplane based on a radial basis function kernel (r = 1) was applied to the protein test set for amino acid periodicity, and the discriminant value of each protein was defined as the PD score. Likewise, a maximum-margin hyperplane based on a linear kernel was applied to the protein set for amino acid composition, and the discriminant values were defined as the CO score.
2.4. Validation of PD score performance
The performance of the PD score at predicting novel proteins was validated by a 10-fold cross-validation test, one of the most reliable methods for estimating the performance of a predictor. For example, the representative data set of 477 P. furiosus proteins was randomly split into 10 mutually exclusive subsets D1, D2, ... , D10 of approximately equal size. Each subset was tested with a model trained on the remaining nine subsets. Estimated accuracies were derived as the average values over the 10 tests.
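The splitting step can be sketched as follows. The helper names `k_fold_indices` and `train_test_splits` and the fixed random seed are illustrative assumptions; the study itself used the Gist package, whose internals are not reproduced here.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k mutually exclusive folds
    whose sizes differ by at most one."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = k_fold_indices(n_items, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 477 proteins split into 10 folds of 47 or 48 proteins each.
print(sorted(len(fold) for fold in k_fold_indices(477)))
```

Averaging a performance index over the 10 test folds, and taking its standard deviation, gives the kind of estimate and error bar described in this section.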
First, the classification accuracy of the PD score was compared with that of randomly chosen single amino acid periodicities using receiver operating characteristic (ROC) analysis. The ROC curve is a simple and effective method for comparing the overall prediction performance of different methods, including SVMs.24–26 The ROC curve is represented by two indices: sensitivity and specificity. The sensitivity and specificity of the PD score were calculated using a 10-fold cross-validation test with a PD score cut-off of 0. They are defined as follows:
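$$\mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.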
Error bars representing the standard deviations derived from the 10-fold cross-validation test were added for each data set. Secondly, the PD score was compared against the CO score (amino acid composition-based SVM) and another SVM-based protein function predictor, SVM-Prot.27 To assess the PD score performance, we calculated the overall accuracy (ACC) of the PD score using the 10-fold cross-validation test. The training data set of SVM-Prot is fixed as a combination of 54 functional protein families, and SVM-Prot predicts several functional classes according to the probability of correct prediction. SVM-Prot uses a positive set of 1943 and a negative set of 1353 entries for training on DNA-binding proteins, and a positive set of 871 and a negative set of 1120 entries for training on RNA-binding proteins. In total, 104 feature vectors [composition (C), transition (T), and distribution (D)] were calculated for each amino acid group classified by four physicochemical properties (hydrophobicity, van der Waals volume, polarity, and polarizability) and were used to generate the hyperplane for each protein family.27,28 To compare the prediction performance of SVM-Prot fairly with that of our method, the proteins that were predicted as DNA/RNA-related with the highest probability were regarded as DNA/RNA-binding proteins. SVM-Prot was applied to the representative data set, and ACC was calculated as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
2.5. Data fusion and threshold determination
Two heterogeneous data sets (amino acid periodicity and amino acid composition) were integrated by three different approaches (early, intermediate, and late integration).29 Early integration trains a single SVM on a combined feature vector of 253 (periodicity) and 20 (composition) dimensions. For intermediate integration, kernel values (kernel matrices) are computed separately for each data type, and the summed kernel values are used for SVM training. Late integration trains an SVM on each data type separately and then sums the discriminant values (i.e. CO + PD score). Three evaluation indices, SE, SP, and ACC, were calculated to evaluate the three integration methods.
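The difference between the three integration schemes can be made concrete with a toy sketch. All function names here are hypothetical, and a linear kernel is assumed for both data types purely to keep the example short, whereas the study selected a radial basis kernel for the periodicity data.

```python
def dot(u, v):
    """Linear kernel: inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def early_integration(periodicity, composition):
    """Concatenate the two feature vectors (253 + 20 = 273 dimensions);
    a single SVM would be trained on the result."""
    return list(periodicity) + list(composition)


def intermediate_kernel(per_a, comp_a, per_b, comp_b):
    """Kernel value for a pair of proteins, computed per data type and summed.
    Assumption: linear kernels for both data types."""
    return dot(per_a, per_b) + dot(comp_a, comp_b)


def late_integration(co_score, pd_score):
    """Sum of the discriminant values of two separately trained SVMs."""
    return co_score + pd_score


per_a, comp_a = [0.1] * 253, [0.05] * 20
per_b, comp_b = [0.2] * 253, [0.05] * 20
print(len(early_integration(per_a, comp_a)))
print(intermediate_kernel(per_a, comp_a, per_b, comp_b))
print(late_integration(0.4, 0.3))
```

With linear kernels, early and intermediate integration give the same kernel value; the two schemes differ once a non-linear kernel is used for one of the data types.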
Thresholds for extracting DNA/RNA-binding proteins were determined by considering several indices. One index, the positive predictive value (PPV), was adopted to measure the percentage of DNA/RNA-binding proteins among the proteins above a threshold (blue line in Supplementary Fig. S1). PPV is calculated as follows:
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
The final prediction decision is based on the Matthews correlation coefficient (MCC),30 which was used to determine the threshold value of the CO + PD score. MCC is a popular index for measuring prediction performance; the threshold that maximizes MCC balances sensitivity and specificity.
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
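The five indices used in this section can be computed from a confusion matrix as in the sketch below. The function name `evaluation_indices` is hypothetical. The example counts are not stated directly in the text; they are inferred from Section 3.3, which reports 94 proteins above the CO + PD threshold of 0.6, 82 of them DNA/RNA-binding, out of 157 DNA/RNA-binding proteins and 320 proteins with other functions.

```python
import math


def evaluation_indices(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, PPV, and MCC from confusion counts."""
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "PPV": ppv, "MCC": mcc}


# Counts inferred from Section 3.3 (an assumption of this sketch):
# TP = 82, FP = 94 - 82 = 12, FN = 157 - 82 = 75, TN = 320 - 12 = 308.
class_one = evaluation_indices(tp=82, fp=12, tn=308, fn=75)
for name, value in class_one.items():
    print(f"{name} = {value:.3f}")
```

Under these inferred counts, SE, SP, ACC, and PPV agree with the percentages reported for the first threshold in Section 3.3. Scanning candidate thresholds and keeping the one with the largest MCC would give the second kind of threshold described above.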
2.6. Construction of expression vectors and purification of His-tagged recombinant proteins
Genomic DNA of P. furiosus DSM3638 was isolated using a GNOME kit (BIO101, La Jolla, CA, USA) and partially digested with the restriction enzyme Sau3AI. The resulting DNA fragments were fractionated by electrophoresis on a 0.7% agarose gel. Fragments of 15 kb were extracted from the gel and used as templates for PCR cloning. After PCR amplification using site-specific primers (Supplementary Table S4) with NdeI and XhoI sites at the 5' and 3' termini, respectively, each of the candidate genes was cloned into the pET-23b expression vector (Novagen, Madison, WI, USA). Insert DNA was sequenced and shown to be identical to database sequences.
Recombinant proteins were prepared as described previously.8 Briefly, E. coli strain BL21(DE3) was transformed with each expression plasmid; however, optimal protein production required E. coli strain BL21(DE3)pLysS for the expression of PF0565 and PF1473 proteins and strain HMS174(DE3)pLysS for the expression of PF1498. Transformants were grown at 37°C in Luria–Bertani medium containing 50 µg/mL ampicillin and supplemented with 0.4 mM isopropylthio-β-galactoside. After 14–16 h of further growth at 30°C, cells were harvested by centrifugation (5000g for 10 min at 4°C), and the recombinant proteins were released by sonication (2 min) in buffer A (20 mM Tris–HCl, pH 8.0, 5 mM imidazole, 500 mM NaCl, 0.1% NP40). The extracts were heat-treated at 85°C for 15 min to denature endogenous E. coli proteins and then centrifuged at 12 000g for 10 min at 4°C to remove cellular debris. The recombinant proteins were purified on a Ni2+–Sepharose column according to the manufacturer's instructions (Amersham Pharmacia, Piscataway, NJ, USA). The peaks of the eluted proteins were pooled and dialyzed against buffer B (50 mM Tris–HCl, pH 8.0, 1 mM EDTA, 0.02% Tween 20, 7 mM 2-mercaptoethanol, 10% glycerol).
2.7. Gel-shift assay
The 5' end FAM-labeled oligonucleotides were chemically synthesized by Hokkaido System Science Co. (Hokkaido, Japan). Binding reactions containing the oligonucleotide (125 or 500 nM) and 0.1–0.5 µg of purified recombinant protein were incubated for 15 min at either room temperature (24°C) or 75°C in 20 µL of DNA/RNA-binding buffer (10 mM Tris–HCl, pH 7.5, 50 mM NaCl, 0.5 mM EDTA, 2.5 mM MgCl2, 5% glycerol, 1 mM dithiothreitol). The DNA/RNA–protein complexes were analyzed by 6% non-denaturing PAGE. The quantity of DNA/RNA–protein complexes was evaluated by scanning the fluorescent image with a computerized image analyzer, FX Pro (Bio-Rad Laboratories, Hercules, CA, USA). We used the following two oligonucleotide probes (Xiaojing et al., to be published separately):
- MPOR-27, 5'-r(GAAACAAGGAGAAAUGGUUCGUGUCCU)-3',
- MPOD-27, 5'-d(GAAACAAGGAGAAATGGTTCGTGTCCT)-3'.
| 3. Results and discussion |
|---|
3.1. Functional annotation of P. furiosus proteome and those of other model species
P. furiosus, S. solfataricus, B. subtilis, E. coli, C. elegans, and A. thaliana were used as model species. The hyperthermophilic archaeon P. furiosus was chosen for its topical importance in the evolution of the ancient architecture of DNA/RNA regulation31
From the EMBL database (Release 83, June 2005), we extracted reliable protein function data for P. furiosus (EMBL accession number AE009950) by unifying information from the three annotated databases Swiss-Prot, TrEMBL, and GOA.32,33 We defined three categories of proteins on the basis of the number and quality of annotations (see Materials and methods). For example, the 2057 P. furiosus proteins were categorized into 942 functionally known proteins, 121 proteins with putative function, and 994 hypothetical proteins. To eliminate proteins with similar amino acid sequences, we performed a homology search among the 942 functionally known proteins using BLASTP (E-value < 1 × 10⁻⁴) and reduced the protein data set to 477 non-redundant proteins for the periodicity analysis. To facilitate their use as a training data set for SVM learning, these functionally known proteins were further divided into 157 DNA/RNA-binding proteins and 320 proteins with other functions. The same procedure was applied to the EMBL data of the archaeon S. solfataricus (EMBL accession number AE006641) and the Swiss-Prot entries of the B. subtilis, E. coli, A. thaliana, and C. elegans proteomes (Table 1).
3.2. Amino acid periodicity score (PD score) and prediction of the DNA/RNA-binding proteins
To ascertain common features of amino acid periodicity throughout the DNA/RNA-binding protein sequences, we defined 23 amino acid groups using eight physico-chemical profiles (chemical, Sneath, Dayhoff, Stanfel, functional, charge, structural, and hydrophobicity). We prepared a total of 253 patterns of amino acid periodicities (23 groups × 11 non-redundant periodicities). For each training data set in the six model species, the relative coverage of the periodic region R was calculated for the 253 individual amino acid periodicities and used as the feature vector for SVM input. Radial basis function SVM classification was performed with default parameters using the software Gist, which allows users to apply a sophisticated machine-learning algorithm to the data.23 To quantitatively evaluate DNA/RNA-binding proteins at the proteome level, the discriminant value derived by SVM was defined as a novel index, the periodicity score (PD score), and was assigned to the representative protein data set of each of the six model species.
The performance of the PD score as a DNA/RNA-binding protein classifier was evaluated by applying ROC analysis to the representative set of P. furiosus proteins (Fig. 1). The sensitivity and specificity of the PD score exceeded those of various individual amino acid periodicities such as RK7, CDEGHKNQRSTY8, MNQ19, and AFILMPVW11. This indicates that combining amino acid periodicities into a single feature vector improves the classification of DNA/RNA-binding proteins.
To further validate the performance of the PD score, we compared it with the amino acid composition-based classifier and SVM-Prot.27
3.3. Both CO score and PD score are required for efficient classification of DNA/RNA-binding protein predictions
For the efficient classification of DNA/RNA-binding proteins, we tested three methods (early, intermediate, and late integration) for integrating heterogeneous data sets in the context of SVM learning.29
To provide further insights into the relative performance of the CO and PD scores and to extract efficient DNA/RNA-binding protein candidates, we chose the late integration method and carried out a two-dimensional correlation analysis of the CO score and PD score for the 477 functionally known proteins in P. furiosus. The correlation coefficient was r = 0.75 for all functionally known proteins and r = 0.55 for DNA/RNA-binding proteins only. The 157 DNA/RNA-binding proteins were located in the upper-right region of the two-dimensional plot, suggesting that both the CO and PD scores are required for classifying proteins with DNA- or RNA-binding activity (red and blue circles in Fig. 2A). Two different thresholds were then determined on the basis of this newly defined CO score + PD score. The first threshold is based on the highest overall accuracy (ACC), at a CO + PD score of 0.6, and the second threshold is based on the highest MCC, at a CO + PD score of 0.13. According to Supplementary Fig. S1, the first threshold optimizes the extraction of reliable candidates for novel DNA/RNA-binding proteins (SE = 52.2%, SP = 96.3%, ACC = 81.8%, and PPV = 87.2%), and the second threshold optimizes the classification performance of the CO + PD score (SE = 82.2%, SP = 80%, ACC = 80.7%, and PPV = 66.8%). On the basis of these thresholds, we classified proteins into three classes (classes I–III) (Fig. 2A). As a result, a total of 94 proteins, including 82 DNA/RNA-binding proteins (Fig. 2B), were categorized as class I proteins (CO + PD score > 0.6).
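The class assignment can be sketched as a simple threshold rule on the summed score. Only the class I condition (CO + PD score > 0.6) is stated explicitly in the text; placing class II between the two thresholds and class III at or below the second one is an assumption of this sketch, as are the function name `assign_class` and the handling of values exactly on a boundary. The threshold values are used as printed.

```python
def assign_class(co_score, pd_score, upper=0.6, lower=0.13):
    """Assign a protein to class I, II, or III from its CO + PD score.
    Assumption: class II lies between the two thresholds, class III below."""
    total = co_score + pd_score
    if total > upper:
        return "I"
    if total > lower:
        return "II"
    return "III"


for co, pd in [(0.5, 0.3), (0.2, 0.1), (-0.5, 0.1)]:
    print(co, pd, "->", assign_class(co, pd))
```

Keeping the thresholds as parameters makes it easy to re-derive the classes if different cut-offs are chosen for another organism.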
Further observation of DNA/RNA-binding proteins revealed a region-specific distribution of ribosomal proteins and other DNA/RNA-binding proteins. Ribosomal proteins are strongly affected by the CO score and are dominant in the high CO score range (CO > 0.5). The CO scores of other DNA/RNA-binding proteins ranged between 0 and 0.5, but some of these proteins were dominant in the high PD score region (0.25–1.5). As shown in Supplementary Table S1, this region includes 13 tRNA-processing enzymes (e.g. tRNA-synthetases, CCA-adding enzymes, and RNase P subunits), 11 DNA-binding proteins (e.g. DNA polymerase, DNA helicase, DNA primase, and reverse gyrase), three ribosomal proteins (e.g. ribosomal protein S3P and ribosomal protein L14e), and various transcription/translation-related proteins (e.g. SRP54, HTH-type transcriptional regulator, and transcription termination–anti-termination factor). We assume that the PD score is an effective means of identifying DNA/RNA-binding proteins within a set of proteins that cannot be distinguished by amino acid composition alone.
3.4. Selection and experimental verification of novel DNA/RNA-binding protein candidates
The same procedure was applied to the 994 hypothetical proteins in P. furiosus. The two-dimensional plot of the hypothetical proteins was similar to that of the functionally known proteins, as was the ratio of proteins in classes I–III (Fig. 2 versus Fig. 3). However, fewer proteins fell in the high CO score region (CO score > 0.5), which is known to be dominated by ribosomal proteins. As a result, the 994 hypothetical proteins were classified into three classes (I–III) according to the CO + PD score thresholds (Supplementary Table S2), and a total of 151 proteins were classified as strong candidates for novel DNA/RNA-binding proteins.
To verify that hypothetical proteins in class I actually possess DNA/RNA-binding activities, we randomly chose 17 hypothetical proteins from classes I–III (nine from class I, five from class II, and three from class III; Table 3). All 17 recombinant proteins were overexpressed in E. coli and purified to near homogeneity (Fig. 4). To study the DNA/RNA-binding properties of the candidate proteins, we first carried out gel-shift assays using the 5' FAM-labeled, 27 bp, multipotential oligoprobe RNA (MPOR-27) (Fig. 5A). MPORs potentially possess four different secondary RNA structures (stem, bulge, loop, and single strand), which encompass the currently known structures corresponding to the activities of various RNA-binding proteins. Three proteins (PF0871, PF0678, and PF0840) aggregated in the loading well, so we removed them from the final results. A prominent shift of the RNA probe up the gel was observed for candidate proteins PF0029, PF0030, PF0565, PF1139, PF1473, PF1580, PF1912, and PF2062 (Fig. 5B). Interestingly, the formation of certain nucleic acid–protein complexes appears to be temperature dependent. For example, PF1981 showed a significant shift at 75°C but not at 24°C (Fig. 5C versus 5B). PF0029, PF0030, PF1139, and PF1580 also showed binding affinity for the multipotential oligoprobe DNA, MPOD-27 (data not shown). No significant shifts were observed for PF0547, PF1142, PF1488, PF1498, or PF1913, though agarose gel analysis of purified PF1498 revealed it to be a potential protein–nucleic acid complex (Fig. 5D).
During our investigation, six out of seven class I proteins, three out of four class II proteins, and one class III protein have shown potential DNA/RNA-binding activities (Table 3). According to our previous works,8
According to the domain assignment of the InterPro/Pfam domain databases20,21 against the P. furiosus proteome, 95–98% of the 942 functionally known proteins possessed domains related to those with known function (functional domains). In contrast, among the 994 hypothetical proteins, only 31–38% possessed functional domains, 20% possessed domains of unknown function (DUF/UPF), and the remaining 43–50% lacked domain annotation (Supplementary Fig. S2). According to Supplementary Table 3, at least four of the 10 newly discovered DNA/RNA-binding proteins are ORFans (PF0029, PF0030, PF0565, and PF1981), which completely lacked sequence similarity (E-value > 0.1) to any of the Swiss-Prot protein entries. The remaining six proteins showed sequence similarity to the uncharacterized proteins of their nearest BLASTP hits (0.0 < E-value < 8.00e−07), which were conserved among Pyrococcus and Methanococcus; these include two proteins (PF1473 and PF2062) with no Pfam domain annotation. Hence, we believe that the combination of CO score and PD score is a powerful indicator for predicting proteins with potential DNA/RNA-binding activities among sequence-specific ORFans and proteins having no obvious functional domains, although the sample size of validated proteins is still small.
3.5. Possible explanation of charged amino acid periodicity with DNA/RNA-binding activities
As the amino acid composition within proteins varies among taxa,34 our method removes the need to allow for the evolutionary gain and loss of amino acids and increases the generalization capability of SVM training. Charged amino acids, especially basic amino acids, have previously been suggested as a key component of nucleic acid-binding activity; for example, arginine-rich regions of the Drosophila melanogaster suppressor of sable gene35 are thought to mediate specific RNA-binding activity. Similar features have been observed in the structural motifs of DNA/RNA-binding proteins that possess positive electrostatic potentials in the binding region.36 On the basis of electrostatic potential, negatively charged amino acids (DE) conflict with DNA/RNA binding. However, recent work has revealed that negative peptide charges contribute significantly to the electrostatic free energy of positively charged peptides and affect RNA binding,37 suggesting the importance of not only basic regions but also, in some cases, acidic regions at the protein surface for establishing DNA/RNA-binding functions. These results suggest that the relative compositions of charged amino acids, especially basic amino acids, are very important for nucleic acid binding. In addition, our study has shown that a certain class of DNA/RNA-binding proteins was efficiently classified by integrating amino acid periodicity with amino acid composition. In particular, charged amino acid periodicities were observed throughout the sequences of various DNA/RNA-binding proteins, suggesting that not only the amino acid composition of the DNA/RNA-binding domain region but also the overall sequence feature of amino acid periodicity is useful for classifying DNA/RNA-binding proteins.
To gain insight into the relationship between charged amino acid periodicities and DNA/RNA-binding activity, a schematic representation of charged amino acid groups that appear periodically in the amino acid sequences of various DNA/RNA-binding proteins is given in Supplementary Fig. S3. Three proteins, the 54 kDa subunit of the signal recognition particle (SRP54), DNA primase, and the HTH-type transcriptional regulator lrpA, were chosen as examples because they have a low CO score (0.03 < CO score < 0.22) but a relatively high PD score (0.56 < PD score < 0.88). Periodicities of both positively and negatively charged amino acids, with various periods, were found widely throughout the protein primary sequences. The amino acid residues creating the periodicity (oblong boxes in Supplementary Fig. S3) are often conserved in the three-dimensional structures of orthologous proteins. The periodic regions also cover the DNA/RNA-recognition motifs known as the M domain and helix-turn-helix, as well as the active site of DNA primase. Our current study has shown that the overall periodic features of charged amino acids throughout the protein primary structure may affect the organization of the secondary structures or the net charge of the protein surface in the tertiary structure in a certain class of DNA/RNA-binding proteins. Further detailed analysis of the relationship between DNA/RNA-binding capacity and specific amino acid periodicity will be an important task, with the help of other bioinformatics approaches such as the use of DNA/RNA-binding site prediction software,38 a comparative genomics approach that predicts function on the basis of the comparison of various domains,39 and three-dimensional protein models.40
In conclusion, we have presented a new method for predicting novel DNA/RNA-binding proteins at the proteome level by focusing on the compositions and periodicities of amino acids with similar physico-chemical profiles (quantified as two novel indices, the CO score and PD score). The two-dimensional correlation analysis of CO score and PD score effectively separated DNA/RNA-binding proteins from other functionally known proteins in P. furiosus as class I proteins. By applying the same method to the 994 hypothetical proteins, we extracted a list of 151 hypothetical proteins as novel DNA/RNA-binding protein candidates. Ten proteins with potential DNA/RNA-binding activities were identified experimentally, including four ORFans and two proteins with no domains. The two-dimensional correlation analysis of CO score and PD score is applicable to any organism with complete genomic data, and our method is efficient for evaluating hypothetical proteins on the basis of DNA/RNA-binding function. The CO + PD scores can be further integrated with prediction results from various protein function predictors and annotation methods to validate uncharacterized proteins comprehensively. Further, the investigation of these newly discovered DNA/RNA-binding proteins might elucidate the role of undiscovered protein–DNA/RNA networks and the recognition of many non-conserved proteins throughout entire species.
| Supplementary Data |
|---|
Supplementary data are available online at http://dnaresearch.oxfordjournals.org.
| Acknowledgements |
|---|
We thank Asako Sato (Keio University, Japan) for technical assistance with the gel-shift assay. We also thank Jun Imoto, Nozomu Yachie, Shinichi Kikuchi, and Rintaro Saito (Keio University, Japan) for their helpful discussions. This research was supported in part by the Project for Development of a Technological Infrastructure for Industrial Bioprocesses in Research and Development of New Industrial Science and Technology Frontiers, the Ministry of Economy, Trade and Industry (METI), the New Energy and Industrial Technology Development Organization (NEDO) of Japan; a Grant-in-Aid for Scientific Research on Priority Areas; a Grant-in-Aid for the 21st Century Center of Excellence (COE) Program entitled Understanding and Control of Life's Function via Systems Biology (Keio University); the Computer Simulation Project, Ministry of Education, Culture, Sport, Science and Technology, Japan; and Keio University.
| Footnotes |
|---|
* To whom correspondence should be addressed. Tel. +81 235-29-0524. Fax. +81 235-29-0525. E-mail: akio{at}sfc.keio.ac.jp