DNA Research 2005 12(3):211-214; doi:10.1093/dnares/dsi007

© The Author 2005. Kazusa DNA Research Institute

Short Communications

HTself: Self–Self Based Statistical Test for Low Replication Microarray Studies

Ricardo Z. N. Vêncio1,* and Tie Koide2

1BIOINFO-USP—Núcleo de Pesquisas em Bioinformática, Universidade de São Paulo Rua do Matão 1010, 05508-090 São Paulo, Brazil
2Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo Av. Prof. Lineu Prestes 748, 05508-090 São Paulo, Brazil

Received 7 January 2005; revised 13 April 2005


    Abstract
 
Different statistical methods have been used to classify a gene as differentially expressed in microarray experiments. They usually require a number of experimental observations to be adequately applied. However, many microarray experiments are constrained to low replication designs for different reasons, from financial restrictions to scarcely available RNA samples. Although these experiments are performed in a high-throughput framework, there are too few experimental replicates for each gene to allow the use of traditional or state-of-the-art statistical methods. In this work, we present a web-based bioinformatics tool that deals with real-life problems concerning low replication experiments. It uses an empirically derived criterion to classify a gene as differentially expressed by combining two widely accepted ideas in microarray analysis: self–self experiments to derive intensity-dependent cutoffs and non-parametric estimation techniques. To help laboratories without a bioinformatics infrastructure, we implemented the tool in a user-friendly website (http://blasto.iq.usp.br/~rvencio/HTself).

Key words: microarray; self–self; homotypical; web server; statistical test; low cost; differential gene expression


DNA microarray technology has allowed the study of gene expression on a genomic scale, changing the paradigm of expression studies from a single gene to a high-throughput framework. As this technology becomes more affordable, more laboratories can use it as a routine technique.[1] By comparing two samples labeled with different fluorescent dyes, one can classify a gene as differentially expressed (or divergent, if dealing with genomic hybridizations) using a variety of statistical methods.[2–4] The ideal design of microarray experiments consists of having as many biological and technical replicates as possible, so that the data can be analyzed using state-of-the-art statistical tools. Unfortunately, it is not always possible to fulfill these replication requirements.

For instance, in laboratories with financial restrictions, microarrays are used as a high-throughput screening tool. In this case, it is preferable to perform low-replication experiments and test different biological conditions. Another example is the study of rare human diseases. This kind of research is naturally constrained to low replication, since the RNA available usually comes from only one or two patients. Although not ideally replicated, these studies are undoubtedly important. However, they cannot be properly analyzed using traditional or state-of-the-art statistical methods, which require a number of replicates and assume certain hypotheses about the distribution of the samples that cannot be verified.

The aim of this work is to provide an easy-to-use bioinformatics solution for the analysis of microarrays constrained to low replication. To achieve our objective, we explored simultaneously two widely accepted ideas in microarray analysis: the determination of intensity-dependent cutoffs using self–self experiments[5–7] and the use of non-parametric methods.[8–10] Our contribution is to implement a web-based tool to help the analysis of microarray datasets with low replication designs. The web-based interface is freely available at http://blasto.iq.usp.br/~rvencio/HTself.

When analyzing microarray data, a major question is how to classify a gene as differentially expressed. To answer this question, it is necessary to set a cutoff level for hybridization intensity ratios that permits one to decide whether a gene is differentially expressed or not. In mathematical terms, this step consists of testing the null hypothesis H0: ‘the spot has no differential hybridization between the two probed samples’.

There are many mathematical approaches to define cutoffs and reject H0.[2–4] A simple and widely used strategy consists of arbitrarily choosing a constant ratio, commonly the 2-fold change threshold. Spotted genes with ratios above this threshold are considered differentially hybridized. To bring some statistical rigor, it is common to perform traditional statistical tests such as the t-test, using log-ratios and an arbitrary threshold. The test provides a p-value to assess the significance level for a given gene. To be adequately applied, one has to verify that the log-ratios for the given gene are normally distributed and that the number of observations is not scarce. Another approach is to assume a statistical model for the whole slide behavior (commonly a Student's t-like or a normal model), define it as the null probability density function (pdf) and search for outliers.[2–4] Again, this strategy requires the data to be distributed according to some known and arbitrarily proposed model. Since this assumption does not always hold for microarray data, different non-parametric procedures have been proposed to define the null pdf of the hybridization log-ratios for a given gene.[8–10] However, since they are usually based on resampling, permutation, standard deviation estimation, order/rank statistics, etc., they might not be a good choice to derive the pdf for an individual gene with few experimental observations.
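As a point of reference, the constant fold-change strategy mentioned above can be sketched as follows. This is only an illustrative baseline: the function name, the example intensities and the choice of cy5/cy3 as the ratio direction are assumptions, not part of the tool described here.

```python
import math

def exceeds_fold_change(cy3, cy5, fold=2.0):
    """Return True when the cy5/cy3 ratio differs by at least `fold` in either direction."""
    m = math.log2(cy5 / cy3)          # log-ratio (assumed direction: cy5 over cy3)
    return abs(m) >= math.log2(fold)  # same threshold for every spot, whatever its intensity

# Hypothetical (cy3, cy5) intensities for three spots
spots = {"geneA": (1000.0, 2500.0), "geneB": (1200.0, 1300.0), "geneC": (800.0, 300.0)}
flagged = [name for name, (cy3, cy5) in spots.items() if exceeds_fold_change(cy3, cy5)]
print(flagged)
```

Note that the threshold ignores spot intensity entirely, which is the limitation that intensity-dependent cutoffs are meant to address.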

Another category of approaches to define cutoffs relies on experimental strategies such as the use of self–self hybridizations. Self–self experiments are performed by labeling the same biological material with either Cy3 or Cy5 dye and hybridizing both labeled samples simultaneously on the same microarray slide. This strategy has been used to derive intensity-dependent cutoffs to classify a gene as differentially expressed[5,6,11] or divergent in comparative genomic hybridization (CGH) studies.[7,12] The comparative analysis of constant fold change cutoffs and intensity-dependent ones has been extensively discussed, showing a superior performance of the intensity-dependent strategy.[5–7,11,12]

In our tool, we make use of self–self experiments to derive the null probability density function of the test. Since the null hypothesis ‘there is no differential hybridization between the two probed samples’ holds for all the genes in a self–self experiment, it is possible to escape from the gene-by-gene scheme and use all the spotted genes to derive the null pdf. With an adequate number of observations (all the spotted genes), the use of non-parametric methods becomes feasible. To take into account the intensity-dependent feature of the data, the null pdf is estimated in a user-defined sliding window, which slides over the whole range of the spots' intensity measure. This procedure results in the determination of intensity-dependent cutoffs that are readily applicable to non-self–self experiments. It is implicitly assumed that the same stochastic processes that generated the experimental noise in self–self experiments are acting in non-self–self data. Therefore, log-ratios above or below the intensity-dependent cutoffs can be classified as differentially expressed. The use of these experimentally derived cutoffs relaxes the requirement of replicates, since it does not rely on standard deviation estimations, resampling or permutations. Moreover, it adds an empirically derived criterion to classify a gene as differentially expressed in studies constrained to low replication.

Our web-based tool expects a normalized data set as input. Microarray data usually must be normalized due to multiplicative biases such as unequal brightness of fluorescent dyes, unequal incorporation rate of dyes, etc. Such preprocessing procedures are well discussed and web-tools to address this problem are available elsewhere.[13–15] Next, we will describe the mathematical details behind our method.

Let A = log2(cy3)/2 + log2(cy5)/2 and M = log2(R), as usual in microarray analysis,[16] be the random variables of interest, where cy3 and cy5 are the fluorescence intensities and R is the suitably normalized intensity ratio. To represent our measurement, we prefer to use the M–A plot, where the variable A shows the dependence of the log-ratios on the average spot intensity. The procedure can be used with arbitrary reparametrizations of hybridization ratio and measurements of fluorescence intensities. An observation of a spot s is one realization of (A, M) and is denoted by (a_s, m_s). Therefore, self–self hybridization measurements are samples drawn from the (A, M) bidimensional null joint pdf. To find the intensity-dependent log-ratio cutoffs, we first select a sliding window in A, which is defined by the user. The observed spots (a_s, m_s) contained in this window will be used to define the M|A null pdf locally. This pdf is estimated by applying the Gaussian Kernel Density Estimator.
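The change of variables from raw intensities to the (A, M) coordinates can be illustrated with a minimal sketch. The function name is hypothetical, and the ratio direction R = cy5/cy3 is an assumption, since the text only says that R is the normalized intensity ratio.

```python
import math

def ma_transform(cy3, cy5):
    """Map a pair of (already normalized) intensities to the (A, M) coordinates."""
    a = 0.5 * math.log2(cy3) + 0.5 * math.log2(cy5)  # average log-intensity
    m = math.log2(cy5 / cy3)                         # log-ratio, assuming R = cy5/cy3
    return a, m

# A spot whose cy5 signal is four times its cy3 signal
a, m = ma_transform(1024.0, 4096.0)
print(a, m)
```

A spot with equal signal in both channels has M = 0 regardless of A, which is the expected behavior of every spot in an ideal self–self hybridization.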

The Kernel Density Estimator is a model-free method that approximates the probability density function of a random variable using observations sampled from it.[17] Let f be the pdf of a random variable X and x_1, ..., x_n be n observed samples. The estimator for f is

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where the ‘hat’ over f indicates an estimator, h is the bandwidth and K is the kernel function. For example, a simple histogram can be described by a particular Kernel Density Estimator:

$$K(u) = \begin{cases} 1/2, & |u| < 1 \\ 0, & \text{otherwise} \end{cases}$$

The Gaussian Kernel Density Estimator is the best known and is the one used in our tool:

$$\hat{f}(x) = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$$

The formulae above can be intuitively interpreted as a smoothing process for the histogram.
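A Gaussian Kernel Density Estimator of the kind described above can be sketched with the standard library alone: each observation contributes a Gaussian bump of width h, and the bumps are averaged. The function names and sample values are hypothetical, and the bandwidth rule shown is the common normal-reference rule of thumb, used here as an assumption because the text does not state how the tool selects h.

```python
import math
import statistics

def gaussian_kde(samples, h):
    """Return a function x -> estimated density, using a Gaussian kernel of bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return f_hat

def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth (assumption: not necessarily the tool's choice)."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-0.2)

# Hypothetical log-ratios M falling inside one window of A
m_values = [-0.4, -0.1, 0.0, 0.05, 0.2, 0.6]
h = rule_of_thumb_bandwidth(m_values)
f_hat = gaussian_kde(m_values, h)
print(f"h = {h:.3f}, density at M = 0: {f_hat(0.0):.3f}")
```

A smaller h follows the individual observations more closely, while a larger h gives a smoother and flatter estimate.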

After estimating the null pdf of M for a given A window, the user-defined credibility interval can be determined. In short, our algorithm to define intensity-dependent cutoffs is:

  1. the user defines a sliding window for the A axis by inputting two parameters: the window size and the walking pace. Each step of the sliding window delimits a subinterval of A;
  2. for each subinterval of A selected in step 1, estimate the probability density function of M|A using the Gaussian Kernel Density Estimator;
  3. integrate the probability density function from step 2 around the mode until the user-defined probability α is reached. The intervals obtained are called the α credibility intervals;
  4. steps 2 and 3 are repeated until the window has slid over the whole A range.
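The four steps above can be sketched end to end as follows. This is a simplified reading under stated assumptions, not the tool's actual code: the density is evaluated on a finite grid, the interval is grown outward from the mode by always adding the neighboring grid cell with the higher density, the bandwidth comes from a normal-reference rule of thumb, and windows with fewer than `min_spots` observations are skipped. All function names, `min_spots` and the simulated self–self data are hypothetical.

```python
import math
import random
import statistics

def bandwidth(values):
    # Assumption: normal-reference rule of thumb; the text does not say how h is chosen.
    return max(1.06 * statistics.stdev(values) * len(values) ** (-0.2), 1e-6)

def kde_on_grid(values, h, grid):
    norm = 1.0 / (len(values) * h * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - v) / h) ** 2) for v in values)
            for g in grid]

def credibility_interval(values, level, points=201):
    """Grow an interval around the KDE mode until it holds `level` of the mass."""
    h = bandwidth(values)
    lo, hi = min(values) - 3 * h, max(values) + 3 * h
    step = (hi - lo) / (points - 1)
    grid = [lo + i * step for i in range(points)]
    dens = kde_on_grid(values, h, grid)
    total = sum(dens)
    left = right = max(range(points), key=dens.__getitem__)  # index of the mode
    mass = dens[left]
    while mass < level * total and (left > 0 or right < points - 1):
        grow_left = left > 0 and (right == points - 1
                                  or dens[left - 1] >= dens[right + 1])
        if grow_left:
            left -= 1
            mass += dens[left]
        else:
            right += 1
            mass += dens[right]
    return grid[left], grid[right]

def intensity_dependent_cutoffs(spots, window, pace, level, min_spots=10):
    """Slide a window over A; return (window center, lower cutoff, upper cutoff)."""
    a_values = [a for a, _ in spots]
    start, a_max = min(a_values), max(a_values)
    cutoffs = []
    while start <= a_max:
        m_in = [m for a, m in spots if start <= a < start + window]
        if len(m_in) >= min_spots:  # skip windows too sparse to estimate a density
            lower, upper = credibility_interval(m_in, level)
            cutoffs.append((start + window / 2, lower, upper))
        start += pace
    return cutoffs

# Simulated self-self data: noise in M shrinks as the intensity A grows
random.seed(0)
spots = []
for _ in range(1000):
    a = random.uniform(6.0, 16.0)
    sd = 0.1 + 2.0 ** (-(a - 6.0) / 2.0)
    spots.append((a, random.gauss(0.0, sd)))

cutoffs = intensity_dependent_cutoffs(spots, window=1.0, pace=0.3, level=0.995)
for center, lower, upper in cutoffs[::8]:
    print(f"A={center:5.2f}  lower={lower:6.2f}  upper={upper:6.2f}")
```

With data simulated this way, the cutoffs are wide at low A and narrow at high A. The window size, pace and credibility level in the call mirror the three user-defined parameters named in the text.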

Figure 1 shows a snapshot of the algorithm at an arbitrarily chosen step. It was performed using the self–self data from a genomotyping study in the bacterium Xylella fastidiosa. Figure 2 shows the result of the self–self derived intensity-dependent cutoffs for these data. Since we know that there should not exist true differential hybridization in self–self experiments, it is clear that the commonly used 2-fold change would be conservative for high intensity spots and permissive for low intensity ones.



Figure 1. Snapshot of one step of the sliding window process. The left panel shows the MA-plot of the self–self data from a genomotyping CGH study.[12] The subinterval A considered in this snapshot is highlighted between the vertical lines. The histogram of M shown on the right was constructed using these highlighted observations. The Kernel Density Estimator (dark line) and the boundaries of the 99.5% credibility interval (vertical lines) are also shown. These boundaries define the intensity-dependent cutoffs, shown on the MA-plot (dark points) along with the results from previous steps. See all the steps in Supplementary Figures at the web-site (http://blasto.iq.usp.br/~rvencio/HTself).

 


Figure 2. Web-site output for intensity-dependent cutoff determination. The self–self data used as an input example for the web-based tool is derived from a Xylella fastidiosa CGH study.[12] The dark lines are the upper and lower cutoffs. They were obtained by the sliding window process using a 99.5% credibility level, 0.3 pace and 1.0 window size.

 
After defining the intensity-dependent cutoffs, different microarray experiments performed under the same technical conditions as the self–self data can be evaluated. For example, suppose that a spot measurement (a, m) shows a log-ratio m outside its intensity-dependent 99% credibility cutoff. It can be classified as a differentially expressed spot, since a log-ratio this extreme would arise from random technical error alone only 1% of the time. This hypothesis test is applied to all spots. Since the test is applied to an individual spot, it does not depend on the number of replicates. If one has a number of replicated observations for a given gene, after applying the test to each spot, it is easy to evaluate whether they are above or below the intensity-dependent cutoff and classify the gene as differentially expressed. Our tool has been successfully applied to a recently published gene expression study in Xylella fastidiosa.[18] It can also be useful for CGH studies.[12]
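Applying the cutoffs to a non-self–self experiment can be sketched as a per-spot lookup followed by a simple combination across replicates. Several choices here are assumptions made for illustration: cutoffs are stored as (window center, lower, upper) triples, each spot uses the window whose center is nearest to its A value, and a gene is called only when at least `min_fraction` of its replicate spots agree in direction. None of these names or rules are taken from the tool itself.

```python
def classify_spot(a, m, cutoffs):
    """Return +1 (above upper cutoff), -1 (below lower cutoff) or 0 (within the null band)."""
    # Assumption: use the cutoffs of the window whose center is nearest to a
    _, lower, upper = min(cutoffs, key=lambda c: abs(c[0] - a))
    if m > upper:
        return 1
    if m < lower:
        return -1
    return 0

def classify_gene(replicates, cutoffs, min_fraction=1.0):
    """Combine per-spot calls; `min_fraction` is a hypothetical agreement requirement."""
    calls = [classify_spot(a, m, cutoffs) for a, m in replicates]
    for direction in (1, -1):
        if calls.count(direction) / len(calls) >= min_fraction:
            return direction
    return 0

# Hypothetical cutoffs: wide band at low intensity, narrow band at high intensity
cutoffs = [(8.0, -1.5, 1.5), (12.0, -0.5, 0.5)]

print(classify_spot(8.2, 1.0, cutoffs))   # 2-fold change at low intensity
print(classify_spot(12.3, 1.0, cutoffs))  # same fold change at high intensity
print(classify_gene([(12.1, 0.9), (11.8, 0.7)], cutoffs))
```

The same log-ratio of 1.0 (a 2-fold change) falls inside the null band at low intensity but outside it at high intensity, which is the behavior that distinguishes intensity-dependent cutoffs from a constant threshold.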

To use many of the available statistical tools, it is necessary to have well-replicated designs. Although many efforts have been carried out to sample as many replicates as possible, sometimes it is still difficult to achieve a well-replicated design. Financial restrictions or even biological constraints concerning rare RNA samples do not allow some researchers to analyze their microarray data according to current statistical standards. With this web-based tool, we hope to help these researchers to extract the invaluable information from their datasets constrained to low replication.


    Acknowledgements
 
R.V. and T.K. were fellows of Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). We thank Aline M. da Silva, Paulo A. Zaini, Suely L. Gomes and Sergio Verjovski-Almeida for their involvement in the early stages of this work. We thank Diogo F.C. Patrão for his help with the implementation of the web-based tool.


    Footnotes
 
*To whom correspondence should be addressed. Tel. +55-11-3091-6210, Fax. +55-11-3814-4135, Email: rvencio{at}vision.ime.usp.br

Both these authors contributed equally to this work

Communicated by Michio Oishi


    References
 

  1. Rockett, J. C. and Hellmann, G. M. 2004, Confirming microarray data—is it really necessary?, Genomics, 83, 541–549.
  2. Nadon, R. and Shoemaker, J. 2002, Statistical issues with microarrays: processing and analysis, Trends Genet., 18, 265–271.
  3. Stolovitzky, G. 2003, Gene selection in microarray data: the elephant, the blind men and our algorithms, Curr. Opin. Struct. Biol., 13, 370–376.
  4. Cui, X. and Churchill, G. A. 2003, Statistical tests for differential expression in cDNA microarray experiments, Genome Biol., 4, R210.
  5. Yang, I. V., Chen, E., Hasseman, J. P., et al. 2002, Within the fold: assessing differential expression measures and reproducibility in microarray assays, Genome Biol., 3, R62.
  6. Tu, Y., Stolovitzky, G., Klein, U. 2002, Quantitative noise analysis for gene expression microarray experiments, Proc. Natl Acad. Sci. USA, 99, 14031–14036.
  7. Kim, C. C., Joyce, E. A., Chan, K., Falkow, S. 2002, Improved analytical methods for microarray-based genome-composition analysis, Genome Biol., 11, R65.
  8. Tusher, V. G., Tibshirani, R., Chu, G. 2001, Significance analysis of microarrays applied to the ionizing radiation response, Proc. Natl Acad. Sci. USA, 98, 5116–5121.
  9. Troyanskaya, O. G., Garber, M. E., Brown, P. O., Botstein, D., Altman, R. B. 2002, Nonparametric methods for identifying differentially expressed genes in microarray data, Bioinformatics, 18, 1454–1461.
  10. Zhao, Y. and Pan, W. 2003, Modified nonparametric approaches to detecting differentially expressed genes in replicated microarray experiments, Bioinformatics, 19, 1046–1054.
  11. Papini-Terzi, F. S., Rocha, F. R., Vêncio, R. Z. N., et al. 2005, Transcription profiling of signal transduction-related genes in sugarcane tissues, DNA Res., 12, 27–38.
  12. Koide, T., Zaini, P. A., Moreira, L. M., et al. 2004, DNA microarray-based genome comparison of a pathogenic and a non-pathogenic strain of Xylella fastidiosa delineates genes important for bacterial virulence, J. Bacteriol., 186, 5442–5449.
  13. Park, T., Yi, S. G., Kang, S. H., Lee, S., Lee, Y. S., Simon, R. 2003, Evaluation of normalization methods for microarray data, BMC Bioinformatics, 4, 33.
  14. Uchida, S., Nishida, Y., Satou, K., Muta, S., Tashiro, K., Kuhara, S. 2005, Detection and normalization of biases present in spotted cDNA microarray data: a composite method addressing dye, intensity-dependent, spatially-dependent, and print-order biases, DNA Res., 12, 1–7.
  15. Vaquerizas, J. M., Dopazo, J., Diaz-Uriarte, R. 2004, DNMAD: web-based diagnosis and normalization for microarray data, Bioinformatics, 20, 3656–3658.
  16. Dudoit, S., Yang, Y. H., Callow, M. J., Speed, T. P. 2002, Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica, 12, 111–139.
  17. Silverman, B. W. 1986, Density Estimation, London, UK: Chapman and Hall.
  18. Pashalidis, S., Moreira, L. M., Zaini, P. A., et al. 2004, Whole-genome expression profiling of Xylella fastidiosa in response to growth on glucose, OMICS, 9, 77–90.
