CNIO-stats-FAQ: Frequently Asked Statistical Questions at CNIO


  1. Introduction
  2. Getting statistical help
  3. Software
  4. Miscell. stats questions (THESE are the ones that most likely interest you).

1. Introduction

This document contains answers to some questions commonly asked about statistical analyses at CNIO. The most interesting part is the last one, where we collect miscellaneous questions and answers about common statistical issues at CNIO. Before asking, please check whether your question has already been answered here. This page is eternally under construction.


2. Getting statistical help

We provide statistical consulting at CNIO. Please check the statistical consulting at CNIO usage rules.

3. Software


3.1 Software from the Bioinformatics Unit

A variety of tools are available, including clustering, searching for differentially expressed genes, Gene Ontology-related tools, normalization of microarray data, etc. Check the GEPAS tools.

3.2 Other software for microarray data analysis

Lots; a search in Google will be overwhelming. Our favourite is, first and foremost, the Bioconductor set of packages. Bioconductor runs under R and most of these packages are command-driven; thus, mouse-clicking will not take you very far. Therefore, Bioconductor tends to be used by people who think the learning effort is worth it (we do think it is worth it!).

Another useful program, which runs under Windows and might be a little more user friendly, is BRB Array Tools.


3.3 General purpose statistical software

Lots too. Of course, you have the usual commercial options (SPSS, SAS, Statistica, etc.). But you might want to give free software a try. Our favourite is, without doubt, R; we have used it for quite some time now, we use it for virtually all of our statistical analyses and most of our programming, and we have taught some courses on it. Learning R takes some time. There are some GUIs available; we don't use them much, but from our limited experience the one we like best is Rcmdr, by John Fox. Rcmdr runs with a very similar "look-and-feel" in GNU/Linux and Windows, and we can help you get it up and running.

Another free statistical system with a GUI is Arc, which is built upon Xlisp-Stat. Arc is particularly nice for regression diagnostics. It also runs under both GNU/Linux and Windows.


3.4 Help with statistical software

As explained in the statistical consulting at CNIO usage rules, we are unable to provide help with any software except that developed by us; note that we occasionally teach courses about the usage of GEPAS and related programs. We may, occasionally, help with R (installation and some basic usage). Please come to our courses if you think you will want to use R or Bioconductor.

4. Miscell. stats questions


4.1 Multiple testing: do I need to worry about it?

If you are asking, then you most likely do need to worry. There are lots of good introductions available. Check Pomelo's help page for links and papers.
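As a rough illustration of why, here is a minimal sketch in Python (our choice for illustration only; the function name count_false_positives, the 10000 genes, and the simulation setup are all assumptions and are not taken from Pomelo). It relies on the fact that, when the null hypothesis is true, p-values are uniformly distributed between 0 and 1, and it counts how many of 10000 null genes fall below 0.05 with no adjustment and with a Bonferroni adjustment.

```python
import random


def count_false_positives(n_genes=10000, alpha=0.05, seed=1):
    """Count 'significant' genes when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under a true null hypothesis a p-value is uniform on [0, 1],
    # so we can simulate the p-values directly.
    p_values = [rng.random() for _ in range(n_genes)]
    unadjusted = sum(p < alpha for p in p_values)
    # Bonferroni: compare each p-value with alpha / number of tests.
    bonferroni = sum(p < alpha / n_genes for p in p_values)
    return unadjusted, bonferroni


unadjusted, bonferroni = count_false_positives()
print("unadjusted:", unadjusted, "Bonferroni:", bonferroni)
```

Without any adjustment you should expect around 500 "significant" genes (10000 x 0.05) even though nothing at all is going on. Bonferroni is only the simplest, and a rather conservative, adjustment; it is used here just to keep the sketch short.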

4.2 t-test then cluster illusion

Some people do something like:
  1. Run a t-test on all the genes of the array
  2. Cluster subjects using as variables only those genes that have an (unadjusted) p-value less than a given threshold (e.g., 0.05)
  3. Bingo, now you see two perfectly separated groups in your cluster
The "great clusters" are an illusion. It is trivial to get great results using completely random data; R code to show this is provided in the example files for the R course taught at CNIO. And the statistical explanation is also trivial. In summary, this procedure shows nothing of relevance (except how simple it is to capitalize on chance and obtain apparently great results).
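The following is not the R code from the course; it is a minimal stand-alone sketch in Python of the same idea, under our own assumptions: 2000 pure-noise "genes", 20 "subjects" arbitrarily split into two groups of 10, an equal-variance t-test with the tabulated 5% critical value for 18 degrees of freedom, and, in place of a full hierarchical clustering, a simple separation ratio (mean squared Euclidean distance between groups divided by the same within groups). All names (t_statistic, separation, T_CRIT) are ours.

```python
import random
from statistics import mean, variance

N_PER_GROUP = 10
T_CRIT = 2.101  # two-sided 5% critical value of Student's t, 18 df


def t_statistic(x, y):
    """Equal-variance two-sample t statistic for two groups of equal size."""
    n = len(x)
    pooled_var = (variance(x) + variance(y)) / 2
    return (mean(x) - mean(y)) / (pooled_var * 2 / n) ** 0.5


def separation(genes):
    """Mean squared distance between groups over the same within groups."""
    n = N_PER_GROUP

    def sq_dist(i, j):
        return sum((g[i] - g[j]) ** 2 for g in genes)

    between = [sq_dist(i, j) for i in range(n) for j in range(n, 2 * n)]
    within = [sq_dist(i, j)
              for i in range(2 * n) for j in range(i + 1, 2 * n)
              if (i < n) == (j < n)]
    return mean(between) / mean(within)


rng = random.Random(2004)
# Pure noise: each gene is a list of 20 values; the first 10 subjects are
# arbitrarily called group A and the last 10 group B.
genes = [[rng.gauss(0, 1) for _ in range(2 * N_PER_GROUP)]
         for _ in range(2000)]

# Step 1 and 2 of the recipe: keep only genes with unadjusted p < 0.05.
selected = [g for g in genes
            if abs(t_statistic(g[:N_PER_GROUP], g[N_PER_GROUP:])) > T_CRIT]

ratio_all = separation(genes)
ratio_selected = separation(selected)
print(len(selected), "genes selected")
print("separation using all genes:     ", round(ratio_all, 2))
print("separation using selected genes:", round(ratio_selected, 2))
```

A ratio close to 1 means the two groups are no further apart than subjects within a group, which is what you get with all the genes. On the selected genes the ratio is clearly above 1, so any distance-based clustering will tend to show two "groups", even though the data are pure noise and the labels arbitrary.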

4.3 cluster then t-test illusion

It's like the mirror image of the above, but it is still an illusion. The idea is something like this: using a set of samples and genes, you cluster the samples, divide them into two groups using the dendrogram as guidance, and then find the genes (among those originally used for the clustering) that are "significant", using a t-test.

These t-tests, and their p-values, are meaningless. Please don't claim that the genes are "statistically significantly different" between the two groups, because the two groups were defined using the very genes you are then testing for differences…
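A minimal Python sketch of the problem, under our own assumptions: 200 samples and 20 genes of pure noise, a plain 2-means clustering (Lloyd's algorithm, started from the two most distant samples) standing in for cutting a dendrogram, and a pooled-variance t-test with the tabulated 5% critical value for 198 degrees of freedom. The names two_means and pooled_t are ours.

```python
import random
from statistics import mean, variance


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid(points):
    return [mean(col) for col in zip(*points)]


def two_means(samples, max_iter=100):
    """Plain 2-means, started from the two most distant samples."""
    pairs = ((a, b) for i, a in enumerate(samples) for b in samples[i + 1:])
    centers = list(max(pairs, key=lambda ab: sq_dist(*ab)))
    labels = None
    for _ in range(max_iter):
        # Assign every sample to its nearest center (0 or 1).
        new_labels = [int(sq_dist(s, centers[1]) < sq_dist(s, centers[0]))
                      for s in samples]
        if new_labels == labels:
            break
        labels = new_labels
        centers = [centroid([s for s, lab in zip(samples, labels) if lab == k])
                   for k in (0, 1)]
    return labels


def pooled_t(x, y):
    """Equal-variance two-sample t statistic (group sizes may differ)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5


rng = random.Random(2004)
n_samples, n_genes = 200, 20
# Pure noise: there are no real groups and no gene differs between anything.
samples = [[rng.gauss(0, 1) for _ in range(n_genes)]
           for _ in range(n_samples)]
labels = two_means(samples)

T_CRIT = 1.972  # two-sided 5% critical value of Student's t, 198 df
t_values = []
for g in range(n_genes):
    x = [s[g] for s, lab in zip(samples, labels) if lab == 0]
    y = [s[g] for s, lab in zip(samples, labels) if lab == 1]
    t_values.append(pooled_t(x, y))

n_significant = sum(abs(t) > T_CRIT for t in t_values)
print(n_significant, "of", n_genes, "noise genes look 'significant'")
```

If the test were valid, you would expect about 1 gene in 20 to pass the 5% threshold by chance. Because the groups were built from these same genes, many more pass, even though there is no structure at all in the data.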

Sure, you have a question here about which genes are important for the clustering, but this is not the way to approach the problem. There are some references on variable selection for clustering. In the OVW page you can find a paper with references to the literature and software. You might also want to consider biclustering (see also this question).


4.4 I suspect I have several groups, but I don't know which ones, nor which genes are relevant

A common question. Techniques such as clustering on subsets of attributes could be relevant here. You might want to check the Clustering on Subsets of Attributes page, and the Plaid model page. Other references include Tanay et al. (see also the Samba program) and references therein. The Plaid model and the COSA approach are interesting to us; download (beware 7.7 MB!) a talk I gave at the Systems Biology seminar about these issues.

4.5 Some comments about p-values

P-values are often misinterpreted. The following comes from Pomelo's help:

(Recall that a p-value is the probability, under the null hypothesis [in our case, the "natural" null would be that there are no differences between the two classes in the level of gene expression], of obtaining a value of the test statistic as extreme as, or more extreme than, the one observed in the sample. Small p-values provide evidence against the null hypothesis, and in the "Fisherian tradition" of p-values as strength of evidence against the null a p-value between 0.05 and 0.01 is considered some evidence against the null, a value between 0.01 and 0.001 is usually considered strong evidence against the null, and a value less than 0.001 is usually considered very strong evidence against the null. Note, however, that p-values can sometimes be a tricky business, and there is quite a bit of misunderstanding about what they really mean (e.g., misinterpreting the Fisherian approach to p-values as evidence as a frequentist statement, or the importance of tail behavior, or trying to give a Bayesian-like "probability of the null" interpretation); a nice review of some of these issues can be found in a paper by J. Berger.)
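To make the definition concrete, here is a minimal Python sketch that estimates a p-value by permutation: it shuffles the class labels many times and counts how often the absolute difference in means is as extreme as, or more extreme than, the observed one. This is only an illustration of the definition; the function name permutation_p_value, the choice of test statistic, and the expression values are made up for the example and are not taken from Pomelo.

```python
import random
from statistics import mean


def permutation_p_value(x, y, n_perm=10000, seed=1):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        # Under the null the class labels are exchangeable, so a random
        # relabelling gives a draw of the statistic under the null.
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if stat >= observed:
            hits += 1
    # Count the observed labelling too, so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Made-up expression values for one gene in two classes.
class_a = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9]
class_b = [5.9, 6.1, 5.7, 6.4, 5.8, 6.0]
p = permutation_p_value(class_a, class_b)
print("estimated p-value:", p)
```

Note what the number is: the proportion of relabellings, under the null, that give a statistic at least as extreme as the observed one. It is not the probability that the null hypothesis is true.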

And it also gets complicated, because our null often includes more than we would like it to include. The following two links are from a recent exchange at the Bioconductor email list:

4.6 Pooling RNA

Some references include Peng et al., 2003, and Kendziorski et al., 2003, Biostatistics, 4: 465-477. A quick summary of some issues can be found in these two email messages from the Bioconductor email list:

4.7 Dye-swaps or always controls with same dye?

In many experiments we have to decide whether to use dye-swap (i.e., each treatment gets Cy3 and Cy5, on different slides, of course) or to always have the control with one dye (e.g., Cy3) and the experimentals with the other. Some papers on experimental design for arrays deal with these issues.

Two excellent papers, very well written and easy to read, are:

Slightly more technical are:

And from the Bioconductor list:

4.8 Other lists with prejudices

Terry Speed's group's "Hints and Prejudices".

4.9 On consulting a statistician before or after the experiment

Or why you should ask a statistician BEFORE you start your experiment, in the words of R. A. Fisher, in his address to the First Indian Statistical Congress, in 1938:

"To consult a statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of."


Frequently Asked Statistical Questions at CNIO, Version 0.2, 2004-01-26

Ramón Díaz-Uriarte