- Introduction
- Getting statistical help
- Software
- Miscellaneous statistics questions (these are the ones most likely to interest you).

This document contains answers to some questions commonly asked about statistical analyses at CNIO. The most interesting part is the last one, where we collect miscellaneous questions and answers about common statistical issues at CNIO. Before asking, please check whether your question has already been answered here. This page is eternally under construction.

- Software from the Bioinformatics Unit
- Other software for microarray data analysis
- General purpose statistical software
- Help with statistical software

Lots; a search in Google will be overwhelming. Our favourite is, first and foremost, the Bioconductor set of packages. Bioconductor runs under R and most of these packages are command-driven; thus, mouse-clicking will not take you very far. Therefore, Bioconductor tends to be used by people who think the learning effort is worth it (we do think it is worth it!).

Another useful program, which runs under Windows and might be a little more user-friendly, is BRB Array Tools.

Lots too. Of course, you have the usual commercial options (SPSS, SAS, Statistica, etc.). But you might want to give free software a try. Our favourite is, without doubt, R; we have used it for quite some time now, use it for virtually all of our statistical analyses and most of our programming, and have taught some courses. Learning R takes some time. There are some GUIs available; we don't use them much, but from our limited experience with R GUIs the one we like best is Rcmdr, by John Fox. Rcmdr runs with a very similar "look-and-feel" in GNU/Linux and Windows; we can help you get it up and running.

Another free statistical system with a GUI is Arc, which is built upon Xlisp-Stat. Arc is particularly nice for regression diagnostics. It also runs under both GNU/Linux and Windows.

- Multiple testing: do I need to worry about it?
- t-test then cluster illusion
- cluster then t-test illusion
- I suspect I have several groups, but I don't know which nor which are the relevant genes
- Some comments about p-values
- Pooling RNA
- Dye-swaps or always controls with same dye?
- Other lists with prejudices
- On consulting a statistician before or after the experiment

- Run a t-test on all the genes of the array
- Cluster subjects using as variables only those genes that have an (unadjusted) p-value less than a given threshold (e.g., 0.05)
- Bingo, now you see two perfectly separated groups in your cluster

These t-tests, and their p-values, are meaningless. Please don't claim that the genes are "statistically significantly different" between the two groups, because the two groups were defined using the very genes you are then testing for differences…

Sure, you have a question here about which genes are important for the clustering, but this is not the way to approach the problem. There are some references on variable selection for clustering. In the OVW page you can find a paper with references to the literature and software. You might also want to consider biclustering (see also this question).
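To see why selecting genes by t-test and then clustering manufactures groups, a small simulation on pure noise can help. The following is only a sketch under stated assumptions: the array size, the group sizes, and the helper names (`t_statistic`, `separation`) are hypothetical, and a ratio of between-group to within-group distances stands in for an actual clustering algorithm. No gene truly differs between the groups, yet the genes that pass the t-test filter make the groups look well separated.

```python
import math
import random
import statistics

def t_statistic(x, y):
    """Two-sample Student t statistic with pooled variance (equal group sizes assumed)."""
    n = len(x)
    pooled_var = (statistics.variance(x) + statistics.variance(y)) / 2
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var * 2 / n)

def mean_sq_dist(data, genes, pairs):
    """Mean squared Euclidean distance over subject pairs, using only `genes`."""
    total = 0.0
    for i, j in pairs:
        total += sum((data[g][i] - data[g][j]) ** 2 for g in genes)
    return total / len(pairs)

def separation(data, genes, labels):
    """Between-group / within-group mean squared distance; about 1 means no structure."""
    n = len(labels)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    within = [(i, j) for i, j in all_pairs if labels[i] == labels[j]]
    between = [(i, j) for i, j in all_pairs if labels[i] != labels[j]]
    return mean_sq_dist(data, genes, between) / mean_sq_dist(data, genes, within)

random.seed(1)
N_GENES, N_PER_GROUP = 2000, 10   # hypothetical sizes, chosen for illustration
T_CRIT = 2.101                    # two-sided 0.05 critical value of t with 18 df
labels = [0] * N_PER_GROUP + [1] * N_PER_GROUP

# Pure noise: rows are genes, columns are subjects; no gene differs between groups.
data = [[random.gauss(0, 1) for _ in labels] for _ in range(N_GENES)]

# Step 1 of the illusion: keep genes with unadjusted p < 0.05.
selected = [g for g in range(N_GENES)
            if abs(t_statistic(data[g][:N_PER_GROUP], data[g][N_PER_GROUP:])) > T_CRIT]

# Step 2: compare how separated the groups look with all genes vs. selected genes.
sep_all = separation(data, range(N_GENES), labels)
sep_selected = separation(data, selected, labels)
print(f"genes selected: {len(selected)} of {N_GENES}")
print(f"separation, all genes:      {sep_all:.2f}")
print(f"separation, selected genes: {sep_selected:.2f}")
```

With all genes the ratio should stay close to 1, as expected for noise. With the selected genes it is necessarily well above 1, because the filter keeps exactly those genes whose group means happen to differ.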

P-values are often misinterpreted. This comes from the help of Pomelo:

(Recall that a p-value is the probability, under the null hypothesis [in our case, the "natural" null would be that there are no differences between the two classes in the level of gene expression] of obtaining a value of the test statistic as extreme as, or more extreme than, the one observed in the sample. Small p-values provide evidence against the null hypothesis, and in the "Fisherian tradition" of p-values as strength of evidence against the null a p-value between 0.05 and 0.01 is considered some evidence against the null, a value between 0.01 and 0.001 is usually considered strong evidence against the null, and a value less than 0.001 is usually considered very strong evidence against the null. Note, however, that p-values can sometimes be a tricky business, and there is quite a bit of misunderstanding about what they really mean (e.g., misinterpreting the Fisherian approach to p-values as evidence as a frequentist statement, or the importance of tail behavior, or trying to give a Bayesian-like "probability of the null" interpretation); a nice review of some of these issues can be found in a paper by J. Berger.)
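The definition above can be made concrete with an exact permutation test on a tiny, made-up data set. This is only an illustrative sketch: the function name `exact_permutation_p_value` and the expression values are hypothetical, and it is not meant to describe how Pomelo computes its p-values. Under the null of no difference between classes, every relabeling of the samples is equally likely, so the p-value is the fraction of relabelings whose statistic is as extreme as, or more extreme than, the observed one.

```python
from itertools import combinations

def exact_permutation_p_value(a, b):
    """Two-sided exact permutation p-value for the difference in group means."""
    pooled = a + b
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)

    extreme = 0
    count = 0
    # Enumerate every way of choosing which samples get the first label.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point ties still count as "as extreme as".
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count

# Hypothetical expression values for one gene in two classes of four samples.
class_1 = [12, 15, 14, 16]
class_2 = [9, 10, 11, 13]
p = exact_permutation_p_value(class_1, class_2)
print(f"p-value: {p:.4f}")
```

Here 4 of the 70 possible relabelings give a difference at least as large as the observed one, so the p-value is 4/70, about 0.057. Full enumeration is only feasible for small groups; with larger samples one would typically draw random relabelings instead.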

And it also gets complicated, because our null often includes more than we would like it to include. The following two links are from a recent exchange at the Bioconductor email list:

- Newton's and Irizarry's
- Gordon Smyth's p-value questioning (see last part of message)

Some references include Peng et al., 2003, and Kendziorski et al., 2003, Biostatistics, 4: 465-477. A quick summary of some issues can be found in these two email messages from the Bioconductor email list:

In many experiments we have to decide whether to use dye-swap (i.e., each treatment gets Cy3 and Cy5, in different slides, of course) or to always have the control with one dye (e.g., Cy3) and the experimentals with the other. Some papers on experimental design for arrays deal with these issues.
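One reason the choice matters can be shown with a toy calculation. The sketch below assumes a simple additive model on the log2 scale, in which each gene has its own dye bias; the model and all numbers are hypothetical and are not taken from the papers referred to here. Under that assumption, averaging a slide with its dye-swapped counterpart cancels the dye bias, whereas always labelling the control with the same dye leaves the bias mixed with the treatment effect.

```python
# Hypothetical values for one gene, on the log2 scale (all numbers are made up).
true_effect = 1.0   # log2(treatment / control)
dye_bias = 0.5      # assumed gene-specific tendency to look brighter in Cy5

# Slide 1: treatment labelled with Cy5, control with Cy3.
# Observed log2(Cy5 / Cy3) contains the effect plus the dye bias.
m_forward = true_effect + dye_bias

# Slide 2 (dye-swap): control labelled with Cy5, treatment with Cy3.
# The effect changes sign, the dye bias does not.
m_swapped = -true_effect + dye_bias

# Dye-swap estimate: half the difference of the two slides; the bias cancels.
swap_estimate = (m_forward - m_swapped) / 2

# Same-dye design: every slide looks like slide 1, so the bias never cancels.
same_dye_estimate = m_forward

print(f"dye-swap estimate: {swap_estimate}")
print(f"same-dye estimate: {same_dye_estimate}")
```

Whether this matters in practice depends on how large the gene-specific dye bias is and on the comparisons of interest, which is what the experimental design literature discusses.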

Two excellent papers, very well written and easy to read, are:

Slightly more technical are:

And from the Bioconductor list:

Or why you should ask a statistician BEFORE you start your experiment, in the words of R. A. Fisher, in his address to the First Indian Statistical Congress, in 1938:

"To consult a statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of."