(This is a very old post and I am no longer at CNIO. But it might still be useful.)
Statistical Consulting at CNIO: Usage rules
We are very happy to collaborate in the analysis of data, and we think it can be an enriching experience for all involved. However, to help in the process, maximize its efficiency, rationalize our workload, and minimize misunderstandings, we have prepared these guidelines. We would appreciate it if you could read them before coming to us.
Index
- (Our) General philosophy about statistical consulting
- Objectives of the study
- General procedure (and scheduling) when getting help with the statistical analyses
- “Homework” you should do before coming
- Data analysis: requirements
- Data analysis: general guidelines
- Software
- Some references
(Our) General philosophy about statistical consulting
A lot has been written about statistical consulting, and we have read a little bit of it. Just judging by the number of pages this issue still generates (e.g., there is a publication, The Statistical Consultant, devoted solely to this topic), this is a touchy and difficult subject. So please bear with us for a few more paragraphs.
Simplifying a lot, we can think of three different styles of consulting at CNIO:
- A quick telephone question.
- Some very specific analyses, often with little background information. Things such as "do analyses x, y, and then do z". For instance, considering us as data analysts with the specific task of prefiltering a set of genes followed by a bunch of Cox models on those with p < 0.05. Period. No questions asked.
- Analyses where we get involved to try and solve a scientific question. Sometimes, after a short meeting, we (i.e., we and you) find out that a t-test is all we need; at other times, we will need a more complex set of analyses that will require us to do some programming, ad-hoc modification of existing procedures, or even development of new approaches; and at other times, something in between the two extremes.
Comments about the above three styles:
- The quick telephone question: this is a deceptively "safe" type of problem. A question such as "should I use a tab or a space when I export the file" is reasonably safe. But a question such as "Is a Cox model appropriate for survival data?" probably signals trouble: we will answer the question, but the fact that the question is being asked indicates lack of familiarity with survival analysis, and probably predicts that no model assessment will be performed, regardless of admonishments.
- The very specific analyses: this is the most unsatisfactory type of problem for us. We run the risk of being regarded as "data blessers" (in Hunter's terminology), something like the person that ought to provide the "nihil obstat" so that reviewers will shut up, or "shoe clerks" (in Bross's terminology), like the guy who will provide you with the analyses you demand, few questions asked. In these situations we have little control over what we are doing and little hope of being able to do a sensible statistical job for the complete scientific enterprise. We try to avoid these types of situations whenever possible. We will politely ask you to do these analyses yourself.
- The analyses where we get involved in the scientific question: this is what we (and virtually all the statisticians we know) prefer. Remember that besides doing (or telling how to do) test x or y, statisticians can often be of help when trying to clarify the study objectives or the hypotheses being considered (see also Objectives of the study). Ideally, in this setting we see ourselves involved in a collaborative enterprise, rather than in a "client-consultant" relationship. We think we all gain from this relationship: you get the best of us, we all enjoy things a lot more, we can end up with new research projects, etc. Of course, the size and magnitude of the project varies, from a quick thing that gets settled in a few hours, to a long-term relation that spans several papers.
A few side notes:
- Some problems can’t be solved, some questions can’t be answered (cited in Mann et al., 1999, The Statistical Consultant). These are problems that might be overcome (e.g., different experimental design), or might not. Please, it does not help to try to get us to solve a problem that can’t be solved.
- It is very disconcerting and disheartening to get involved in a project only to find out that, after some weeks of work, the collaborator uses those analyses on pages 4 and 8, discards the rest, and replaces the analyses on pages 1 and 3 with this method that she just read about in a paper. This is not very polite, and suddenly turns us into “statistical shoe clerks”. Moreover, this ignores the fact that, when we get involved in a scientific inquiry process, we try to provide a (complete) answer that will make sense; this often means that the sequence of analyses has some internal logic; taking pieces, as if they were disconnected screws and bolts, will most likely yield a final product that does not make sense. This of course affects our attitude towards coauthorship.
Objectives of the study
We do not perform analyses before the objectives are spelled out: we require that we all know what you want to do with your data. In other words, that you know what question you are asking. If not, then we will need a first meeting to thoroughly discuss your objectives, hypotheses under consideration, etc. We are glad to help with this step. Please, be ready to provide us with background information, relevant biological details, and to endure thousands of apparently trivial (and silly) questions about procedures, design, previous results, etc. Before helping, we must be sure we understand the problem.
Some studies are clearly testing a very specific hypothesis, other studies are more exploratory, searching, for example, for candidate genes. But, in all cases, there is an explicit question behind the study. We do not like to get involved in studies that try to torture the data till they confess something, whatever that might be.
At the beginning of the project, we will ask you to provide a written description of the objectives of the project. This description will be filed together with the rest of the logs and files of the project.
General procedure (and scheduling) when getting help with the statistical analyses
- Send us an email explaining what you want (see Objectives of the study). You can write in either Spanish or English.
- It is likely that we will need to exchange a few emails, to make sure we understand the problem.
- We will then schedule a meeting, to make sure we understand everything.
- After or during the meeting, we (bioinfo) will take some notes as a final summary of that meeting, covering the basic points of what needs to be analyzed and how.
- If you give us the OK, then your project will be added to the “Waiting list”, at bioinfo-stats-consult-list.
- Scheduling: We generally attend to projects on a first-come, first-served basis. Additionally, other things being equal, projects in areas we are more familiar with receive higher priority, since they get completed faster. An exception to these rules might be small analyses that only require a few hours of work. However, to prevent unpleasant situations, schedule things with enough time. We will most likely be unable to help you if you “need the analyses for this Friday”. Please, don’t ask us to give more priority to your project than to someone else’s project: this would be unfair (of course, you can, and perhaps should, negotiate with someone ahead of you on the waiting list if your project is very urgent).
- More on scheduling: we do not like to work on more than 3 to 5 projects at the same time. So any other requests will go directly to the waiting list. Please remember that, besides consulting, we are also supposed to provide internal support to the Bioinformatics Unit, do software development of Bioinformatics applications, and do our own research.
“Homework” you should do before coming
- Be ready to provide us (via email) with a written description of your project, your general (biological) objectives, and what questions you want to answer with the analyses. (See also Objectives of the study.)
- If your question involves any of our tools, make sure you have carefully read the documentation.
- Look at the CNIO-stats-FAQ. Some questions or issues might have been dealt with before.
- We should not have to deal with file formatting problems: it is your responsibility to ensure that your data files are formatted correctly, and we will politely let you know if they are not. (Of course, our programs do have bugs that can affect data input, and we love to hear about these; but a bug and a documented feature are two very different things.)
Data analysis: requirements
- Design of experiments/studies
It is much, much better if you ask us before you start your study. In particular, you will probably want to discuss issues related to study design, variables to record, type of microarray design (controls, dye-swaps, pooling, etc), etc. Otherwise, we might not be able to help you at all with your data: it might be impossible to analyze data from poorly designed studies. We want to emphasize this: many studies are wasted time, money, and effort because they do not allow for sensible statistical analyses.
(This is from R. A. Fisher himself, in an address to the First Indian Statistical Congress, in 1938: "To consult a statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.")
We are very happy to discuss with you alternative designs for your prospective studies.
- Prenormalized data
We expect the cDNA microarray data to have been properly normalized. Unless you can justify otherwise, we require the data to have been normalized using print-tip loess. You can use Bioconductor or similar tools, or you can use our dnmad tool. Please, remember not to normalize the data with the GenePix default: save the raw GPR file without normalization.
- File format
The data should be given to us as plain ASCII files, with columns separated with tabulators (tabs). Do not send us Excel files. If you use Excel and then export as ASCII, please ensure that all rows have the same number of columns.
Besides the microarray data, you will often have additional phenotypic data. Please, send this as a text file (with tabs), where there is one row per subject (or array, or experimental unit) and one column for each variable. For instance, if you have five subjects, and there are three variables (age, sex, hospital), we expect to receive a matrix of five rows and three columns, with an additional first column for the subject ids and a first row with the column names. (Yes, we are aware this format is the transpose of the Pomelo format, for example.)
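As an illustration of the kind of check you can run before sending a file, here is a minimal Python sketch. It is not one of our tools: the function name `check_tab_delimited` and the example subject ids and values are hypothetical, and it only tests the two things asked for above (every row has as many tab-separated columns as the header row, and the subject ids in the first column are not repeated).

```python
import csv
import io


def check_tab_delimited(text):
    """Return a list of problems found in a tab-separated table (empty if none).

    Hypothetical helper: assumes the first row holds the column names and the
    first column holds the subject ids.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]

    problems = []
    n_cols = len(rows[0])
    # Every data row must have the same number of columns as the header row.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != n_cols:
            problems.append(
                f"line {line_no}: {len(row)} columns, expected {n_cols}"
            )

    # Subject ids should identify one row each.
    ids = [row[0] for row in rows[1:] if row]
    if len(ids) != len(set(ids)):
        problems.append("duplicated subject ids in the first column")
    return problems


# Made-up example: two subjects, three variables, plus the id column.
good = "id\tage\tsex\thospital\ns1\t54\tF\tA\ns2\t61\tM\tB\n"
# Same table, but the first subject is missing the hospital column.
bad = "id\tage\tsex\thospital\ns1\t54\tF\ns2\t61\tM\tB\n"

print(check_tab_delimited(good))
print(check_tab_delimited(bad))
```

To check a real file, read it as text (for example with `open(path, newline="").read()`) and pass the contents to the function.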
- Sending the data
You can provide us the data using:
- Email, for small files; if you have large text files, you can often compress them successfully. Please, do not send us any email message that has an attachment larger than 5 MB.
- A CD (we will return it to you).
Please, do not leave the data in the "network drives" (it is very hard for us to access them from GNU/Linux) nor bring us ZIP drives (most of our machines do not have ZIP drives).
Data analysis: general guidelines
- Methods we prefer
We prefer well established, theoretically justified methods, rather than fancy new algorithms that lack statistical justification. We are happy to discuss (and learn!) different methods/approaches, but please, do not try to coax us to use the fancy yy method that such and such just published, if we are telling you that we'd rather use discriminant analysis. Of course, you are free to do the analyses on your own (but we will not do the programming for you).
By now, we expect not to have to convince you that multiple testing must be taken into account when screening large numbers of genes for "significant differences". If you still have doubts, please check the references mentioned in the help of Pomelo.
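To make the point concrete, here is a small Python sketch of one common adjustment, the Benjamini-Hochberg control of the false discovery rate. The choice of this particular method is an assumption made for illustration (the paragraph above does not prescribe one), and the function name and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values are monotone in the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

With thousands of genes, many raw p-values below 0.05 are expected by chance alone; comparing the adjusted values, rather than the raw ones, against a threshold is what takes the number of tests into account.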
We also expect that you are convinced that it is necessary to obtain honest, unbiased, estimates of the performance of any predictor you build. This, of course, involves including gene selection in the cross-validation (if you have used gene selection). We will soon have a tool to help you with this task. In the meantime, you might want to read Ambroise & McLachlan, 2002 (PNAS, 99: 6562–6566) and Simon et al., 2003 (JNCI, 95: 14–18). If we do help you with the building of a predictor, using cross-validation or bootstrapping to obtain estimates of the error rate will be an integral part of our work.
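What "including gene selection in the cross-validation" means can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool mentioned above: the gene ranking (absolute difference of class means), the nearest-centroid classifier, and all function names are choices made for the example, and the folds are not stratified, so both classes (labeled 0 and 1) are assumed to be present in every training set.

```python
import random


def select_genes(X, y, k):
    """Rank genes by absolute difference of class means; return the top k indices."""
    def class_mean(gene, label):
        values = [row[gene] for row, lab in zip(X, y) if lab == label]
        return sum(values) / len(values)

    n_genes = len(X[0])
    scores = [abs(class_mean(g, 0) - class_mean(g, 1)) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: -scores[g])[:k]


def nearest_centroid_predict(X_train, y_train, genes, x_new):
    """Predict the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
        dist = sum((x_new[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k_genes, n_folds, seed=0):
    """Cross-validated error rate with gene selection repeated inside each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # The key step: genes are chosen using the training samples only,
        # so the held-out samples never influence the selection.
        genes = select_genes(X_tr, y_tr, k_genes)
        for i in test_idx:
            prediction = nearest_centroid_predict(X_tr, y_tr, genes, X[i])
            errors += int(prediction != y[i])
    return errors / len(X)


# Pure-noise example: 20 samples, 200 genes, labels unrelated to the data.
rng = random.Random(1)
noise_X = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(20)]
noise_y = [i % 2 for i in range(20)]
print(cv_error(noise_X, noise_y, k_genes=5, n_folds=5))
```

The biased alternative would be to call `select_genes` once on all samples and then cross-validate only the classifier; the papers cited above discuss how optimistic the resulting error estimates can be.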
As we mentioned above, some of our choices, options, beliefs and preconceptions are discussed in the CNIO-stats-FAQ. You probably want to look at it.
Some of our recommendations and preferences change over time, because methodology advances, availability of methods improves, and our own understanding gets better.
If your analyses fall into an area we are not familiar with, they might take longer than usual, and be given lower priority (see also Scheduling).
- Coauthorships
As many analyses are rather involved, we might want to discuss coauthorship. But this also means that we take full responsibility for the use of our analysis (see also the comments about internal logic of the sequence of analyses). Thus, please do not be offended if we ask you to remove our name from the list of coauthors if we disagree with the analyses or their presentation. Likewise, if we think our contribution does not warrant coauthorship we will ask you not to add us to the list of coauthors. We would also appreciate it if you asked for permission before adding our names to the acknowledgements section.
- Outcome of the analyses
Sometimes, there is nothing that can be said from a data set; the data might be too noisy, there might be too many missing data, or the phenomenon you expected might not be present. Please, do not ask us to continue torturing the data until "something" comes out that will allow you to write a paper.
If, after the analyses, you think that the study could be refocused and reanalyzed in a different way, we can do that, and it will be considered a "new submission" (so we go to step 1 of "General procedure (and scheduling) when getting help …".)
Software
We cannot provide support of any kind for software we do not develop. We can, occasionally, help you with software that we are very familiar with (e.g., R, Bioconductor), but this should be the exception rather than the rule; if you think you might want to use R or Bioconductor, come to our courses. Therefore, we do not support programs such as Excel, SPSS, SAS, Acuity, GeneCluster, etc.
Some references
Bross, I.D.J. 1974. The Role of the Statistician: Scientist or Shoe Clerk. The American Statistician, 28: 126–127.
Browne, R. 1996. Tips for Beginning Consultants. The Statistical Consultant, 13 (1): 8–10.
Finch, H. 1999. Client Expectations in University Statistical Consulting Lab. The Statistical Consultant, 16 (3): 5–9.
Finch, H. 1999. Client Perceived Pitfalls in Statistical Consulting: An Ethnographic Study. The Statistical Consultant, 18 (1): 9–11.
Hunter, W.G. 1981. The Practice of Statistics: the Real World is an Idea Whose Time Has Come. The American Statistician, 35: 72–76.
Ittenbach, R.F., Tsai, Y.-J., and Billingsley, C. 1996. Consultation in the Social Sciences: An Integrated Model for Training and Service. The Statistical Consultant, 13 (3): 2–5.
Kirk, R.E. 1991. Statistical Consulting in a University: Dealing with People and Other Challenges. The American Statistician, 45: 28–34.
Mann, B.L., Quinn, L., Boardman, T., Bishop, T., and Gaydos, B. 1999. What my Mother Never Told Me: Learning the Hard Way. The Statistical Consultant, 16 (3): 2–5.
Strickland, H. 1996. The Nature of Statistical Consulting. The Statistical Consultant, 13 (2): 2–5.
Tweedie, R. and Taylor, S. 1998. Consulting: Real Problems, Real Interactions, Real Outcomes. Statistical Science, 13: 1–3.
Young, S.S. 2001. Industry/Academic Statistics Collaborations. The Statistical Consultant, 18 (1): 2–6.