Statistical Consulting at CNIO: Usage rules

We are very happy to collaborate in the analysis of data, and we think it can be an enriching experience for all involved. However, to help in the process, maximize its efficiency, rationalize our workload, and minimize misunderstandings, we have prepared these guidelines. We would appreciate it if you could read them before coming to us.

Index

  1. (Our) General philosophy about statistical consulting
  2. Objectives of the study
  3. General procedure (and scheduling) when getting help with the statistical analyses
  4. "Homework" you should do before coming
  5. Data analysis: requirements
    1. Design of experiments/studies
    2. Prenormalized data
    3. File format
    4. Sending the data
  6. Data analysis: general guidelines
  7. Software
  8. Some references

(Our) General philosophy about statistical consulting

A lot has been written about statistical consulting, and we have read a little bit of it. Just judging by the number of pages this issue still generates (e.g., there is a publication, The Statistical Consultant, devoted solely to this topic), this is a touchy and difficult subject. So please bear with us for a few more paragraphs.

Simplifying a lot, we can think of three different styles of statistical consulting at CNIO.

  1. A quick telephone question.
  2. Some very specific analyses, often with little background information. Things such as "do analyses x, y, and then do z". For instance, treating us as data analysts with the specific task of prefiltering a set of genes and then fitting a bunch of Cox models to those with p < 0.05. Period. No questions asked.
  3. Analyses where we get involved to try and solve a scientific question. Sometimes, after a short meeting, we (i.e., we and you) find out that a t-test is all we need; at other times, we will need a more complex set of analyses that will require us to do some programming, ad-hoc modification of existing procedures, or even development of new approaches; and at other times, something in between the two extremes.
Comments about the above three styles:
  1. This is a deceptively "safe" type of problem. A question such as "should I use a tab or a space when I export the file" is reasonably safe. But a question such as "Is a Cox model appropriate for survival data?" probably signals trouble: we will answer the question, but the fact that the question is being asked indicates a lack of familiarity with survival analysis, and probably predicts that no model assessment will be performed, regardless of admonishments.
  2. This is the most unsatisfactory type of problem for us: we run the risk of being regarded as "data blessers" (in Hunter's terminology), something like the person that ought to provide the "nihil obstat" so that reviewers will shut up, or "shoe clerks" (in Bross's terminology), like the guy who will provide you with the analyses you demand, few questions asked. In these situations we have little control over what we are doing and little hope of being able to do a sensible statistical job for the complete scientific enterprise. We try to avoid these types of situations whenever possible. We will politely ask you to do these analyses yourself.
  3. This is what we (and virtually all the statisticians we know) prefer. Remember that besides doing (or telling how to do) test x or y, statisticians can often be of help when trying to clarify the study objectives or the hypotheses being considered (see also Objectives of the study). Ideally, in this setting we see ourselves involved in a collaborative enterprise, rather than in a "client-consultant" relationship. We think we all gain from this relationship: you get the best of us, we all enjoy things a lot more, we can end up with new research projects, etc. Of course, the size and magnitude of the project varies, from a quick thing that gets settled in a few hours, to a long-term relationship that spans several papers.
A few side notes:

Objectives of the study

We do not perform analyses before the objectives are spelled out: we require that we all know what you want to do with your data. In other words, that you know what question you are asking. If not, then we will need a first meeting to thoroughly discuss your objectives, hypotheses under consideration, etc. We are glad to help with this step. Please, be ready to provide us with background information, relevant biological details, and to endure thousands of apparently trivial (and silly) questions about procedures, design, previous results, etc. Before helping, we must be sure we understand the problem.

Some studies are clearly testing a very specific hypothesis, other studies are more exploratory, searching, for example, for candidate genes. But, in all cases, there is an explicit question behind the study. We do not like to get involved in studies that try to torture the data till they confess something, whatever that might be.

At the beginning of the project, we will ask you to provide a written description of the objectives of the project. This description will be filed together with the rest of the logs and files of the project.

General procedure (and scheduling) when getting help with the statistical analyses

  1. Send us an email explaining what you want (see Objectives of the study). You can write in either Spanish or English.
  2. It is likely that we will need to exchange a few emails, to make sure we understand the problem.
  3. We will then schedule a meeting, to make sure we understand everything.
  4. During or after the meeting, we (bioinfo) will write up some notes as a final summary of that meeting, with the basic points of what needs to be analyzed and how.
  5. If you give us the OK, then your project will be added to the "Waiting list", at bioinfo-stats-consult-list.
  6. Scheduling: We generally attend to projects on a first-come, first-served basis. Additionally, other things being equal, projects in areas we are more familiar with receive higher priority, since they get completed faster. An exception to these rules might be small analyses that only require a few hours of work. However, to prevent unpleasant situations, schedule things well in advance. We will most likely be unable to help you if you "need the analyses for this Friday". Please, don't ask us to give more priority to your project than to someone else's project: this would be unfair (of course, you can, and perhaps should, negotiate with someone ahead of you on the waiting list if your project is very urgent).
  7. More on scheduling: we do not like to work on more than 3 to 5 projects at the same time. So any other requests will go directly to the waiting list. Please remember that, besides consulting, we are also supposed to provide internal support to the Bioinformatics Unit, do software development of Bioinformatics applications, and do our own research.

"Homework" you should do before coming

  1. Be ready to provide us (via email) with a written description of your project, your general (biological) objectives, and what questions you want to answer with the analyses. (See also Objectives of the study.)
  2. If your question involves any of our tools, make sure you have carefully read the documentation.
  3. Look at the CNIO-stats-FAQ. Some questions or issues might have been dealt with before.
  4. We should not have to deal with file formatting problems: it is your responsibility to ensure that your data files are formatted correctly, and we will politely let you know if they are not. (Of course, our programs do have bugs that can affect data input, and we love to hear about these; but a bug and a documented feature are two very different things.)

Data analysis: requirements

  1. Design of experiments/studies

    It is much, much better if you ask us before you start your study. In particular, you will probably want to discuss issues related to study design, variables to record, type of microarray design (controls, dye-swaps, pooling, etc.), and so on. Otherwise, we might not be able to help you at all with your data: it might be impossible to analyze data from poorly designed studies. We want to emphasize this: many studies are a waste of time, money, and effort because they do not allow for sensible statistical analyses.

    (This is from R. A. Fisher himself, in an address to the First Indian Statistical Congress, in 1938: "To consult a statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.")

    We are very happy to discuss with you alternative designs for your prospective studies.

  2. Prenormalized data

    We expect the cDNA microarray data to have been properly normalized. Unless you can justify otherwise, we require the data to have been normalized using print-tip loess. You can use Bioconductor or similar tools, or you can use our dnmad tool. Please, remember not to normalize the data with the GenePix default: save the raw GPR file without normalization.
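
    To make the idea behind print-tip loess concrete, here is a minimal Python sketch. It is only an illustration and not the procedure used by dnmad or Bioconductor: it assumes the usual definitions M = log2(R/G) and A = (log2 R + log2 G)/2, and it replaces a full loess fit with a plain tricube-weighted local linear regression (no robustness iterations). The function names, the input layout (print_tip, A, M) and the span of 0.4 are all assumptions made for the example.

```python
import math
from collections import defaultdict


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    log_red, log_green = math.log2(red), math.log2(green)
    return log_red - log_green, 0.5 * (log_red + log_green)


def local_linear_fit(xs, ys, x0, span=0.4):
    """Tricube-weighted local linear fit of ys on xs, evaluated at x0.

    Simplified stand-in for loess: degree 1, no robustness iterations.
    """
    k = min(len(xs), max(2, math.ceil(span * len(xs))))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        dx = x - x0  # centre on x0, so the intercept is the fitted value
        u = abs(dx) / h
        if u < 1:
            w = (1 - u ** 3) ** 3
            sw += w
            swx += w * dx
            swy += w * y
            swxx += w * dx * dx
            swxy += w * dx * y
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        return swy / sw  # too few distinct points: weighted mean
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw


def print_tip_normalize(spots, span=0.4):
    """Subtract a per-print-tip smooth trend of M on A.

    `spots` is a list of (print_tip, A, M); returns normalized M values
    in the same order as the input.
    """
    groups = defaultdict(list)
    for index, (tip, a, m) in enumerate(spots):
        groups[tip].append((index, a, m))
    normalized = [None] * len(spots)
    for members in groups.values():
        a_values = [a for _, a, _ in members]
        m_values = [m for _, _, m in members]
        for index, a, m in members:
            fitted = local_linear_fit(a_values, m_values, a, span)
            normalized[index] = m - fitted
    return normalized
```

    The point of the sketch is that the trend is fitted separately within each print-tip group, so that intensity-dependent dye bias specific to a print tip is removed from the log-ratios. For real data, use the tools mentioned above.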

  3. File format

    The data should be given to us as plain ASCII files, with columns separated by tabs. Do not send us Excel files. If you use Excel and then export as ASCII, please ensure that all rows have the same number of columns.
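
    As a quick self-check before sending a file, something like the following Python sketch can be used. It is not one of our tools; the function name and the example data are hypothetical. It only verifies the two requirements stated above: every row has the same number of tab-separated columns as the first row, and the file contains only ASCII characters.

```python
import io


def check_tab_file(lines):
    """Return a list of problems found in tab-separated text lines.

    `lines` is any iterable of strings, for example an open text file.
    The first line sets the expected number of columns.
    """
    problems = []
    expected = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.isascii():
            problems.append(f"line {number}: non-ASCII characters")
        n_columns = len(line.split("\t"))
        if expected is None:
            expected = n_columns
        elif n_columns != expected:
            problems.append(
                f"line {number}: {n_columns} columns, expected {expected}"
            )
    return problems


# Hypothetical example: the last row is missing one value.
example = io.StringIO("id\tarray1\tarray2\ngeneA\t0.5\t-1.2\ngeneB\t0.1\n")
print(check_tab_file(example))
```

    An empty list means no problems were found. Note that a blank line (which Excel exports sometimes leave at the end of a file) counts as a row with a single column and will be reported.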

    Besides the microarray data, you will often have additional phenotypic data. Please, send this as a text file (with tabs), where there is one row per subject (or array, or experimental unit) and one column for each variable. For instance, if you have five subjects, and there are three variables (age, sex, hospital), we expect to receive a matrix of five rows and three columns, with an additional first column for the subject ids and a first row with the column names. (Yes, we are aware this format is the transpose of the Pomelo format, for example.)
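
    The five-subject example above could be produced, for instance, with the following Python sketch. The subject ids and values are made up for illustration, and the function name is hypothetical; only the layout (a header row, then one tab-separated row per subject) follows the description above.

```python
import csv
import io

# Made-up values, for illustration only.
header = ["id", "age", "sex", "hospital"]
subjects = [
    ["s1", 54, "F", "A"],
    ["s2", 61, "M", "A"],
    ["s3", 47, "F", "B"],
    ["s4", 70, "M", "B"],
    ["s5", 58, "F", "C"],
]


def write_phenotype_table(handle, header, rows):
    """Write a header row plus one tab-separated row per subject."""
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row {row!r} does not match the header")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Write to an in-memory buffer here; use open("pheno.txt", "w", newline="")
# (a hypothetical file name) to write an actual file.
buffer = io.StringIO()
write_phenotype_table(buffer, header, subjects)
print(buffer.getvalue())
```

    The result has six lines (the column names plus five subjects), each with four tab-separated fields.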

  4. Sending the data

    You can provide us the data using:

    Please, do not leave the data in the "network drives" (it is very hard for us to access them from GNU/Linux) nor bring us ZIP drives (most of our machines do not have ZIP drives).

Data analysis: general guidelines

Software

We cannot provide support of any kind for software we do not develop. We can, occasionally, help you with software that we are very familiar with (e.g., R, Bioconductor), but this should be the exception rather than the rule; if you think you might want to use R or Bioconductor, come to our courses. Therefore, we do not support programs such as Excel, SPSS, SAS, Acuity, GeneCluster, etc.

Some references

Bross, I.D.J. 1974. The Role of the Statistician: Scientist or Shoe Clerk. The American Statistician, 28: 126--127.

Browne, R. 1996. Tips for Beginning Consultants. The Statistical Consultant, 13 (1): 8--10.

Finch, H. 1999. Client Expectations in University Statistical Consulting Lab. The Statistical Consultant, 16 (3): 5--9.

Finch, H. 1999. Client Perceived Pitfalls in Statistical Consulting: An Ethnographic Study. The Statistical Consultant, 18 (1): 9--11.

Hunter, W.G. 1981. The Practice of Statistics: the Real World is an Idea Whose Time Has Come. The American Statistician, 35: 72--76.

Ittenbach, R.F., Tsai, Y.-J., and Billingsley, C. 1996. Consultation in the Social Sciences: An Integrated Model for Training and Service. The Statistical Consultant, 13 (3): 2--5.

Kirk, R.E. 1991. Statistical Consulting in a University --dealing with people and other challenges. The American Statistician, 45: 28--34.

Mann, B.L., Quinn, L., Boardman, T., Bishop, T., and Gaydos, B. 1999. What my Mother Never Told Me: Learning the Hard Way. The Statistical Consultant, 16 (3): 2--5.

Strickland, H. 1996. The Nature of Statistical Consulting. The Statistical Consultant, 13 (2): 2--5.

Tweedie, R. and Taylor, S. 1998. Consulting: Real Problems, Real Interactions, Real Outcomes. Statistical Science, 13: 1--3.

Young, S.S. 2001. Industry/Academic Statistics Collaborations. The Statistical Consultant, 18 (1): 2--6.


Ramón Díaz-Uriarte