Unless otherwise stated, all the code here is released under the GNU GPL. Please note that THERE IS NO WARRANTY FOR ANY OF THE PROGRAMS. See section 11 of the GPL for further details. (For some applications the R or C/C++ code is not yet available; once the code is properly cleaned up and documented, it will be released under the GPL as R packages or stand-alone C/C++ files.)

Most of this is now available from public repos on my GitHub page.


A Bioconductor package for forward population genetic simulation in asexual populations, with special focus on cancer progression. Fitness can be an arbitrary function of genetic interactions between multiple genes or modules of genes, including epistasis, order restrictions in mutation accumulation (as specified by, say, Oncogenetic Trees or Conjunctive Bayesian Networks), and order effects. Also included are functions for plotting and sampling from single or multiple realizations of the simulations, including whole-tumor and single-cell sampling, as well as functions for plotting the true phylogenetic relationships of the clones. The simulation code (in previous incarnations) has been used, for instance, in the paper "Identifying restrictions in the order of accumulation of mutations during tumor progression: effects of passengers, evolutionary models, and sampling" (BMC Bioinformatics, 2015). Development takes place in its GitHub repo, and the package is also listed in the Genetic Simulation Resources catalogue.


A Bioconductor package for the analysis of big data from aCGH experiments using parallel computing and ff objects. A Bioinformatics paper describing it is available here (pubmed link).

Asterias project

Asterias is a set of web-based applications for the analysis of genomic and proteomic data. Currently, Asterias combines Python with R and C/C++, using MPI for parallelization, and aspires to become a standard for high-performance, distributed, web-based bioinformatics and biostatistics applications.

You can access the applications themselves or visit the development pages at either the Bioinformatics org site or The Launchpad site.


RJaCGH is an R package for the analysis of array CGH data using Hidden Markov Models. We incorporate distance between genes (using a non-homogeneous HMM) and do not fix the number of states in advance, but rather use a fully Bayesian approach based on Reversible Jump MCMC. This package was developed by Oscar Rueda and myself. It is available from CRAN and from the Asterias project site. Our new method is described in this COBRA preprint. (You can also download it from here.) This package is part of Oscar Rueda's PhD thesis ("Statistical methods for the analysis of copy number alterations in the genome"). You can download his thesis here.


ADaCGH is a web tool for the analysis of array CGH to detect gains and losses in genomic DNA. We implement several very different approaches and also call IDClight to display additional gene information. ADaCGH is a web interface made with Python that uses R underneath (with R and C code written by myself and Oscar Rueda Palacio) and uses parallelization to speed up the computations.

Pomelo II

Pomelo II is a major rewrite of our popular Pomelo for finding differentially expressed genes. It uses MPI (my original C++ code has been parallelized by Edward Morrissey) and provides clickable tables and heatmaps (using our IDClight tool) in a much nicer and more configurable interface, written mainly by Edward Morrissey in Python and JavaScript using AJAX.


SignS is a web tool for gene selection and finding molecular signatures when we have patient survival data. We implement two very different methods, and provide additional gene information in clickable tables and dendrograms by calling our IDClight application. SignS is a web interface made with Python that uses R underneath. To greatly speed up the computations, we use MPI (which takes advantage of the 66 CPUs available on our servers).


GeneSrF is a web tool for gene selection in classification problems that uses random forest. Two approaches for gene selection are used: one is targeted towards identifying small, non-redundant sets of genes that have good predictive performance. The second is a more heuristic graphical approach that can be used to identify large sets of genes (including redundant genes) related to the outcome of interest. This is a web interface (using Python) of my varSelRF package.


An R package for variable selection using random forests, targeted towards gene expression data. Details can be found in "Variable selection from random forests: application to gene expression data". Download the source package.


geSignatures is an R package for finding molecular signatures from gene expression data, as described in the technical report Molecular signatures from gene expression data.

Download the source package; it can be installed under GNU/Linux (and other Unixes) with the usual "R CMD INSTALL geSignatures_0.6-5.tar.gz". Download the Windows version; you can use the R menus to install it from a local ZIP package.


Pomelo is a web-based tool that can be used to find differentially expressed genes. It currently implements statistical tests for two-group (via t-tests) and multigroup (via ANOVA) comparisons, regression analysis, survival data (gene-wise Cox model) and contingency tables (using Fisher's exact test). We allow control of the Family Wise Error Rate (using the maxT approach) and the False Discovery Rate.
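As an illustration of what controlling the False Discovery Rate involves, here is a minimal Python sketch of the Benjamini-Hochberg step-up adjustment. This is an assumption made for the example: the text above does not say which FDR procedure Pomelo uses, and this is not Pomelo's code. The function name bh_adjust is hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: BH is used only as a common FDR procedure for illustration;
    it is not taken from Pomelo's source code.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # One p-value per gene; genes with adjusted p <= 0.05 would be reported.
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Genes whose adjusted p-value falls at or below the chosen threshold are the ones declared differentially expressed at that FDR level.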

A few small example files for Pomelo are available in tar.gz format or as a Windows (WinZIP) archive; under Unix or GNU/Linux, unpack them with "tar -zxvf example.files.tar.gz". Course notes.

Download the source code for the statistical tests. This software is released under the GNU GPL. A lot of the code borrows heavily from code in R. See the README file. This compressed file also includes a Windows executable.


FatiGO can be used to examine whether groups of genes are enriched in certain Gene Ontology terms. We use Fisher's exact test for contingency tables, with adjustments for multiple testing.
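To make the test concrete, here is a minimal Python sketch of a one-sided Fisher's exact test on a 2x2 table (genes with or without a given term, in one group versus the other). It is an illustration only, not FatiGO's code (which uses the C implementation from R); the function name and the choice of a one-sided alternative are assumptions made for the example.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a: genes in the first group annotated with the term
    b: genes in the first group without the term
    c: genes in the second group annotated with the term
    d: genes in the second group without the term
    Returns P(X >= a) under the hypergeometric null (over-representation).
    """
    group1 = a + b          # size of the first group (fixed margin)
    with_term = a + c       # genes carrying the term (fixed margin)
    total = a + b + c + d
    upper = min(group1, with_term)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    favourable = sum(
        comb(with_term, k) * comb(total - with_term, group1 - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, group1)


if __name__ == "__main__":
    print(fisher_enrichment_p(3, 1, 1, 3))
```

In practice one such p-value is computed per Gene Ontology term, and the resulting set of p-values is then adjusted for multiple testing.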

Download the source code for the statistical tests underlying FatiGO. This software is released under the GNU GPL. The C code for Fisher's test comes from R. See the README file.


DNMAD stands for diagnosis and normalization of microarray data. It is a web server for the normalization and diagnosis of cDNA microarrays, developed together with Juanma Vaquerizas (jvaquerizas AT cnio DOT es).

You can download the R code. This software is released under the GNU GPL. We make heavy use of limma, a Bioconductor package.

Tnasas: a predictor-building tool

Tnasas, which stands for "this is not a substitute for a statistician", is a tool for building predictors from microarray data. It is useful as a benchmark (it offers several well-tested methods) and as a pedagogical tool (against overoptimism when building predictors and against ignoring several selection biases). Developed with Juanma Vaquerizas (jvaquerizas AT cnio DOT es) at CNIO, using R. The code (R with a tiny bit of C++) will soon be available under the GNU GPL.


PHYLOGR is an R package for the manipulation and analysis of phylogenetically simulated data sets and phylogenetically based analyses using GLS. You can download the source package here or you can get it from CRAN, where you can download both the source and Windows binaries.

This code was developed in collaboration with Ted Garland. We are currently working on a paper about PHYLOGR (abstract).

ape is another R package for phylogenetics and evolution, but there is little overlap between ape and PHYLOGR.


This is a set of programs in RPL (Reverse Polish Lisp) for using the HP 48 calculator as a handheld computer to record behavioral data and to help run a behavioral experiment. Included are some utility functions in C++ for processing and cleaning the output.

I used this code heavily for recording lizard behavior. You can see more details (all the code, documentation, etc) here. This software is released under the GNU GPL.

Genetic algorithms, evolutionarily stable strategies, and the loser/winner effects

I spent some time working on the loser/winner effects. This is a problem that requires game theory (what counts as a good strategy depends on what your opponents do), but I could not find simple analytical solutions. So I used genetic algorithms, which seem natural enough here since we are dealing with the evolution of behavioral strategies.

There are several libraries for genetic algorithms. I started using galib, a very nice library. However, I found it hard to use for my problem, where fitness is the result of repeated interactions between the genotypes (and not something you evaluate in a sweep over the population at each generation); this is doable with galib, but I found it hard and awkward. Thus, to learn more C++ and to have more control, I wrote a set of classes and methods for genetic algorithms (ga.cpp, ga.h), together with the code for the loser/winner part (fighting.cpp), plus a few helper functions, etc.
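To show what "fitness as the result of repeated interactions between the genotypes" means in practice, here is a minimal Python sketch. It is not a translation of ga.cpp or fighting.cpp: the hawk-dove payoffs, the single-number genotype (a probability of escalating a fight), the binary tournament selection, and all function and parameter names are assumptions made for illustration, and loser/winner effects themselves are not modelled.

```python
import random

V, C = 2.0, 4.0  # hypothetical resource value and cost of an escalated fight


def contest(p_a, p_b, rng):
    """Play one contest; each genotype is its probability of escalating."""
    a_hawk = rng.random() < p_a
    b_hawk = rng.random() < p_b
    if a_hawk and b_hawk:
        return (V - C) / 2, (V - C) / 2   # both escalate and share the cost
    if a_hawk:
        return V, 0.0
    if b_hawk:
        return 0.0, V
    return V / 2, V / 2                   # neither escalates: split the resource


def evaluate(population, rounds, rng):
    """Fitness is the total payoff over repeated random pairings."""
    fitness = [0.0] * len(population)
    indices = list(range(len(population)))
    for _ in range(rounds):
        rng.shuffle(indices)
        # Pair up neighbours in the shuffled order (an odd one out sits idle).
        for i, j in zip(indices[::2], indices[1::2]):
            pay_i, pay_j = contest(population[i], population[j], rng)
            fitness[i] += pay_i
            fitness[j] += pay_j
    return fitness


def next_generation(population, fitness, rng, mutation_sd=0.05):
    """Binary tournament selection, then Gaussian mutation clipped to [0, 1]."""
    n = len(population)
    children = []
    for _ in range(n):
        i, j = rng.randrange(n), rng.randrange(n)
        parent = population[i] if fitness[i] >= fitness[j] else population[j]
        child = parent + rng.gauss(0.0, mutation_sd)
        children.append(min(1.0, max(0.0, child)))
    return children


def evolve(pop_size=100, generations=50, rounds=20, seed=1):
    """Run the whole loop and return the final population of genotypes."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        fitness = evaluate(population, rounds, rng)
        population = next_generation(population, fitness, rng)
    return population


if __name__ == "__main__":
    final = evolve()
    print(sum(final) / len(final))  # mean escalation probability
```

The point of the design is that fitness cannot be computed per individual in isolation: evaluate has to run the contests between members of the current population before selection can take place.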

Please note that the documentation is non-existent (you'll need to read the comments), that there are a few comments in Spanish and Spanglish, and that indentation and line width are "peculiar" (set to fit my monitor and usage of XEmacs). It's been a while since I worked on these issues. But if you use this code, I'd appreciate it if you let me know.

To run it you will need to install Blitz++ and libRmath, the stand-alone math library from R. If you use Debian GNU/Linux, this is as easy as:

apt-get install blitz
apt-get install r-mathlib

I've also run it with other GNU/Linux distributions.

Download the code. This software is released under the GNU GPL.

Ramon Diaz-Uriarte
Last modified: August 2014
