Ramón Díaz-Uriartetex2html_wrap_inline$^1,3$, Sara Alvarez de Andréstex2html_wrap_inline$^2$
tex2html_wrap_inline$^1$Bioinformatics Unit, tex2html_wrap_inline$^2$Cytogenetics Unit
Biotechnology Programme
Spanish National Cancer Center (CNIO)
Melchor Fernández Almagro 3
Madrid, 28029
Spain.
tex2html_wrap_inline$^3$ Author for correspondence.
rdiaz@cnio.es
http://ligarto.org/rdiaz
Date:
2005-03-14
Running Head: Gene selection with random forest.
Random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its use for gene selection.
We first show the effects of changes in parameters of random forest on the prediction error rate with microarray data. Then we present two approaches for gene selection with random forest: 1) comparing scree plots of variable importance from original and permuted data sets; 2) using backwards variable elimination. Using simulated and real microarray data, we show: 1) scree plots can be used to recover the full set of genes related to the outcome of interest, without being adversely affected by collinearities; 2) backwards variable elimination yields small sets of genes while preserving predictive accuracy (compared to several state-of-the art algorithms). Thus, both methods are useful for gene selection.
R code is available from the authors under the GNU GPL.
Supplementary information: http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.htmlhttp://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html
Random forest is an algorithm for classification developed by Leo Breiman (Breiman, 2001b) that uses an ensemble of classification trees (Breiman et al., 1984; Hastie et al., 2001; Ripley, 1996). Each of the classification trees is built using a bootstrap sample of the data, and at each split the candidate set of variables is a random subset of the variables. Thus, random forest uses both bagging (bootstrap aggregation), a successful approach for combining unstable learners (Breiman, 1996; Hastie et al., 2001), and random variable selection for tree building. Each tree is unpruned (grown fully), so as to obtain low-bias trees; at the same time, bagging and random variable selection result in low correlation of the individual trees. The algorithm yields an ensemble that can achieve both low bias and low variance (from averaging over a large ensemble of low-bias, high-variance but low correlation trees).
Random forest has excellent performance in classification tasks,
comparable to support vector machines. Although random forest is not
widely used in the microarray literature (but see Alvarez et al., 2005; Gunther et al., 2003; Izmirlian, 2004; Man et al., 2004; Schwender et al., 2004; Wu et al., 2003), it has several characteristics that make it
ideal for these data sets: a) can be used when there are many more
variables than observations; b) has good predictive performance
even when most predictive variables are noise; c) does not
overfit; d) can handle a mixture of categorical and continuous
predictors; e) incorporates interactions among predictor variables; f)
the output is invariant to monotone transformations of the predictors;
g) there are high quality and free implementations: the original
Fortran code from L. Breiman and A. Cutler, and an R package from
A. Liaw and M. Wiener (Liaw & Wiener, 2002); h) there is little need to
fine-tune parameters to achieve excellent performance. The most
important parameter to choose is
, the number of input variables
tried at each split, but it has been reported that the default value
is often a good choice (Liaw & Wiener, 2002). In addition, the user needs
to decide how many trees to grow for each forest (
) as well as
the minimum size of the terminal nodes (
). These three
parameters will be throughly examined in this paper.
Given these promising features, it is important to understand the
performance of random forest compared to alternative state-of-the-art
prediction methods with microarray data sets, as well as the effects
of changes in the parameters of random forest. In this paper we present, as necessary
background for the main topic of the paper (gene selection), the first
through examination of these issues, including evaluating the effects
of
,
and
on error rate using
nine real microarray data sets and simulated data.
The main question addressed in this paper is gene selection using
random forest. A few authors have previously used variable selection
with random forest. Dudoit & Fridlyand (2003), Wu et al. (2003) use
filtering approaches and, thus, do not take advantage of the measures of
variable importance returned by random forest as part of the
algorithm. Svetnik et al. (2004) propose a method that is somewhat similar
to our second approach. The main difference is that Svetnik et al. (2004)
first find the ``best'' dimension (
) of the model, and then choose
the
most important variables. This is a sound strategy when the
objective is to build accurate predictors, without any regards for
model interpretability. But this might not be the most appropriate
for our purposes as it shifts the emphasis away from selection of
specific genes, and in genomic studies the identity of the selected
genes is relevant (e.g., to understand molecular pathways or to
find targets for drug development).
The last issue addressed in this paper is the multiplicity (or lack of uniqueness or stability) problem. Variable selection with microarray data can lead to many solutions that are equally good from the point of view of prediction rates, but that share few common genes. This multiplicity problem has been emphasized by Somorjai et al. (2003) and a recent example is shown in Ein-Dor et al. (2005). Although lack of uniqueness of results is not a problem when the only objective of our method is prediction, it casts serious doubts on the biological interpretability of the results (Somorjai et al., 2003). Unfortunately most ``methods papers'' in bioinformatics do not evaluate the stability of the results obtained, leading to a false sense of trust on the biological interpretability of the output obtained. Our paper presents a through and critical evaluation of the stability of the lists of selected genes with the proposed (and two competing) methods.
When facing gene selection problems, biomedical researchers often show interest in one of the following objectives:
Random forest returns several measures of variable importance. The most reliable measure is based on the decrease of classification accuracy when values of a variable in a node of a tree are permuted randomly (Breiman, 2001b; Bureau et al., 2003), and this is the measure of variable importance (in its unscaled version --see supplementary material) that we will use in the rest of the paper.
It is often helpful to plot ordered variable importances, yielding plots that resemble the ``scree plots'' or ``scree graphs'' common in principal component analysis (Jolliffe, 2002). We can then compare the observed plot with similar plots generated under an appropriate null hypothesis. This idea is similar to some proposals on the principal component literature (e.g., Jolliffe, 2002) and the ``importance spectrum'' plots in the biclustering algorithm from Friedman & Meulman (2005). In our case we want to compare the scree plot from the original data with scree plots that are generated when the class labels and the predictors are independent. Therefore we permute only the class labels, leaving intact the correlation structure of the predictors. This approach to gene selection is targeted towards the first objective above: the identification of relevant genes for subsequent research.
A different approach is to iteratively refit random forests, at each iteration building a new forest after discarding those variables (genes) with the smallest variable importances, and then select the set of genes that yields the smallest error rate. Random forest returns a measure of error rate based on the out-of-bag cases for each fitted tree, the OOB error, and this is the criterion of selection we will use. Note that in this section we are using OOB error to choose the final set of genes, not to obtain unbiased estimates of the error rate of this rule. Because of the iterative approach, the OOB error is biased down and cannot be used to asses the overall error rate of the approach, for reasons analogous to those leading to ``selection bias'' (Ambroise & McLachlan, 2002; Simon et al., 2003a). To assess prediction error rates we will use the bootstrap, not OOB error (see section 3.3).
In our algorithm we examine all forest that result from eliminating,
iteratively, a fraction,
, of the genes (the
least important ones) used in the previous iteration. By default,
which allows for relatively fast operation,
is coherent with the idea of an ``aggressive variable selection''
approach, and increases the resolution as the number of genes
considered becomes smaller. We do not recalculate variable
importances at each step as Svetnik et al. (2004) mention severe overfitting
resulting from recalculating variable importances. After fitting all
forests, we examine the OOB error rates from all the fitted random
forests. We choose the solution with the smallest number of genes
whose error rate is within
standard errors of the minimum error
rate of all forests.
Setting
is the same as selecting the set of genes that
leads to the smallest error rate. Setting
is similar to the
common ``1 s.e. rule'', used in the classification trees literature
(Breiman et al., 1984; Ripley, 1996); this strategy can lead to solutions with
fewer genes than selecting the solution with the smallest error
rate, while achieving an error rate that is not
different, within sampling error, from the ``best solution''. In this
paper we will examine both the ``1 s.e. rule'' and the ``0 s.e.
rule''.
| Dataset | Original ref. | Genes | Patients | Classes |
| Leukemia | Golub et al. (1999) | 3051 | 38 | 2 |
| Breast | van 't Veer et al. (2002) | 4869 | 78 | 2 |
| Breast | van 't Veer et al. (2002) | 4869 | 96 | 3 |
| NCI 60 | Ross et al. (2000) | 5244 | 61 | 8 |
| Adenocar- | ||||
| cinoma | Ramaswamy et al. (2003) | 9868 | 76 | 2 |
| Brain | Pomeroy et al. (2002) | 5597 | 42 | 5 |
| Colon | Alon et al. (1999) | 2000 | 62 | 2 |
| Lymphoma | Alizadeh et al. (2000) | 4026 | 62 | 3 |
| Prostate | Singh et al. (2002) | 6033 | 102 | 2 |
| Srbct | Khan et al. (2001) | 2308 | 63 | 4 |
To evaluate if the proposed procedure can recover the signal in the
data, we need to use simulated data, so that we know exactly which
genes are relevant. Data have been simulated using different numbers
of classes of patients (2 to 4), number of independent dimensions (1
to 3), and number of genes per dimension (5, 20, 100). In all cases,
we have set to 25 the number of subjects per class. Each independent
dimension has the same relevance for discrimination of the classes.
The data come from a multivariate normal distribution with variance of
1, a (within-class) correlation among genes within dimension of
0.9, and a within-class correlation of 0 between genes from different
dimensions, as those are independent. The multivariate means have
been set so that the unconditional prediction error rate
(McLachlan, 1992) of a linear discriminant analysis using one gene
from each dimension is approximately 5%. To each data set we have
added 2000 random normal variates (mean 0, variance 1) and 2000 random
uniform
variates. In addition, we have generated data sets
for 2, 3, and 4 classes where no genes have signal (all 4000 genes are
random). For the non-signal data sets we have generated four
replicate data sets for each level of number of classes. Further
details are provided in the supplementary material.
We have compared the predictive performance of the variable selection
approach with: a) random forest without any variable selection (using
,
,
); b) three other methods that have shown good
performance in reviews of classification methods with microarray data
(Dettling, 2004; Dudoit et al., 2002; Romualdi et al., 2003) but that do not include
any variable selection; c) two methods that carry out
variable selection.
For the three methods that do not carry out variable selection,
Diagonal Linear Discriminant Analysis (DLDA), K
nearest neighbor (KNN), and Support Vector Machines (SVM)
with linear kernel, we have used, based on Dudoit et al. (2002), the 200
genes with the largest
-ratio of between to within groups sums of
squares. For KNN, the number of neighbors (
) was
chosen by cross-validation as in Dudoit et al. (2002).
One of the methods that incorporates gene selection is
Shrunken centroids (SC), developed by Tibshirani et al. (2002). We
have used two different approaches to determine the best number of
features. In the first one, SC.l, we choose the number of
genes that minimizes the cross-validated error rate and, in case of
several solutions with minimal error rates, we choose the one with
largest likelihood. In the second approach, SC.s, we choose
the number of genes that minimizes the cross-validated error rate and,
in case of several solutions with minimal error rates, we choose the
one with smallest number of genes (larger penalty). The second method
that incorporates gene selection is Nearest neighbor +
variable selection (NN.vs), where we filter genes using the
F-ratio, and select the number of genes that leads to the smallest
error rate; in our implementation, we run a Nearest Neighbor
classifier (KNN with K = 1) on all subsets of genes that result from
eliminating
of the genes (the ones with the smallest F-ratio)
used in the previous iteration. This approach, in its many variants
(changing both the classifier and the ordering criterion) is popular
in microarray papers; a recent example is Roepman et al. (2005), and
similar general strategies are implemented in the program Tnasas
(Herrero et al., 2004). Further
details of all these methods are provided in the supplementary
material. All simulations and analyses were carried out with R
(http://www.r-project.org; R Development Core Team, 2004), using
packages randomForest (from A. Liaw and M. Wiener) for random
forest, e1071 (E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and
A. Weingessel) for SVM, class (B. Ripley and W. Venables) for KNN,
and PAM (see Tibshirani et al., 2002) for shrunken centroids, and
geSignatures (by R.D.-U.) for DLDA.
|
|
Preliminary data suggested that
and
could affect the
shape of scree plots and their ``sharpness''. Thus, most of the
results presented here (either in the main paper or the supplementary
material) use a range of
values and
values (40000, 20000, 5000 for real data; 40000 and 5000 for simulated
data). At the same time, use of OOB error rate as a guidance to
select
could be affected by
and, potentially,
. Thus, we first examined whether the OOB error rate is
substantially affected by changes in
,
, and
.
Figure 1 and the supplementary material
(Figure
``error.vs.mtry.pdf''), however, show that, for both real and
simulated data, the relation of OOB error rate with
is largely
independent of
(for
between 1000 and 40000) and
(nodesizes 1 and 5). In addition, the default setting of
(
in the figures) is often a good choice
in terms of OOB error rate. In some cases, increasing
can lead
to small decreases in error rate, and decreases in
often lead
to increases in the error rate. Since the OOB error and the relation
between OOB error and
do not change whether we use
of
1 or 5, and because the increase in speed from using
of 5
is inconsequential, all further analyses will use only the default
.
Figure 2 shows ``scree plots'', together
with random scree plots from data with permuted class labels (using 40
permutations), for a representative set of simulations. Another 216
scree plots, from the other simulation scenarios,
s, and
s, are shown in the supplementary material. The supplementary
material also shows scree plots for simulated data when no genes are
related to the class variable.
All these plots show the
same qualitative patterns. First, in the absence of any signal, the
scree plots from the original data are not different from those of
the data with permuted class labels (see supplementary figure:
``scree.plots.no-signal.pdf''). Second, in the presence of relevant
genes, in virtually all cases the ``relevant genes'' have variable
importances well above those from the permuted data. Moreover, the
ordering of the variable importances is such that the relevant genes
are almost always ranked above any non-relevant variable. Third,
except for a few cases (e.g., 2 classes, 3 independent components,
and 100 genes per relevant component), the scree plots show a sharp
jump in importance between relevant and non-relevant genes.
Together, these results indicate not only that scree plots can be
used to recover relevant genes, but that even in situations of
extremely high redundancy (e.g., with 100 genes per independent
dimension, with a correlation between genes of 0.9) the variable
importances returned by random forest can be used as guidance for
gene selection. The figures, however, show two limitations of scree
plots: first, a few non-relevant genes also tend to have variable
importances above the scree plots from the permuted data, thus
suggesting that selection of genes using scree plots could lead to
some false positives; second, there are at least some cases where
there is no abrupt difference between relevant and non-relevant genes
in the scree plot, making it hard to determine boundaries for gene
selection. The latter is easier to observe when the number of
independent dimensions is equal or larger than the number of classes.
The effects of changes in
and
can be seen in the
supplementary figures. Using 40000 instead of 5000 trees (the
parameter) has negligible effects. The effects of the
,
however, are more complex. On the one hand, increasing
leads
to larger importance values for some of the relevant genes, but at
the same time as
increases the ``sharpness'' of the scree plot
decreases, in particular in cases with 100 highly correlated genes,
and thus makes it harder to use scree plots for guidance in
situations of high correlation between predictive genes. Extreme
effects of these patterns can be seen in the supplementary material
where we use
values equal to the number of variables (genes)
in the data set (essentially turning random forest into bagging) or
equal to half of the genes in the data set.
|
|
Data set |
no info | SVM | KNN | DLDA | SC.l | SC.s | NN.vs | random forest | random forest var.sel. | |
| s.e. 0 | s.e. 1 | |||||||||
| Leukemia | 0.289 | 0.014 | 0.029 | 0.020 | 0.025 | 0.062 | 0.056 | 0.051 | 0.087 | 0.075 |
| Breast 2 cl. | 0.429 | 0.325 | 0.337 | 0.331 | 0.324 | 0.326 | 0.337 | 0.342 | 0.337 | 0.332 |
| Breast 3 cl. | 0.537 | 0.380 | 0.449 | 0.370 | 0.396 | 0.401 | 0.424 | 0.351 | 0.346 | 0.364 |
| NCI 60 | 0.852 | 0.256 | 0.317 | 0.286 | 0.256 | 0.246 | 0.237 | 0.252 | 0.327 | 0.353 |
| Adenocar. | 0.158 | 0.203 | 0.174 | 0.194 | 0.177 | 0.179 | 0.181 | 0.125 | 0.185 | 0.207 |
| Brain | 0.761 | 0.138 | 0.174 | 0.183 | 0.163 | 0.159 | 0.194 | 0.154 | 0.216 | 0.216 |
| Colon | 0.355 | 0.147 | 0.152 | 0.137 | 0.123 | 0.122 | 0.158 | 0.127 | 0.159 | 0.177 |
| Lymphoma | 0.323 | 0.010 | 0.008 | 0.021 | 0.028 | 0.033 | 0.04 | 0.009 | 0.047 | 0.042 |
| Prostate | 0.490 | 0.064 | 0.100 | 0.149 | 0.088 | 0.089 | 0.081 | 0.077 | 0.061 | 0.064 |
| Srbct | 0.635 | 0.017 | 0.023 | 0.011 | 0.012 | 0.025 | 0.031 | 0.021 | 0.039 | 0.038 |
The supplementary material (see
``scree.plots.real.data.pdf'' and ``scree.plots.real.huge.mtry.pdf'')
includes scree plots for other values of
and
. Changes
in
are often inconsequential (because of the high correlation
between variable importances, as shown in figure
``real.data.importances.scatter.pdf'' in the supplementary material).
Increasing
, however, would lead to choosing fewer genes, since
larger
values are associated with a faster approach of the
importances of the real data to the importances of the permuted data,
a trend we already shaw in the simulated data; however, this decrease
in the number of selected genes might lead to the exclusion of
highly correlated genes (see results for simulated data, and
discussion). The relative ordering of the genes, based on variable
importances, however, does not change with
(see
``real.data.importances.pairs.pdf'' in supplementary material);
increases in
lead to a compression of the relative range of
importances of all but a few of the most extreme genes.
Results for the real data sets are shown in Tables 2 and
3 (see also supplementary material, Tables 5, 6, 7,
for additional results using different combinations of
,
). Error rates (see Table
2) when performing variable selection are comparable
(within sampling error) to those from random forest without variable
selection, and comparable also to the error rates from competing
state-of-the-art prediction methods. The number of genes selected
varies by data set, but in most cases (Table 3) the
variable selection procedure leads to small (
) sets of predictor
genes, often much smaller than those from competing approaches
(see also Table 8 in supplementary material). There are no relevant
differences in error rate related to differences in
,
or
whether we use the ``s.e. 1'' or ``s.e. 0'' rules. The use of the
``s.e. 1'' rule, however, tends to result in smaller sets of selected
genes.
To evaluate the stability of scree plots, we show selection
probability plots in Figure 4; in the figures,
has been set using as guidance Figure 3.
Overall,
genes ranked in the top
tend to appear in a large proportion
of bootstrap samples (e.g.,
); but those in the lower ranks,
even if well above the random scree plots, do not tend to be
repeatedly ranked among the top
genes in bootstrap samples.
|
| Data set | Error | # Genes | # Genes boot. | Freq. genes |
| Backwards elimination of genes from random forest | ||||
|
|
||||
| Leukemia | 0.087 | 2 | 2 (2, 2) | 0.38 (0.29, 0.48)1 |
| Breast 2 cl. | 0.337 | 14 | 9 (5, 23) | 0.15 (0.1, 0.28) |
| Breast 3 cl. | 0.346 | 110 | 14 (9, 31) | 0.08 (0.04, 0.13) |
| NCI 60 | 0.327 | 230 | 60 (30, 94) | 0.1 (0.06, 0.19) |
| Adenocar. | 0.185 | 6 | 3 (2, 8) | 0.14 (0.12, 0.15) |
| Brain | 0.216 | 22 | 14 (7, 22) | 0.18 (0.09, 0.25) |
| Colon | 0.159 | 14 | 5 (3, 12) | 0.29 (0.19, 0.42) |
| Lymphoma | 0.047 | 73 | 14 (4, 58) | 0.26 (0.18, 0.38) |
| Prostate | 0.061 | 18 | 5 (3, 14) | 0.22 (0.17, 0.43) |
| Srbct | 0.039 | 101 | 18 (11, 27) | 0.1 (0.04, 0.29) |
|
|
||||
| Leukemia | 0.075 | 2 | 2 (2, 2) | 0.4 (0.32, 0.5)1 |
| Breast 2 cl. | 0.332 | 14 | 4 (2, 7) | 0.12 (0.07, 0.17) |
| Breast 3 cl. | 0.364 | 6 | 7 (4, 14) | 0.27 (0.22, 0.31) |
| NCI 60 | 0.353 | 24 | 30 (19, 60) | 0.26 (0.17, 0.38) |
| Adenocar. | 0.207 | 8 | 3 (2, 5) | 0.06 (0.03, 0.12) |
| Brain | 0.216 | 9 | 14 (7, 22) | 0.26 (0.14, 0.46) |
| Colon | 0.177 | 3 | 3 (2, 6) | 0.36 (0.32, 0.36) |
| Lymphoma | 0.042 | 58 | 12 (5, 73) | 0.32 (0.24, 0.42) |
| Prostate | 0.064 | 2 | 3 (2, 5) | 0.9 (0.82, 0.99)1 |
| Srbct | 0.038 | 22 | 18 (11, 34) | 0.57 (0.4, 0.88) |
| Alternative approaches | ||||
| SC.s | ||||
| Leukemia | 0.062 | 822 | 46 (14, 504) | 0.48 (0.45, 0.59) |
| Breast 2 cl. | 0.326 | 31 | 55 (24, 296) | 0.54 (0.51, 0.66) |
| Breast 3 cl. | 0.401 | 2166 | 4341 (2379, 4804) | 0.84 (0.78, 0.88) |
| NCI 60 | 0.246 | 5118 | 4919 (3711, 5243) | 0.84 (0.74, 0.92) |
| Adenocar. | 0.179 | 0 | 9 (0, 18) | NA (NA, NA) |
| Brain | 0.159 | 4177 | 1257 (295, 3483) | 0.38 (0.3, 0.5) |
| Colon | 0.122 | 15 | 22 (15, 34) | 0.8 (0.66, 0.87) |
| Lymphoma | 0.033 | 2796 | 2718 (2030, 3269) | 0.82 (0.68, 0.86) |
| Prostate | 0.089 | 4 | 3 (2, 4) | 0.72 (0.49, 0.92) |
| Srbct | 0.025 | 373 | 18 (12, 40) | 0.45 (0.34, 0.61) |
| NN.vs | ||||
| Leukemia | 0.056 | 512 | 23 (4, 134) | 0.17 (0.14, 0.24) |
| Breast 2 cl. | 0.337 | 88 | 23 (4, 110) | 0.24 (0.2, 0.31) |
| Breast 3 cl. | 0.424 | 9 | 45 (6, 214) | 0.66 (0.61, 0.72) |
| NCI 60 | 0.237 | 1718 | 880 (360, 1718) | 0.44 (0.34, 0.57) |
| Adenocar. | 0.181 | 9868 | 73 (8, 1324) | 0.13 (0.1, 0.18) |
| Brain | 0.194 | 1834 | 158 (52, 601) | 0.16 (0.12, 0.25) |
| Colon | 0.158 | 8 | 9 (4, 45) | 0.57 (0.45, 0.72) |
| Lymphoma | 0.04 | 15 | 15 (5, 39) | 0.5 (0.4, 0.6) |
| Prostate | 0.081 | 7 | 6 (3, 18) | 0.46 (0.39, 0.78) |
| Srbct | 0.031 | 11 | 17 (11, 33) | 0.7 (0.66, 0.85) |
To assess the stability of the backwards variable selection procedure,
Table 3 (see also supplementary material, Tables 5, 6, 7,
for other combinations of
)
shows the variation in the number of genes selected in bootstrap
samples, and the frequency with which the genes selected in the
original sample appear among the genes selected from the bootstrap
samples. In most cases, there is a wide range in the number of genes
selected; more importantly, the genes selected in the original samples
are rarely selected in more than 50% of the bootstrap samples. These
results are not strongly affected by variations in
or
;
using the ``s.e. 1'' rule can lead, in some cases, to increased
stability of the results.
As a comparison, we also show in Table 3 the stability of two alternative approaches for gene selection, the shrunken centroids method, and a filter approach combined with a Nearest Neighbor classifier (see Table 8 in the supplementary material for results of SC.l). Error rates are comparable, but both alternative methods lead to much larger sets of selected genes than backwards variable selection with random forests. The alternative approaches seem to lead to somewhat more stable results in variable selection but in practical applications this increase in stability is probably far out-weighted by the very large number of selected genes.
We have examined the performance of two approaches for gene selection using random forest, and compared them to alternative approaches. Our results, using both simulated and real microarray data sets, show that these two new methods of gene selection accomplish the proposed objectives.
The first approach for gene selection with random forest is based on scree-plots. This method can be used to select ``relevant genes'', even when there are very high correlations between these important genes. This is a particularly important feature, and one that sets it apart from many alternatives methods which are very adversely affected by multicolinearities. Especially in preliminary work oriented towards understanding biological mechanisms, methods that return only a few genes (the ones with stronger signal from subsets of high correlation) can lead to distorted pictures of the phenomenon under study. Therefore, a method such as scree plots from variable importances returned by random forest that can return large sets of highly correlated ``important genes'' is a useful tool.
The second approach for gene selection with random forest uses backwards variable elimination. This method returns very small sets of genes compared to two alternative variable selection methods, while retaining predictive performance comparable to that of seven alternative state-of-the-art methods. In contrast to the approach based on scree plots, the backwards variable elimination method will not return sets of genes that are highly correlated, because they are redundant. The backwards variable elimination methods will be most useful under two scenarios: a) when considering the design of diagnostic tools, where having a small set of probes is often desirable; b) to help understand the results from the scree-plot output, because it highlights which genes, out of those returned from the scree plots, have the largest signal to noise ratio and could be used as surrogates for complex processes involving many correlated genes. A backwards elimination method, precursor to the one used here, has been already used to predict breast tumor type based on chromosomic alterations (Alvarez et al., 2005).
We have also examined throughly the effects of changes in the
parameters of random forest (specifically
,
,
)
and the variable selection algorithm (
,
).
Changes in these parameters have in most cases negligible effects,
suggesting that the default values are often good options, but we can
make some general recommendations. Values of
larger than the
default (
) can increase the
distinctiveness of a few most important genes but at the same time
make it harder to recover sets of highly correlated genes: as
increases so does the likelihood that highly correlated genes will be
evaluated in the same node, thus increasing the chance that some genes
will mask the effects of other correlated genes.
The parameter
has a strong effect on the speed of execution of
the code. Larger
values lead to slightly more stable values
of variable importances, but for the data sets examined,
or
seems quite adequate, with further increases
having negligible effects (both in scree plots and the backwards
variable selection algorithm). The value of
has negligible
effects, and thus its default setting of 1 is appropriate. For the
backwards elimination algorithm, the parameter
can
be adjusted to modify the resolution of the number of variable
selected; smaller values of
lead to finer
resolution in the examination of number of genes, but to slower
execution of the code. Finally, the parameter
has also minor
effects on the results of the backwards variable selection algorithm
but a value of
leads to slightly more stable results.
The final issue addressed in this paper is instability or multiplicity of the selected sets of genes. From this point of view, the results are slightly disappointing. But so are the results of the competing methods. And so are the results of most examined methods so far with microarray data, as shown in Ein-Dor et al. (2005) and discussed throughly by Somorjai et al. (2003). However, and except for the above cited papers of Ein-Dor et al. (2005) Somorjai et al. (2003) and the review in Díaz-Uriarte (2005), this is an issue that still seems largely ignored in the microarray literature. As these papers and the statistical literature on variable selection (e.g., Breiman, 2001a; Harrell, 2001) discusses, the causes of the problem are small sample sizes and the extremely small ratio of samples to variables (i.e., number of arrays to number of genes). Thus, we might need to learn to live with the problem, and try to assess the stability and robustness of our results by using a variety of gene selection features, and examining whether there is a subset of features that tends to be repeatedly selected. This concern is explicitly taken into account in our results, and facilities for examining this problem are part of our R code.
The multiplicity problem, however, does not need to result in large prediction errors. This and other papers (Dettling & Bühlmann, 2004; Dettling, 2004; Dudoit et al., 2002; Romualdi et al., 2003; Simon et al., 2003b; Somorjai et al., 2003) show that very different classifiers often lead to comparable and successful error rates with a variety of microarray data sets. Thus, although improving predictor rates is important (specially if giving consideration to ROC curves, and not just overall prediction error rates; Pepe, 2003), when trying to address questions of biological mechanism or discover therapeutic targets, probably a more challenging and relevant issue is to identify sets of genes with biological relevance.
Two areas of future research are increasing the steepness of the scree plots when there are large numbers of highly correlated genes and improving the computational efficiency of these approaches; in the present work, we have used parallelization of the ``embarrassingly parallelizable'' tasks using MPI with the Rmpi and Snow packages (Tierney et al., 2004; Yu, 2004) for R. In a broader context, further work is warranted on the stability properties of this and other gene-selection approaches, because the multiplicity problem casts doubts on the biological interpretability of most results based on a single run of one gene-selection approach.
Most of the simulations and analyses were carried out in the Beowulf cluster of the Bioinformatics unit at CNIO, financed by the RTICCC from the FIS; J. M. Vaquerizas provided help with the administration of the cluster. A. Liaw provided discussion, unpublished manuscripts, and code. C. Lázaro-Perea provided many discussions and comments on the ms. A. Sánchez provided comments on the ms. I. Díaz showed R.D.-U. the forest, or the trees, or both. R.D.-U. partially supported by the Ramón y Cajal program of the Spanish MCyT (Ministry of Science and Technology); S.A.A. supported by project C.A.M. GR/SAL/0219/2004; funding provided by project TIC2003-09331-C02-02 of the Spanish MCyT.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 rfVarSel.tex
The translation was initiated by ramon on 2005-03-15