Make caret's genetic feature selection faster
I am trying to optimize an xgboost tree by using feature selection with caret's genetic algorithm:
results <- gafs(iris[,1:4], iris[,5],
                iters = 2,
                method = "xgbTree",
                metric = "Accuracy",
                gafsControl = gafsControl(functions = caretGA, method = "cv", repeats = 2, verbose = TRUE),
                trControl = trainControl(method = "cv", classProbs = TRUE, verboseIter = TRUE)
)
However, this is very slow, even though I am only using iters = 2 instead of the more appropriate iters = 200. What can I do to make this faster?
Tags: r, genetic-algorithm, r-caret
Parallelisation, reduced popSize, repeats = 1, ... – Julius Vainora, Jan 18 at 14:40

@JuliusVainora: What popSize should I use and how do I parallelise? – Make42, Jan 19 at 21:47

Not sure what the recommendations for popSize are. The end of the "Details" section in ?gafs covers parallelisation. – Julius Vainora, Jan 19 at 21:54
asked Jan 18 at 14:14 by Make42
1 Answer
Here is an example of parallelising the gafs() function using the doParallel package, along with a few other parameter changes to make it faster. Where possible I include run times.

The original code uses cross-validation (method = "cv"), not repeated cross-validation (method = "repeatedcv"), so I believe the repeats = 2 parameter is ignored. I didn't include that option in the parallelised example.

First, the original code without any modifications or parallelisation:
> library(caret)
> data(iris)
> set.seed(1)
> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],
iters = 2,
method = "xgbTree",
metric = "Accuracy",
gafsControl = gafsControl(functions = caretGA,
method = "cv",
repeats = 2,
verbose = TRUE),
trControl = trainControl(method = "cv",
classProbs = TRUE,
verboseIter = TRUE)))
Fold01 1 0.9596575 (1)
Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *
Fold02 1 0.9598146 (1)
Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *
Fold03 1 0.9502661 (1)
I ran the above code overnight (8 to 10 hours) but stopped it because it was taking too long to finish. A very rough estimate of the total run time would be at least 24 hours.
Second, with a reduced popSize (from 50 to 20), the allowParallel and genParallel options added to gafsControl(), and finally the number of folds reduced (from 10 to 5) in both gafsControl() and trainControl():
> library(doParallel)
> cl <- makePSOCKcluster(detectCores() - 1)
> registerDoParallel(cl)
> set.seed(1)
> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],
iters = 2,
popSize = 20,
method = "xgbTree",
metric = "Accuracy",
gafsControl = gafsControl(functions = caretGA,
method = "cv",
number = 5,
verbose = TRUE,
allowParallel = TRUE,
genParallel = TRUE),
trControl = trainControl(method = "cv",
number = 5,
classProbs = TRUE,
verboseIter = TRUE)))
final GA
1 0.9508099 (4)
2 0.9508099->0.9561501 (4->1, 25.0%) *
final model
> st.09
user system elapsed
3.536 0.173 4152.988
My system has 4 cores but as specified it is using only 3, and I verified that it was running 3 R processes.
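One housekeeping detail the run above omits: the PSOCK cluster stays alive after gafs() returns, so the worker processes linger. A minimal cleanup sketch (my addition, assuming the cl object created earlier):

```r
library(doParallel)

cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)

# ... run gafs(...) as above ...

# Shut the workers down and restore the sequential foreach backend:
stopCluster(cl)
registerDoSEQ()
```

Without registerDoSEQ(), later %dopar% calls may warn about the stale backend.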
The gafsControl() documentation describes allowParallel and genParallel like so:

allowParallel: if a parallel backend is loaded and available, should the function use it?

genParallel: if a parallel backend is loaded and available, should 'gafs' use it to parallelize the fitness calculations within a generation within a resample?
The caret documentation suggests the allowParallel option will give a bigger run time improvement than the genParallel option:
https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html
I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:
> results.09
Genetic Algorithm Feature Selection
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
Maximum generations: 2
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0
Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy
External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (5 fold)
During resampling:
* the top 4 selected variables (out of a possible 4):
Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)
* on average, 1.6 variables were selected (min = 1, max = 4)
In the final search using the entire training set:
* 4 features selected at iteration 1 including:
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
* external performance at this iteration is
Accuracy Kappa
0.9467 0.9200
a) This is such a long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time for a real-life industrial dataset... I am not an expert in classification (I mostly research unsupervised methods), so I am really surprised, if not confused... What is going on here?
– Make42
yesterday
b) How long did it take the second version to run?
– Make42
yesterday
b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.
– makeyourownmaker
yesterday
a) I agree, training time is too long. There are two levels of cross-validation though. I'd bet gafs() isn't slow on the xgboost aspect unless it's being called too many times. My guess is that it's slow on the genetic algorithm aspect. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function which would have only one level of cross-validation.
– makeyourownmaker
yesterday
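For reference, the glmnet route mentioned in the comment above might look roughly like this (a sketch, not part of the original answer; it assumes the glmnet package is installed). The lasso/elastic-net penalty shrinks coefficients of uninformative predictors toward zero, so feature selection falls out of a single train() call with one level of cross-validation:

```r
library(caret)
data(iris)

set.seed(1)
fit <- train(iris[, 1:4], iris[, 5],
             method = "glmnet",      # regularised multinomial regression
             metric = "Accuracy",
             trControl = trainControl(method = "cv", number = 5))

# Predictors whose coefficients are shrunk to zero are effectively dropped;
# varImp() ranks the survivors:
varImp(fit)
```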
Tree based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R library plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method = 'xgbTree' will be a great deal faster than the gafs() function on the iris data.
– makeyourownmaker
18 hours ago
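The importance-based pruning suggested in that last comment could be sketched like so (my sketch, not code from the answer; 'xgbTree' is caret's identifier for xgboost tree models):

```r
library(caret)
data(iris)

set.seed(1)
# A single train() call: one level of cross-validation, no genetic search.
fit <- train(iris[, 1:4], iris[, 5],
             method = "xgbTree",
             metric = "Accuracy",
             trControl = trainControl(method = "cv", number = 5))

# Features that sit consistently at the bottom of this ranking are
# candidates for manual removal:
varImp(fit)
```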
answered yesterday, edited yesterday, by makeyourownmaker
1
a) This is such long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time for a real-life industrial dataset... I am not expert in classification (I am mostly researching unsupervised methods), so I am really surprised - if not confused... What is going on here?
– Make42
yesterday
b) How long did it take the second version to run?
– Make42
yesterday
b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.
– makeyourownmaker
yesterday
a) I agree, training time is too long. There are two levels of cross-validation though. I'd bet gafs() isn't slow on the xgboost aspect unless it's being called too many times. My guess is that it's slow on the genetic algorithm aspect. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function which would have only one level of cross-validation.
– makeyourownmaker
yesterday
Tree based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R library plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method='xgboost' will be a great deal faster than the gafs() function on the iris data.
– makeyourownmaker
18 hours ago
Parallelisation, reduced popSize, repeats = 1, ...
– Julius Vainora
Jan 18 at 14:40
@JuliusVainora: What popSize should I use and how do I parallelise?
– Make42
Jan 19 at 21:47
Not sure what the recommendations for popSize are. The end of the "Details" section in ?gafs is on parallelisation.
– Julius Vainora
Jan 19 at 21:54