Make caret's genetic feature selection faster



























I am trying to optimize an xgboost tree using feature selection with caret's genetic algorithm:



results <- gafs(iris[, 1:4], iris[, 5],
                iters = 2,
                method = "xgbTree",
                metric = "Accuracy",
                gafsControl = gafsControl(functions = caretGA,
                                          method = "cv",
                                          repeats = 2,
                                          verbose = TRUE),
                trControl = trainControl(method = "cv",
                                         classProbs = TRUE,
                                         verboseIter = TRUE))


This is, however, very slow, even though I am using only iters = 2 instead of the iters = 200 that would be more appropriate. What can I do to make this faster?











r genetic-algorithm r-caret






asked Jan 18 at 14:14









Make42













  • Parallelisation, reduced popSize, repeats = 1, ...

    – Julius Vainora
    Jan 18 at 14:40











  • @JuliusVainora: What popSize should I use and how do I parallelise?

    – Make42
    Jan 19 at 21:47











  • Not sure what the recommendations are for popSize. The end of the "Details" section in ?gafs covers parallelisation.

    – Julius Vainora
    Jan 19 at 21:54



















1 Answer
Here is an example of parallelising the gafs() function using the doParallel package and modifying a few other parameters to make it faster. Where possible I include run times.



The original code is using cross-validation (method = "cv") not repeated cross-validation (method = "repeatedcv"), so I believe the repeats = 2 parameter is ignored. I didn't include that option in the parallelised example.
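If repeated cross-validation were actually intended, the resampling method would need to change so that repeats takes effect. A minimal sketch of the control object, assuming the default of 10 folds:

gafsControl(functions = caretGA,
            method = "repeatedcv",  # repeats is honoured with this method
            number = 10,
            repeats = 2,
            verbose = TRUE)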



First, using the original code without any modifications or parallelisation:



> library(caret)
> data(iris)

> set.seed(1)
> st.01 <- system.time(results.01 <- gafs(iris[, 1:4], iris[, 5],
                       iters = 2,
                       method = "xgbTree",
                       metric = "Accuracy",
                       gafsControl = gafsControl(functions = caretGA,
                                                 method = "cv",
                                                 repeats = 2,
                                                 verbose = TRUE),
                       trControl = trainControl(method = "cv",
                                                classProbs = TRUE,
                                                verboseIter = TRUE)))

Fold01 1 0.9596575 (1)
Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *
Fold02 1 0.9598146 (1)
Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *
Fold03 1 0.9502661 (1)


I ran the above code overnight (8 to 10 hours) but stopped it because it was taking too long to finish. A very rough estimate of the total run time would be at least 24 hours.



Second, with the popSize parameter reduced (from the default 50 to 20), the allowParallel and genParallel options added to gafsControl(), and the number of folds reduced (from 10 to 5) in both gafsControl() and trainControl():



> library(doParallel)
> cl <- makePSOCKcluster(detectCores() - 1)
> registerDoParallel(cl)

> set.seed(1)
> st.09 <- system.time(results.09 <- gafs(iris[, 1:4], iris[, 5],
                       iters = 2,
                       popSize = 20,
                       method = "xgbTree",
                       metric = "Accuracy",
                       gafsControl = gafsControl(functions = caretGA,
                                                 method = "cv",
                                                 number = 5,
                                                 verbose = TRUE,
                                                 allowParallel = TRUE,
                                                 genParallel = TRUE),
                       trControl = trainControl(method = "cv",
                                                number = 5,
                                                classProbs = TRUE,
                                                verboseIter = TRUE)))

final GA
1 0.9508099 (4)
2 0.9508099->0.9561501 (4->1, 25.0%) *
final model
> st.09
user system elapsed
3.536 0.173 4152.988


My system has 4 cores, but as specified above it uses only 3; I verified that 3 R processes were running.
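Once the run has finished, the cluster can be shut down and the sequential backend restored with the standard parallel/foreach calls:

> stopCluster(cl)    # stop the 3 worker processes
> registerDoSEQ()    # return foreach to sequential execution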



The gafsControl() documentation describes allowParallel and genParallel as follows:




  • allowParallel: if a parallel backend is loaded and available,
    should the function use it?

  • genParallel: if a parallel backend is loaded and available, should
    'gafs' use it to parallelize the fitness calculations within
    a generation within a resample?



The caret documentation suggests the allowParallel option will give a bigger run time improvement than the genParallel option:
https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html



I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:



> results.09

Genetic Algorithm Feature Selection

150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'

Maximum generations: 2
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0

Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy

External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (5 fold)

During resampling:
* the top 4 selected variables (out of a possible 4):
Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)
* on average, 1.6 variables were selected (min = 1, max = 4)

In the final search using the entire training set:
* 4 features selected at iteration 1 including:
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
* external performance at this iteration is

Accuracy Kappa
0.9467 0.9200





answered yesterday by makeyourownmaker
  • a) This is such a long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time for a real-life industrial dataset... I am no expert in classification (I mostly research unsupervised methods), so I am really surprised - if not confused... What is going on here?

    – Make42
    yesterday













  • b) How long did it take the second version to run?

    – Make42
    yesterday











  • b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.

    – makeyourownmaker
    yesterday













  • a) I agree, the training time is too long. There are two levels of cross-validation, though. I'd bet gafs() isn't slow on the xgboost side unless it's being called too many times; my guess is that the genetic algorithm side is slow. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function, which would have only one level of cross-validation (see the sketch after these comments).

    – makeyourownmaker
    yesterday













  • Tree-based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R library plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method = 'xgbTree' will be a great deal faster than the gafs() function on the iris data.

    – makeyourownmaker
    18 hours ago
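For reference, a minimal sketch of the glmnet alternative mentioned in the comments above, with a single level of cross-validation; the 5-fold setup and the seed here are illustrative choices:

> library(caret)
> set.seed(1)
> fit <- train(iris[, 1:4], iris[, 5],
               method = "glmnet",
               metric = "Accuracy",
               trControl = trainControl(method = "cv",
                                        number = 5,  # illustrative 5-fold CV
                                        classProbs = TRUE))
> # Features whose coefficients are shrunk exactly to zero by the L1 penalty
> # are effectively deselected:
> coef(fit$finalModel, s = fit$bestTune$lambda)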










