Make caret's genetic feature selection faster

I am trying to optimize an xgboost tree by using feature selection with caret's genetic algorithm:

results <- gafs(iris[,1:4], iris[,5],
                iters = 2,
                method = "xgbTree",
                metric = "Accuracy",
                gafsControl = gafsControl(functions = caretGA, method = "cv", repeats = 2, verbose = TRUE),
                trControl = trainControl(method = "cv", classProbs = TRUE, verboseIter = TRUE)
                )

However, this is very slow, even though I am using only iters = 2 instead of iters = 200, which would be more appropriate. What can I do to make this faster?

asked Jan 18 at 14:14

Make42


Parallelisation, reduced popSize, repeats = 1, ...

– Julius Vainora
Jan 18 at 14:40

@JuliusVainora: What popSize should I use, and how do I parallelise?

– Make42
Jan 19 at 21:47

I'm not sure what the recommendations for popSize are. The end of the "Details" section in ?gafs covers parallelisation.

– Julius Vainora
Jan 19 at 21:54

r genetic-algorithm r-caret


1 Answer

Here is an example of parallelising the gafs() function using the doParallel package and modifying a few other parameters to make it faster. Where possible I include run times.

The original code is using cross-validation (method = "cv") not repeated cross-validation (method = "repeatedcv"), so I believe the repeats = 2 parameter is ignored. I didn't include that option in the parallelised example.
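To make the point about repeats concrete, here is a minimal sketch of the two control objects. It assumes caret is installed; the names ctrl_cv and ctrl_rcv are made up for illustration. As far as I understand gafs(), the repeats value only affects how the folds are built when method = "repeatedcv".

```r
library(caret)

# Plain 10-fold CV: one set of 10 external resamples.
ctrl_cv <- gafsControl(functions = caretGA, method = "cv", number = 10)

# Repeated 10-fold CV: the 10 folds are redrawn `repeats` times,
# so the genetic search runs on 20 external resamples.
ctrl_rcv <- gafsControl(functions = caretGA, method = "repeatedcv",
                        number = 10, repeats = 2)

# Inspect what was stored in each control object.
ctrl_cv$method
ctrl_rcv$method
ctrl_rcv$repeats
```

Switching to "repeatedcv" roughly doubles the work here, so it is the opposite of a speed-up; it is shown only to clarify where repeats applies.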

First, using the original code without any modifications or parallelisation:

> library(caret)
> data(iris)

> set.seed(1)
> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],
                                          iters  = 2,
                                          method = "xgbTree",
                                          metric = "Accuracy",
                                          gafsControl = gafsControl(functions = caretGA,
                                                                    method  = "cv",
                                                                    repeats = 2,
                                                                    verbose = TRUE),
                                          trControl = trainControl(method = "cv",
                                                                   classProbs  = TRUE,
                                                                   verboseIter = TRUE)))

Fold01 1 0.9596575 (1)
Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *
Fold02 1 0.9598146 (1)
Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *
Fold03 1 0.9502661 (1)

I ran the above code overnight (8 to 10 hours) but stopped it because it had not finished. A very rough estimate of the total run time is at least 24 hours.
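A back-of-the-envelope count of model fits helps explain the run time. The sketch below is only a rough estimate under stated assumptions: count_fits() is a hypothetical helper, not part of caret; it assumes every candidate subset is scored by one train() call, that every generation evaluates the full population, and that the tuning grid has 100 rows (a placeholder, the real grid size for xgbTree may differ).

```r
# Rough count of xgboost fits triggered by one gafs() call (hypothetical helper).
count_fits <- function(external_folds, iters, pop_size,
                       internal_resamples, grid_size) {
  # One train() call per candidate: one fit per internal resample and
  # tuning-grid row, plus one final fit on that candidate's training data.
  fits_per_candidate <- internal_resamples * grid_size + 1
  # Candidates evaluated in one genetic search.
  candidates_per_search <- iters * pop_size
  # One search per external fold, plus the final search on all the data.
  searches <- external_folds + 1
  searches * candidates_per_search * fits_per_candidate
}

grid_size <- 100  # assumed placeholder for the number of tuning-grid rows

# Original settings: 10 external folds, popSize 50, 10 internal folds.
original <- count_fits(10, 2, 50, 10, grid_size)
# Reduced settings: 5 external folds, popSize 20, 5 internal folds.
reduced <- count_fits(5, 2, 20, 5, grid_size)

original
reduced
original / reduced
```

Under these assumptions the original settings need about a million fits even with iters = 2, which is why shrinking popSize and both fold counts pays off before any parallelisation.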

Second, with a reduced popSize (from 50 to 20), the allowParallel and genParallel options passed to gafsControl(), and a reduced number of folds (from 10 to 5) in both gafsControl() and trainControl():

> library(doParallel)
> cl <- makePSOCKcluster(detectCores() - 1)
> registerDoParallel(cl)

> set.seed(1)
> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],
                                          iters   = 2,
                                          popSize = 20,
                                          method  = "xgbTree",
                                          metric  = "Accuracy",
                                          gafsControl = gafsControl(functions = caretGA,
                                                                    method    = "cv",
                                                                    number    = 5,
                                                                    verbose   = TRUE,
                                                                    allowParallel = TRUE,
                                                                    genParallel   = TRUE),
                                          trControl = trainControl(method      = "cv",
                                                                   number      = 5,
                                                                   classProbs  = TRUE,
                                                                   verboseIter = TRUE)))

 final GA
 1 0.9508099 (4)
 2 0.9508099->0.9561501 (4->1, 25.0%) *
 final model

> st.09
   user   system  elapsed
   3.536    0.173 4152.988

My system has 4 cores but, as specified, the code uses only 3; I verified that 3 R processes were running.

The gafsControl() documentation describes allowParallel and genParallel like so:

allowParallel: if a parallel backend is loaded and available,
should the function use it?

genParallel: if a parallel backend is loaded and available, should
'gafs' use it to parallelize the fitness calculations within
a generation within a resample?

The caret documentation suggests the allowParallel option will give a bigger run time improvement than the genParallel option:
https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html
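As a sketch of how the two switches and the backend fit together, the block below registers a small cluster, builds a control object that parallelises only via allowParallel, and then releases the workers. It assumes caret and doParallel are installed; the worker count of 2 and the names ctrl and workers are arbitrary choices for illustration, and the gafs() call itself is left out.

```r
library(caret)
library(doParallel)

# Start two worker processes and register them as the foreach backend.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
workers <- getDoParWorkers()  # number of workers foreach will use

# Use the backend (allowParallel) but do not additionally parallelise
# the fitness calculations within a generation (genParallel).
ctrl <- gafsControl(functions = caretGA, method = "cv", number = 5,
                    allowParallel = TRUE, genParallel = FALSE)

# ... call gafs(..., gafsControl = ctrl) here ...

# Release the workers and fall back to sequential execution.
stopCluster(cl)
registerDoSEQ()
```

Stopping the cluster matters on a shared machine: the worker R processes otherwise stay alive until the main session ends.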

I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:

> results.09

Genetic Algorithm Feature Selection

150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'

Maximum generations: 2
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0

Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy

External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (5 fold)

During resampling:
  * the top 4 selected variables (out of a possible 4):
    Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)
  * on average, 1.6 variables were selected (min = 1, max = 4)

In the final search using the entire training set:
   * 4 features selected at iteration 1 including:
     Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
   * external performance at this iteration is

   Accuracy       Kappa
     0.9467      0.9200

edited yesterday

answered yesterday

makeyourownmaker


a) This is such a long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time on a real-life industrial dataset? I am not an expert in classification (I mostly research unsupervised methods), so I am really surprised, if not confused. What is going on here?

– Make42
yesterday

b) How long did it take the second version to run?

– Make42
yesterday

b) The total run time for the second version was 4,153 seconds (see the elapsed time above); it is stored in the st.09 variable. Details on the three numbers in st.09 can be found with ?proc.time in an R session.

– makeyourownmaker
yesterday
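A tiny base R sketch of how to read those three numbers: elapsed is wall-clock time, while user and system are CPU time charged to the calling R process. Sys.sleep() stands in for real work here, and the variable name st is arbitrary. My assumption is that the same effect explains the small user time in st.09, since the model fitting ran in the worker processes rather than in the main session.

```r
# Time a call that waits for half a second without using much CPU.
st <- system.time(Sys.sleep(0.5))

# The first three entries are the ones printed as user, system, elapsed.
names(st)[1:3]

# Wall-clock time in seconds.
st[["elapsed"]]

# CPU time used by this R process, far smaller than the elapsed time.
st[["user.self"]]
```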

a) I agree, the training time is too long. There are two levels of cross-validation, though. I'd bet gafs() isn't slow on the xgboost side unless xgboost is being called too many times. My guess is that it's slow on the genetic algorithm side. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but that also has two levels of cross-validation. A slightly different option is to use a regularisation method for feature selection, like glmnet, with the train() function, which has only one level of cross-validation.

– makeyourownmaker
yesterday

Tree-based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables that cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R package plots feature importance as a bar graph. Excluding features that are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method = "xgbTree" will be a great deal faster than the gafs() function on the iris data.

– makeyourownmaker
18 hours ago
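A minimal sketch of that simpler route, assuming caret and xgboost are installed and work together in your versions: fit xgbTree once through train() with a single level of 5-fold cross-validation, then look at caret's variable importance. tuneLength = 1 is chosen only to keep the tuning grid small, and the names fit and imp are arbitrary.

```r
library(caret)

set.seed(1)
# One level of cross-validation and a deliberately small tuning grid.
fit <- train(iris[, 1:4], iris[, 5],
             method     = "xgbTree",
             tuneLength = 1,
             trControl  = trainControl(method = "cv", number = 5))

# Variable importance as reported by caret for the final model.
imp <- varImp(fit)
print(imp)
```

Features that sit at the bottom of this ranking across several seeds or resamples are candidates for removal, without running a genetic search.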

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54255794%2fmake-carets-genetric-feature-selection-faster%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Here is an example of parallelising the gafs() function using the doParallel package and modifying a few other parameters to make it faster. Where possible I include run times.

First, using the original code without any modifications or parallelisation:

> library(caret)

> data(iris)



> set.seed(1)

> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],

                                          iters  = 2, 

                                          method = "xgbTree", 

                                          metric = "Accuracy",

                                          gafsControl = gafsControl(functions = caretGA, 

                                                                    method  = "cv", 

                                                                    repeats = 2, 

                                                                    verbose = TRUE),

                                          trConrol = trainControl(method = "cv", 

                                                                  classProbs  = TRUE, 

                                                                  verboseIter = TRUE)))



Fold01 1 0.9596575 (1)

Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *

Fold02 1 0.9598146 (1)

Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *

Fold03 1 0.9502661 (1)

I ran the above code overnight (8 to 10 hours) but stopped it running because it took too long to finish. A very rough estimate of run time would be at least 24 hours.

> library(doParallel)

> cl <- makePSOCKcluster(detectCores() - 1)

> registerDoParallel(cl)



> set.seed(1)

> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],

                                          iters   = 2, 

                                          popSize = 20, 

                                          method  = "xgbTree", 

                                          metric  = "Accuracy",

                                          gafsControl = gafsControl(functions = caretGA, 

                                                                    method    = "cv", 

                                                                    number    = 5, 

                                                                    verbose   = TRUE, 

                                                                    allowParallel = TRUE, 

                                                                    genParallel   = TRUE),

                                          trConrol = trainControl(method      = "cv", 

                                                                  number      = 5, 

                                                                  classProbs  = TRUE, 

                                                                  verboseIter = TRUE)))



 final GA

 1 0.9508099 (4)

 2 0.9508099->0.9561501 (4->1, 25.0%) *

 final model

> st.09

   user   system  elapsed

   3.536    0.173 4152.988

My system has 4 cores but as specified it is using only 3, and I verified that it was running 3 R processes.

The gafsControl() documentation describes allowParallel and genParallel like so:

allowParallel: if a parallel backend is loaded and available,
should the function use it?

genParallel: if a parallel backend is loaded and available, should
'gafs' use it tp parallelize the fitness calculations within
a generation within a resample?

I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:

> results.09



Genetic Algorithm Feature Selection



150 samples

4 predictors

3 classes: 'setosa', 'versicolor', 'virginica'



Maximum generations: 2

Population per generation: 20

Crossover probability: 0.8

Mutation probability: 0.1

Elitism: 0



Internal performance values: Accuracy, Kappa

Subset selection driven to maximize internal Accuracy



External performance values: Accuracy, Kappa

Best iteration chose by maximizing external Accuracy

External resampling method: Cross-Validated (5 fold)



During resampling:

  * the top 4 selected variables (out of a possible 4):

    Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)

  * on average, 1.6 variables were selected (min = 1, max = 4)



In the final search using the entire training set:

   * 4 features selected at iteration 1 including:

     Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

   * external performance at this iteration is



   Accuracy       Kappa

     0.9467      0.9200

edited yesterday

answered yesterday

makeyourownmaker

620521

1

a) This is such long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time for a real-life industrial dataset... I am not expert in classification (I am mostly researching unsupervised methods), so I am really surprised - if not confused... What is going on here?

– Make42
yesterday

b) How long did it take the second version to run?

– Make42
yesterday

b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.

– makeyourownmaker
yesterday

a) I agree, training time is too long. There are two levels of cross-validation though. I'd bet gafs() isn't slow on the xgboost aspect unless it's being called too many times. My guess is that it's slow on the genetic algorithm aspect. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function which would have only one level of cross-validation.

– makeyourownmaker
yesterday

Tree based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R library plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method='xgboost' will be a great deal faster than the gafs() function on the iris data.

– makeyourownmaker
18 hours ago

add a comment |

Here is an example of parallelising the gafs() function using the doParallel package and modifying a few other parameters to make it faster. Where possible I include run times.

First, using the original code without any modifications or parallelisation:

> library(caret)

> data(iris)



> set.seed(1)

> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],

                                          iters  = 2, 

                                          method = "xgbTree", 

                                          metric = "Accuracy",

                                          gafsControl = gafsControl(functions = caretGA, 

                                                                    method  = "cv", 

                                                                    repeats = 2, 

                                                                    verbose = TRUE),

                                          trConrol = trainControl(method = "cv", 

                                                                  classProbs  = TRUE, 

                                                                  verboseIter = TRUE)))



Fold01 1 0.9596575 (1)

Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *

Fold02 1 0.9598146 (1)

Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *

Fold03 1 0.9502661 (1)

I ran the above code overnight (8 to 10 hours) but stopped it running because it took too long to finish. A very rough estimate of run time would be at least 24 hours.

> library(doParallel)

> cl <- makePSOCKcluster(detectCores() - 1)

> registerDoParallel(cl)



> set.seed(1)

> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],

                                          iters   = 2, 

                                          popSize = 20, 

                                          method  = "xgbTree", 

                                          metric  = "Accuracy",

                                          gafsControl = gafsControl(functions = caretGA, 

                                                                    method    = "cv", 

                                                                    number    = 5, 

                                                                    verbose   = TRUE, 

                                                                    allowParallel = TRUE, 

                                                                    genParallel   = TRUE),

                                          trConrol = trainControl(method      = "cv", 

                                                                  number      = 5, 

                                                                  classProbs  = TRUE, 

                                                                  verboseIter = TRUE)))



 final GA

 1 0.9508099 (4)

 2 0.9508099->0.9561501 (4->1, 25.0%) *

 final model

> st.09

   user   system  elapsed

   3.536    0.173 4152.988

My system has 4 cores but as specified it is using only 3, and I verified that it was running 3 R processes.

The gafsControl() documentation describes allowParallel and genParallel like so:

allowParallel: if a parallel backend is loaded and available,
should the function use it?

genParallel: if a parallel backend is loaded and available, should
'gafs' use it tp parallelize the fitness calculations within
a generation within a resample?

I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:

> results.09



Genetic Algorithm Feature Selection



150 samples

4 predictors

3 classes: 'setosa', 'versicolor', 'virginica'



Maximum generations: 2

Population per generation: 20

Crossover probability: 0.8

Mutation probability: 0.1

Elitism: 0



Internal performance values: Accuracy, Kappa

Subset selection driven to maximize internal Accuracy



External performance values: Accuracy, Kappa

Best iteration chose by maximizing external Accuracy

External resampling method: Cross-Validated (5 fold)



During resampling:

  * the top 4 selected variables (out of a possible 4):

    Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)

  * on average, 1.6 variables were selected (min = 1, max = 4)



In the final search using the entire training set:

   * 4 features selected at iteration 1 including:

     Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

   * external performance at this iteration is



   Accuracy       Kappa

     0.9467      0.9200

edited yesterday

answered yesterday

makeyourownmaker

620521

1

a) This is such long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time for a real-life industrial dataset... I am not expert in classification (I am mostly researching unsupervised methods), so I am really surprised - if not confused... What is going on here?

– Make42
yesterday

b) How long did it take the second version to run?

– Make42
yesterday

b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.

– makeyourownmaker
yesterday

a) I agree, training time is too long. There are two levels of cross-validation though. I'd bet gafs() isn't slow on the xgboost aspect unless it's being called too many times. My guess is that it's slow on the genetic algorithm aspect. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function which would have only one level of cross-validation.

– makeyourownmaker
yesterday

Tree based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R library plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method='xgboost' will be a great deal faster than the gafs() function on the iris data.

– makeyourownmaker
18 hours ago

add a comment |

Here is an example of parallelising the gafs() function using the doParallel package and modifying a few other parameters to make it faster. Where possible I include run times.

First, using the original code without any modifications or parallelisation:

> library(caret)

> data(iris)



> set.seed(1)

> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],

                                          iters  = 2, 

                                          method = "xgbTree", 

                                          metric = "Accuracy",

                                          gafsControl = gafsControl(functions = caretGA, 

                                                                    method  = "cv", 

                                                                    repeats = 2, 

                                                                    verbose = TRUE),

                                          trConrol = trainControl(method = "cv", 

                                                                  classProbs  = TRUE, 

                                                                  verboseIter = TRUE)))



Fold01 1 0.9596575 (1)

Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *

Fold02 1 0.9598146 (1)

Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *

Fold03 1 0.9502661 (1)

I ran the above code overnight (8 to 10 hours) but stopped it running because it took too long to finish. A very rough estimate of run time would be at least 24 hours.

> library(doParallel)

> cl <- makePSOCKcluster(detectCores() - 1)

> registerDoParallel(cl)



> set.seed(1)

> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],

                                          iters   = 2, 

                                          popSize = 20, 

                                          method  = "xgbTree", 

                                          metric  = "Accuracy",

                                          gafsControl = gafsControl(functions = caretGA, 

                                                                    method    = "cv", 

                                                                    number    = 5, 

                                                                    verbose   = TRUE, 

                                                                    allowParallel = TRUE, 

                                                                    genParallel   = TRUE),

                                          trConrol = trainControl(method      = "cv", 

                                                                  number      = 5, 

                                                                  classProbs  = TRUE, 

                                                                  verboseIter = TRUE)))



 final GA

 1 0.9508099 (4)

 2 0.9508099->0.9561501 (4->1, 25.0%) *

 final model

> st.09

   user   system  elapsed

   3.536    0.173 4152.988

My system has 4 cores but as specified it is using only 3, and I verified that it was running 3 R processes.

The gafsControl() documentation describes allowParallel and genParallel like so:

allowParallel: if a parallel backend is loaded and available,
should the function use it?

genParallel: if a parallel backend is loaded and available, should
'gafs' use it tp parallelize the fitness calculations within
a generation within a resample?

I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:

> results.09



Genetic Algorithm Feature Selection



150 samples

4 predictors

3 classes: 'setosa', 'versicolor', 'virginica'



Maximum generations: 2

Population per generation: 20

Crossover probability: 0.8

Mutation probability: 0.1

Elitism: 0



Internal performance values: Accuracy, Kappa

Subset selection driven to maximize internal Accuracy



External performance values: Accuracy, Kappa

Best iteration chose by maximizing external Accuracy

External resampling method: Cross-Validated (5 fold)



During resampling:

  * the top 4 selected variables (out of a possible 4):

    Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)

  * on average, 1.6 variables were selected (min = 1, max = 4)



In the final search using the entire training set:

   * 4 features selected at iteration 1 including:

     Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

   * external performance at this iteration is



   Accuracy       Kappa

     0.9467      0.9200

edited yesterday

answered yesterday

makeyourownmaker

620521

Here is an example of parallelising the gafs() function using the doParallel package and modifying a few other parameters to make it faster. Where possible I include run times.

First, using the original code without any modifications or parallelisation:

> library(caret)
> data(iris)

> set.seed(1)
> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],
                                          iters  = 2,
                                          method = "xgbTree",
                                          metric = "Accuracy",
                                          gafsControl = gafsControl(functions = caretGA,
                                                                    method  = "cv",
                                                                    repeats = 2,
                                                                    verbose = TRUE),
                                          trConrol = trainControl(method = "cv",
                                                                  classProbs  = TRUE,
                                                                  verboseIter = TRUE)))

Fold01 1 0.9596575 (1)
Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *
Fold02 1 0.9598146 (1)
Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *
Fold03 1 0.9502661 (1)

I ran the above code overnight (8 to 10 hours) but stopped it because it had still not finished. As a very rough estimate, the full run would take at least 24 hours.
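One way to see why the run is so slow is to count the model evaluations implied by the nested resampling. The sketch below is only a rough upper bound, and every number passed in is an assumption rather than something measured here: 10 external folds and popSize = 50 are what I believe the gafsControl() and gafs() defaults to be, 25 bootstrap resamples is what train() would use if the misspelled trConrol argument is not picked up as trControl, and 108 is my recollection of the size of the default xgbTree tuning grid. count_fits() is a hypothetical helper, not part of caret.

```r
# Hypothetical helper: rough upper bound on the number of tuning-candidate
# evaluations in the external resampling loop of gafs().
count_fits <- function(external_folds, iters, pop_size,
                       internal_resamples, grid_size) {
  subsets    <- external_folds * iters * pop_size     # feature subsets scored
  per_subset <- internal_resamples * grid_size + 1    # +1 for the final refit
  subsets * per_subset
}

# Assumed settings for the run above (all values are assumptions)
original <- count_fits(external_folds = 10, iters = 2, pop_size = 50,
                       internal_resamples = 25, grid_size = 108)

# Reduced settings used in the next run: 5 external folds and popSize = 20,
# with the same assumed inner settings
reduced <- count_fits(external_folds = 5, iters = 2, pop_size = 20,
                      internal_resamples = 25, grid_size = 108)

format(original, big.mark = ",")  # "2,701,000"
format(reduced,  big.mark = ",")  # "540,200"
original / reduced                # 5
```

Under these assumptions, lowering popSize and the number of external folds cuts the work by a factor of about five before any parallelisation, and the inner resampling and tuning grid dominate the cost per feature subset.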

Next, with parallelisation via doParallel, a smaller population (popSize = 20) and 5-fold cross-validation:

> library(doParallel)
> cl <- makePSOCKcluster(detectCores() - 1)
> registerDoParallel(cl)

> set.seed(1)
> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],
                                          iters   = 2,
                                          popSize = 20,
                                          method  = "xgbTree",
                                          metric  = "Accuracy",
                                          gafsControl = gafsControl(functions = caretGA,
                                                                    method    = "cv",
                                                                    number    = 5,
                                                                    verbose   = TRUE,
                                                                    allowParallel = TRUE,
                                                                    genParallel   = TRUE),
                                          trConrol = trainControl(method      = "cv",
                                                                  number      = 5,
                                                                  classProbs  = TRUE,
                                                                  verboseIter = TRUE)))

 final GA
 1 0.9508099 (4)
 2 0.9508099->0.9561501 (4->1, 25.0%) *
 final model

> st.09
    user   system  elapsed
   3.536    0.173 4152.988
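For reading such timings: the object returned by system.time() is a named numeric vector, and elapsed is the wall-clock time in seconds. A small base-R sketch (the Sys.sleep() call is just a stand-in for a long-running gafs() call):

```r
# Stand-in for a long-running call; any expression can be timed this way
st <- system.time(Sys.sleep(0.2))

# "elapsed" is wall-clock seconds; user and system are CPU seconds
elapsed_sec <- st[["elapsed"]]
elapsed_sec / 60          # the same duration in minutes

# The elapsed time reported above, converted to minutes
round(4152.988 / 60, 1)   # 69.2
```

The small user and system values next to a large elapsed value are presumably because the CPU time was spent in the worker processes rather than in the master R session.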

My system has 4 cores but as specified it is using only 3, and I verified that it was running 3 R processes.
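A slightly more defensive way to choose the number of workers, as a sketch: detectCores() can return NA, and detectCores() - 1 is 0 on a single-core machine. pick_workers() is a hypothetical helper name; the parallel package ships with R.

```r
library(parallel)

# Hypothetical helper: leave one core free, but always return at least 1
pick_workers <- function(n_cores = detectCores()) {
  if (is.na(n_cores)) return(1L)
  max(1L, as.integer(n_cores) - 1L)
}

pick_workers(4L)   # 3, as on the 4-core machine described above
pick_workers(1L)   # 1
pick_workers(NA)   # 1

# Usage with the code above (not run here):
# cl <- makePSOCKcluster(pick_workers())
# registerDoParallel(cl)
# ... gafs() call ...
# stopCluster(cl)   # release the worker processes when finished
```

Calling stopCluster(cl) once the search has finished avoids leaving idle R worker processes behind.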

The gafsControl() documentation describes allowParallel and genParallel like so:

allowParallel: if a parallel backend is loaded and available, should the function use it?

genParallel: if a parallel backend is loaded and available, should 'gafs' use it to parallelize the fitness calculations within a generation within a resample?

I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:

> results.09

Genetic Algorithm Feature Selection

150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'

Maximum generations: 2
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0

Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy

External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (5 fold)

During resampling:
  * the top 4 selected variables (out of a possible 4):
    Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)
  * on average, 1.6 variables were selected (min = 1, max = 4)

In the final search using the entire training set:
   * 4 features selected at iteration 1 including:
     Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
   * external performance at this iteration is

   Accuracy       Kappa
     0.9467      0.9200

edited yesterday

answered yesterday

makeyourownmaker

620521

a) This is such a long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time on a real-life industrial dataset? I am not an expert in classification (I mostly research unsupervised methods), so I am really surprised, if not confused. What is going on here?

– Make42
yesterday

b) How long did it take the second version to run?

– Make42
yesterday

b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.

– makeyourownmaker
yesterday

a) I agree, training time is too long. There are two levels of cross-validation though. I'd bet gafs() isn't slow on the xgboost aspect unless it's being called too many times. My guess is that it's slow on the genetic algorithm aspect. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function which would have only one level of cross-validation.

– makeyourownmaker
yesterday

Tree-based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R package plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method = "xgbTree" will be a great deal faster than the gafs() function on the iris data.

– makeyourownmaker
18 hours ago
