Make caret's genetic feature selection faster
I am trying to optimize an xgboost tree by using feature selection with caret's genetic algorithm:
results <- gafs(iris[,1:4], iris[,5],
                iters = 2,
                method = "xgbTree",
                metric = "Accuracy",
                gafsControl = gafsControl(functions = caretGA, method = "cv", repeats = 2, verbose = TRUE),
                trControl = trainControl(method = "cv", classProbs = TRUE, verboseIter = TRUE)
)
However, this is very slow, even though I am only using iters = 2 instead of the more appropriate iters = 200. What can I do to make this faster?
Tags: r, genetic-algorithm, r-caret
Parallelisation, reduced popSize, repeats = 1, ... – Julius Vainora, Jan 18 at 14:40

@JuliusVainora: What popSize should I use and how do I parallelise? – Make42, Jan 19 at 21:47

Not sure what the recommendations for popSize are. The end of the "Details" section in ?gafs covers parallelisation. – Julius Vainora, Jan 19 at 21:54
asked Jan 18 at 14:14 by Make42
1 Answer
Here is an example of parallelising the gafs() function using the doParallel package, along with a few other parameter changes to make it faster. Where possible I include run times.

The original code uses cross-validation (method = "cv"), not repeated cross-validation (method = "repeatedcv"), so I believe the repeats = 2 parameter is ignored. I didn't include that option in the parallelised example.

First, the original code without any modifications or parallelisation:
> library(caret)
> data(iris)
> set.seed(1)
> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],
iters = 2,
method = "xgbTree",
metric = "Accuracy",
gafsControl = gafsControl(functions = caretGA,
method = "cv",
repeats = 2,
verbose = TRUE),
trControl = trainControl(method = "cv",
classProbs = TRUE,
verboseIter = TRUE)))
Fold01 1 0.9596575 (1)
Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *
Fold02 1 0.9598146 (1)
Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *
Fold03 1 0.9502661 (1)
I ran the above code overnight (8 to 10 hours) but stopped it because it was taking too long to finish. A very rough estimate of the total run time would be at least 24 hours.
Second, with a reduced popSize (from 50 to 20), the allowParallel and genParallel options added to gafsControl(), and finally the number of folds reduced (from 10 to 5) in both gafsControl() and trainControl():
> library(doParallel)
> cl <- makePSOCKcluster(detectCores() - 1)
> registerDoParallel(cl)
> set.seed(1)
> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],
iters = 2,
popSize = 20,
method = "xgbTree",
metric = "Accuracy",
gafsControl = gafsControl(functions = caretGA,
method = "cv",
number = 5,
verbose = TRUE,
allowParallel = TRUE,
genParallel = TRUE),
trControl = trainControl(method = "cv",
number = 5,
classProbs = TRUE,
verboseIter = TRUE)))
final GA
1 0.9508099 (4)
2 0.9508099->0.9561501 (4->1, 25.0%) *
final model
> st.09
user system elapsed
3.536 0.173 4152.988
My system has 4 cores but as specified it is using only 3, and I verified that it was running 3 R processes.
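One housekeeping detail the run above omits: the PSOCK cluster stays alive after gafs() returns, so the worker processes linger. A minimal cleanup sketch (my addition, assuming the cl object created earlier):

```r
library(doParallel)

cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)

# ... run gafs(...) as above ...

# Shut the workers down and restore the sequential foreach backend:
stopCluster(cl)
registerDoSEQ()
```

Without registerDoSEQ(), later %dopar% calls may warn about the stale backend.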
The gafsControl() documentation describes allowParallel and genParallel like so:

allowParallel: if a parallel backend is loaded and available, should the function use it?

genParallel: if a parallel backend is loaded and available, should 'gafs' use it to parallelize the fitness calculations within a generation within a resample?
The caret documentation suggests the allowParallel option will give a bigger run time improvement than the genParallel option:
https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html
I would expect at least slightly different results from the parallelised code compared to the original code. Here are the results from the parallelised code:
> results.09
Genetic Algorithm Feature Selection
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
Maximum generations: 2
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0
Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy
External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (5 fold)
During resampling:
* the top 4 selected variables (out of a possible 4):
Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)
* on average, 1.6 variables were selected (min = 1, max = 4)
In the final search using the entire training set:
* 4 features selected at iteration 1 including:
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
* external performance at this iteration is
Accuracy Kappa
0.9467 0.9200
a) This is such a long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time for a real-life industrial dataset... I am not an expert in classification (I mostly research unsupervised methods), so I am really surprised, if not confused... What is going on here?
– Make42
yesterday
b) How long did it take the second version to run?
– Make42
yesterday
b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.
– makeyourownmaker
yesterday
a) I agree, training time is too long. There are two levels of cross-validation though. I'd bet gafs() isn't slow on the xgboost aspect unless it's being called too many times. My guess is that it's slow on the genetic algorithm aspect. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function which would have only one level of cross-validation.
– makeyourownmaker
yesterday
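For reference, the glmnet route mentioned in the comment above might look roughly like this (a sketch, not part of the original answer; it assumes the glmnet package is installed). The lasso/elastic-net penalty shrinks coefficients of uninformative predictors toward zero, so feature selection falls out of a single train() call with one level of cross-validation:

```r
library(caret)
data(iris)

set.seed(1)
fit <- train(iris[, 1:4], iris[, 5],
             method = "glmnet",      # regularised multinomial regression
             metric = "Accuracy",
             trControl = trainControl(method = "cv", number = 5))

# Predictors whose coefficients are shrunk to zero are effectively dropped;
# varImp() ranks the survivors:
varImp(fit)
```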
Tree based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R library plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method = 'xgbTree' will be a great deal faster than the gafs() function on the iris data.
– makeyourownmaker
18 hours ago
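The importance-based pruning suggested in that last comment could be sketched like so (my sketch, not code from the answer; 'xgbTree' is caret's identifier for xgboost tree models):

```r
library(caret)
data(iris)

set.seed(1)
# A single train() call: one level of cross-validation, no genetic search.
fit <- train(iris[, 1:4], iris[, 5],
             method = "xgbTree",
             metric = "Accuracy",
             trControl = trainControl(method = "cv", number = 5))

# Features that sit consistently at the bottom of this ranking are
# candidates for manual removal:
varImp(fit)
```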
answered yesterday, edited yesterday, by makeyourownmaker
1
a) This is such long training time! The iris dataset is very small. How could anyone hope that xgboost finishes in any reasonable time for a real-life industrial dataset... I am not expert in classification (I am mostly researching unsupervised methods), so I am really surprised - if not confused... What is going on here?
– Make42
yesterday
b) How long did it take the second version to run?
– Make42
yesterday
b) The total run time for the second version was 4,153 seconds (see elapsed time above) and was stored in the st.09 variable. Details for the three numbers in st.09 can be found with ?proc.time in an R session.
– makeyourownmaker
yesterday
a) I agree, training time is too long. There are two levels of cross-validation though. I'd bet gafs() isn't slow on the xgboost aspect unless it's being called too many times. My guess is that it's slow on the genetic algorithm aspect. One alternative feature selection method within the caret package is the 'simulated annealing for feature selection' function, safs(), but again that has two levels of cross-validation. A slightly different option is using a regularisation method for feature selection, like glmnet, with the train() function which would have only one level of cross-validation.
– makeyourownmaker
yesterday
Tree based ensemble methods like xgboost should be fairly robust to irrelevant variables. Variables which cannot discriminate between outcomes will not be selected for tree splitting. The xgb.plot.importance() function in the xgboost R library plots feature importance as a bar graph. Excluding features which are consistently at the bottom of the variable importance chart is worth considering. The caret train() function with method='xgboost' will be a great deal faster than the gafs() function on the iris data.
– makeyourownmaker
18 hours ago
Parallelisation, reduced popSize, repeats = 1, ...
– Julius Vainora
Jan 18 at 14:40
@JuliusVainora: What popSize should I use and how do I parallelise?
– Make42
Jan 19 at 21:47
Not sure what the recommendations for popSize are. The end of the "Details" section in ?gafs is on parallelisation.
– Julius Vainora
Jan 19 at 21:54