R h2o model sizes on disk
I am using the h2o package to train a GBM for a churn prediction problem.

All I want to know is what influences the size of the fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.

More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are comparable as well.

Naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produce such a difference in the model sizes.

Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.

Any help is appreciated! Thank you.

PS: I'm sorry, but I cannot share any dataset to make this reproducible (due to privacy restrictions).
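For reference, a minimal sketch of the kind of setup I mean (the grid values, column names and frame names are placeholders, not my actual data):

library(h2o)
h2o.init()

# The same hypothetical grid is used for every window.
hyper_params <- list(
  max_depth  = c(5, 10, 20),
  min_rows   = c(1, 4, 16),
  learn_rate = c(0.01, 0.1)
)

# train_w1, train_w2, train_w3: the three non-overlapping rolling windows,
# already loaded as H2OFrames; `predictors` and "churn" are placeholders.
windows <- list(train_w1, train_w2, train_w3)
for (i in seq_along(windows)) {
  h2o.grid(
    algorithm      = "gbm",
    grid_id        = paste0("gbm_grid_w", i),
    x              = predictors,
    y              = "churn",
    training_frame = windows[[i]],
    hyper_params   = hyper_params
  )
}

# Save the best model of one window and check its size on disk.
grid  <- h2o.getGrid("gbm_grid_w1", sort_by = "auc", decreasing = TRUE)
best  <- h2o.getModel(grid@model_ids[[1]])
saved <- h2o.saveModel(best, path = "models", force = TRUE)
file.size(saved)  # bytes on disk -- this is the number that varies so much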
Tags: r, h2o
asked Jan 18 at 14:15 – davide
2 Answers
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.

From R, if m is your model, just printing it gives you most of that information; str(m) gives you all the information that is held.
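For instance, a sketch (assuming m is one of the fitted GBMs):

print(m)                # most of it: number of trees, depths, leaves, metrics
m@model$model_summary   # just the summary table; in recent versions it also
                        # includes a model_size_in_bytes column, which is
                        # exactly the quantity being asked about
str(m, max.level = 2)   # everything that is held (max.level keeps it readable)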
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, and only a few fields can define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as it tries to split it apart into decision trees.

Looking into that third window more deeply might suggest some data engineering you could do that would make it easier to learn. Or it might be a difference in your data: e.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.

Finally, maybe the grid hyperparameters are unimportant as regards performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (while the 2nd-best model was 0.01% worse but went to depth 20), but to depth 30 for your third data set (while the 2nd-best model was 0.01% worse but only went to depth 5).

(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it anyway.)

answered Jan 19 at 11:47 – Darren Cook
After comparing the parameters of the three models, the most important difference seems to be the min_rows parameter: it is 16 for the first two models and 1 for the last one. Moreover, the last model has a much higher number of mean_leaves, around 7 times the value of the other models (this should be a consequence of the min_rows parameter, right?). The max_depth parameter is quite similar across the models, i.e. 19, 17, 17, whereas ntrees is 160, 270, 150. I'll check the grid searches' leaderboards to assess whether this difference is due to noise, since the three training sets have the same structure. – davide, yesterday
@davide A small value for min_rows would definitely explain a much larger model. I'd be very wary of using min_rows of 1, because of the risk of over-fitting, unless you have no noise in your data. See also docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/… – Darren Cook, yesterday
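A quick way to put those numbers side by side (a sketch; m1, m2, m3 stand for the three fitted models, and the grid id is a placeholder):

# Collect the size-relevant settings and summary stats of each model.
compare_gbms <- function(models) {
  do.call(rbind, lapply(models, function(m) {
    s <- m@model$model_summary
    data.frame(
      min_rows    = m@allparameters$min_rows,
      ntrees      = s$number_of_trees,
      max_depth   = s$max_depth,
      mean_leaves = s$mean_leaves,
      size_bytes  = s$model_size_in_bytes
    )
  }))
}
compare_gbms(list(w1 = m1, w2 = m2, w3 = m3))

# The sorted grid table shows whether near-equivalent models picked very
# different min_rows values, i.e. whether the choice looks noise-driven:
h2o.getGrid("gbm_grid_w1", sort_by = "auc", decreasing = TRUE)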
It’s the two things you would expect: the number of trees and the depth.
But it also depends on your data. For GBM, the trees can be cut short depending on the data.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
- http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
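For example, a sketch (paths are placeholders; the PrintMojo step follows the genmodel documentation linked above):

# Export the model as a MOJO, together with the h2o-genmodel.jar helper.
mojo_file <- h2o.download_mojo(m, path = "mojo_out", get_genmodel_jar = TRUE)

# Then, from a shell, render a single tree as Graphviz and convert to PNG:
#   java -cp mojo_out/h2o-genmodel.jar hex.genmodel.tools.PrintMojo \
#        --tree 0 -i mojo_out/<mojo_file> -o tree0.gv
#   dot -Tpng tree0.gv -o tree0.png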
Note that the 60 MB range does not seem overly large, in general.

edited Jan 18 at 15:42, answered Jan 18 at 15:05 – Tom Kraljevic
I'll take a close look at those two parameters; maybe their differences explain the different model sizes. However, I'm trying to understand why the three models can differ so much in their sizes. Thank you Tom. – davide, Jan 18 at 16:27