R h2o model sizes on disk

I am using the h2o package to train a GBM for a churn prediction problem.



All I want to know is what influences the size of the fitted model saved to disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.



More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are also comparable.



Naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produce such a difference in the model sizes.



Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.
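For reference, the per-window workflow looks roughly like this (a minimal sketch; the file name, target column and grid values below are placeholders, not my real setup):

    library(h2o)
    h2o.init()

    # One of the 3 rolling windows (placeholder file and target column)
    train <- h2o.importFile("window_1.csv")
    y <- "churn"
    x <- setdiff(names(train), y)

    grid <- h2o.grid(
      algorithm = "gbm",
      grid_id = "gbm_grid_w1",
      x = x, y = y, training_frame = train,
      hyper_params = list(max_depth = c(5, 10, 20),
                          min_rows  = c(1, 16, 64),
                          ntrees    = c(100, 300))
    )

    # Keep the best model by AUC and save it to disk
    best <- h2o.getModel(h2o.getGrid("gbm_grid_w1", sort_by = "auc",
                                     decreasing = TRUE)@model_ids[[1]])
    path <- h2o.saveModel(best, path = "models", force = TRUE)
    file.size(path) / 1e6   # size on disk, in MB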



Any help is appreciated!
Thank you.



PS: I'm sorry, but I cannot share any dataset to make this reproducible (due to privacy restrictions).

Tags: r, h2o

asked Jan 18 at 14:15 by davide

2 Answers

If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.

From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
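For example (a sketch; m is assumed to be one of your fitted GBMs, and the model_summary fields listed are what H2O reports for tree-based models):

    m <- h2o.getModel("gbm_grid_w1_model_1")   # hypothetical model id

    print(m)               # model summary, metrics, variable importances
    m@model$model_summary  # number_of_trees, mean_depth, mean_leaves,
                           #   model_size_in_bytes, ...
    str(m, max.level = 3)  # everything that is held in the object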



I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, so only a few fields are needed to define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as the algorithm tries to split it apart into decision trees.

Looking into that third window more deeply might suggest some data engineering you could do that would make it easier to learn. Or it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.

Finally, maybe the grid hyperparameters are unimportant as regards performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor, and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (while the 2nd-best model was 0.01% worse but went to depth 20), yet go to depth 30 for your third data set (where the 2nd-best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, since you then trained all three data sets on the same hyperparameters? But I thought I'd include it anyway.)

answered Jan 19 at 11:47 by Darren Cook

• After comparing the parameters of the three models, the most important difference seems to be the min_rows parameter: it is 16 for the first two models, and 1 for the last one. Moreover, the last model has a much higher mean_leaves, around 7 times the value of the other models (this should be a consequence of the min_rows parameter, right?). max_depth is quite similar across the models, i.e. 19, 17, 17, whereas ntrees is 160, 270, 150. I'll check the grid searches' leaderboards to assess whether this difference is due to noise, since the three training sets have the same structure.

  – davide, yesterday

• @davide A small value for min_rows would definitely explain a much larger model. I'd be very wary of using min_rows of 1, because of the risk of over-fitting, unless you have no noise in your data. See also docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/…

  – Darren Cook, yesterday














It's the two things you would expect: the number of trees and the depth.

But it also depends on your data: for GBM, individual trees can be cut short, depending on what the data supports.

What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:




• http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
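A sketch of that, assuming m is one of your fitted models (PrintMojo ships with h2o-genmodel; the last step needs Graphviz's dot installed):

    # Export the model as a MOJO, together with h2o-genmodel.jar
    mojo_file <- h2o.download_mojo(m, path = "mojo", get_genmodel_jar = TRUE)

    # Render tree 0 of the MOJO as Graphviz, then as a PNG
    system(paste("java -cp mojo/h2o-genmodel.jar hex.genmodel.tools.PrintMojo",
                 "--tree 0 -i", file.path("mojo", mojo_file), "-o tree0.gv"))
    system("dot -Tpng tree0.gv -o tree0.png")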


Note that the 60 MB range does not seem overly large, in general.

answered Jan 18 at 15:05, edited Jan 18 at 15:42, by Tom Kraljevic

• I'll take a close look at those two parameters; maybe their differences explain the different model sizes. However, I'm still trying to understand why the three models can differ so much in size. Thank you, Tom.

  – davide, Jan 18 at 16:27