R h2o model sizes on disk

I am using the h2o package to train a GBM for a churn prediction problem.



All I want to know is what influences the size of the fitted model saved to disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.



More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are also comparable.



Naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produce such a difference in the model sizes.



Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.
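For reference, the per-window workflow looks roughly like this (a minimal sketch; the file name, target column and grid values below are placeholders, not my real setup):

    library(h2o)
    h2o.init()

    # One of the 3 rolling windows (placeholder file and target column)
    train <- h2o.importFile("window_1.csv")
    y <- "churn"
    x <- setdiff(names(train), y)

    grid <- h2o.grid(
      algorithm = "gbm",
      grid_id = "gbm_grid_w1",
      x = x, y = y, training_frame = train,
      hyper_params = list(max_depth = c(5, 10, 20),
                          min_rows  = c(1, 16, 64),
                          ntrees    = c(100, 300))
    )

    # Keep the best model by AUC and save it to disk
    best <- h2o.getModel(h2o.getGrid("gbm_grid_w1", sort_by = "auc",
                                     decreasing = TRUE)@model_ids[[1]])
    path <- h2o.saveModel(best, path = "models", force = TRUE)
    file.size(path) / 1e6   # size on disk, in MB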



Any help is appreciated!
Thank you.



PS: I'm sorry, but I cannot share any dataset to make this reproducible (due to privacy restrictions).

Tags: r, h2o

asked Jan 18 at 14:15 by davide

2 Answers

If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.

From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
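For example (a sketch; m is assumed to be one of your fitted GBMs, and the model_summary fields listed are what H2O reports for tree-based models):

    m <- h2o.getModel("gbm_grid_w1_model_1")   # hypothetical model id

    print(m)               # model summary, metrics, variable importances
    m@model$model_summary  # number_of_trees, mean_depth, mean_leaves,
                           #   model_size_in_bytes, ...
    str(m, max.level = 3)  # everything that is held in the object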



I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, so only a few fields are needed to define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as the algorithm tries to split it apart into decision trees.

Looking into that third window more deeply might suggest some data engineering you could do that would make it easier to learn. Or it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.

Finally, maybe the grid hyperparameters are unimportant as regards performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor, and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (while the 2nd-best model was 0.01% worse but went to depth 20), yet go to depth 30 for your third data set (where the 2nd-best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, since you then trained all three data sets on the same hyperparameters? But I thought I'd include it anyway.)

answered Jan 19 at 11:47 by Darren Cook

• After comparing the parameters of the three models, the most important difference seems to be the min_rows parameter: it is 16 for the first two models, and 1 for the last one. Moreover, the last model has a much higher mean_leaves, around 7 times the value of the other models (this should be a consequence of the min_rows parameter, right?). max_depth is quite similar across the models, i.e. 19, 17, 17, whereas ntrees is 160, 270, 150. I'll check the grid searches' leaderboards to assess whether this difference is due to noise, since the three training sets have the same structure.

  – davide, yesterday

• @davide A small value for min_rows would definitely explain a much larger model. I'd be very wary of using min_rows of 1, because of the risk of over-fitting, unless you have no noise in your data. See also docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/…

  – Darren Cook, yesterday














It's the two things you would expect: the number of trees and the depth.

But it also depends on your data: for GBM, individual trees can be cut short, depending on what the data supports.

What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:




• http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
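A sketch of that, assuming m is one of your fitted models (PrintMojo ships with h2o-genmodel; the last step needs Graphviz's dot installed):

    # Export the model as a MOJO, together with h2o-genmodel.jar
    mojo_file <- h2o.download_mojo(m, path = "mojo", get_genmodel_jar = TRUE)

    # Render tree 0 of the MOJO as Graphviz, then as a PNG
    system(paste("java -cp mojo/h2o-genmodel.jar hex.genmodel.tools.PrintMojo",
                 "--tree 0 -i", file.path("mojo", mojo_file), "-o tree0.gv"))
    system("dot -Tpng tree0.gv -o tree0.png")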


Note that the 60 MB range does not seem overly large, in general.

answered Jan 18 at 15:05, edited Jan 18 at 15:42, by Tom Kraljevic

• I'll take a close look at those two parameters; maybe their differences explain the different model sizes. However, I'm still trying to understand why the three models can differ so much in size. Thank you, Tom.

  – davide, Jan 18 at 16:27