unbalanced dataset, anomalies have same distribution as normal data



























I am working with a dataset that contains two classes, split 95% / 5%.

The features of the two classes have almost the same distribution.

My question: how can I classify these two classes, and how can I explain which principle the model uses to classify the test set?




















      python data-science anomaly-detection














      edited Jan 22 at 2:23 by thlpswm
      asked Jan 20 at 15:09 by Xuanqi Huang
























          1 Answer














          The per-feature distributions are a reasonable starting point, but you need a more detailed exploratory data analysis (EDA) than single-feature distributions. Two classes can look the same feature by feature and still differ in how the features combine, so I suggest looking at some 3D plots. Here are some links about EDA:



          https://www.kaggle.com/dejavu23/titanic-eda-to-ml-beginner



          https://www.kaggle.com/dejavu23/house-prices-eda-to-ml-beginner
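
To illustrate why single-feature distributions can hide a difference between classes, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the two uniform features, the diagonal-band labelling rule, and the helper `hist_gap` are hypothetical and not taken from the question's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: two features, each uniform on [-1, 1).
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-1.0, 1.0, n)

# Assumed labelling rule (illustration only): a point is an anomaly when it
# lies in a thin diagonal band with wrap-around, about 5% of the square.
band = np.mod(x1 - x2, 2.0)
y = (band < 0.1).astype(int)


def hist_gap(feature, labels, bins=10):
    """Largest gap between the per-class normalised histograms of one feature."""
    edges = np.linspace(-1.0, 1.0, bins + 1)
    h0, _ = np.histogram(feature[labels == 0], bins=edges)
    h1, _ = np.histogram(feature[labels == 1], bins=edges)
    return float(np.max(np.abs(h0 / h0.sum() - h1 / h1.sum())))


anomaly_rate = float(y.mean())
gap_x1 = hist_gap(x1, y)
gap_x2 = hist_gap(x2, y)

print(f"anomaly rate: {anomaly_rate:.3f}")
print(f"max histogram gap for x1: {gap_x1:.4f}")
print(f"max histogram gap for x2: {gap_x2:.4f}")

# The combination of the two features separates the classes completely,
# even though each feature on its own looks the same in both classes.
print("largest band value among anomalies:", band[y == 1].max())
print("smallest band value among normals: ", band[y == 0].min())
```

In this toy setup each feature has essentially the same histogram in both classes, yet a single derived quantity splits them cleanly. Real data will rarely be this tidy, but it shows why plots of feature pairs or triples are worth a look.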



          For classification, I would suggest decision-tree-based models such as Random Forest or Gradient Tree Boosting.
          A decision tree partitions the feature space into regions and makes the same prediction for every point in a region. You can plot decision trees with several packages, which helps in understanding the principles behind the model. You can read more about all of these models in this book:



          http://www-bcf.usc.edu/~gareth/ISL/
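
As a rough sketch of the tree-based approach on a 95% / 5% split, the example below fits a Random Forest with scikit-learn. The synthetic data from `make_classification`, the `class_weight="balanced"` setting, and the choice of average precision as the metric are my assumptions for illustration, not something the question's data confirms.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 95% normal, 5% anomalies.
X, y = make_classification(
    n_samples=4000,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    weights=[0.95, 0.05],
    random_state=0,
)

# Stratify so that both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency,
# one common way to stop the majority class from dominating the fit.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# Plain accuracy is misleading at 95/5, so score the ranking of the rare class.
scores = model.predict_proba(X_test)[:, 1]
ap = average_precision_score(y_test, scores)
baseline = float(y_test.mean())  # average precision of a random ranking

importances = model.feature_importances_
print(f"average precision: {ap:.3f} (random baseline about {baseline:.3f})")
for i, imp in enumerate(importances):
    print(f"feature {i}: importance {imp:.3f}")
```

The feature importances give a first, coarse answer to "which principle does the model use": they show which features the trees split on most, though not the direction or thresholds of those splits.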



          Links to packages:



          https://lightgbm.readthedocs.io/en/latest/



          https://scikit-learn.org/stable/modules/tree.html



          https://scikit-learn.org/stable/modules/ensemble.html



          You can read about decision tree visualization:



          https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176



          https://www.kaggle.com/willkoehrsen/visualize-a-decision-tree-w-python-scikit-learn
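
For a quick look at the rules a tree has learned without any plotting library, scikit-learn's `export_text` prints them as indented text. The sketch below is self-contained and uses made-up data: the two uniform features, the diagonal-band anomaly rule, and the derived feature name `diff_mod2` are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: two uniform features whose per-feature distributions
# are the same for both classes.
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-1.0, 1.0, n)

# Assumed engineered feature that combines the two raw features.
diff_mod2 = np.mod(x1 - x2, 2.0)

# Assumed labelling rule: about 5% of points are anomalies.
y = (diff_mod2 < 0.1).astype(int)

X = np.column_stack([x1, x2, diff_mod2])
feature_names = ["x1", "x2", "diff_mod2"]

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(
    max_depth=2, class_weight="balanced", random_state=0
)
tree.fit(X, y)

rules = export_text(tree, feature_names=feature_names)
print(rules)

train_accuracy = tree.score(X, y)
print(f"training accuracy: {train_accuracy:.4f}")
```

Each printed line is a threshold test on one feature, so the output reads as the partition of feature space the tree uses. scikit-learn also provides `sklearn.tree.plot_tree` for a graphical rendering, which requires matplotlib.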









































                answered Jan 22 at 6:38 by Razmik Melikbekyan































