unbalanced dataset, anomalies have same distribution as normal data
I am working with a dataset that contains 2 classes (95% and 5%).
The features of these 2 classes have almost the same distribution.
The question is: how can I classify these 2 classes and explain which principle the model uses to classify the test set?
python data-science anomaly-detection
asked Jan 20 at 15:09 by Xuanqi Huang
edited Jan 22 at 2:23 by thlpswm
1 Answer
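It is plausible for the individual features to have similar distributions in both classes, but you need a more detailed exploratory analysis than looking at each feature's distribution on its own. Two classes can have nearly identical per-feature distributions and still differ in how the features combine. I suggest looking at some 3D plots. Here are some links about EDA: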
https://www.kaggle.com/dejavu23/titanic-eda-to-ml-beginner
https://www.kaggle.com/dejavu23/house-prices-eda-to-ml-beginner
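As a rough illustration of why per-feature distributions alone can be misleading, here is a minimal sketch on synthetic data (the data, the class sizes, and the helper `summarize` are assumptions made for the example, not taken from the question). Both classes have the same mean and spread in every feature; only the correlation between the two features differs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 95% "normal" rows, 5% "anomalous" rows.
n_normal, n_anomaly = 9500, 500

# Each feature is standard normal in both classes, so per-feature histograms
# look the same. Only the sign of the correlation differs between classes.
normal = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n_normal)
anomaly = rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], size=n_anomaly)


def summarize(name, x):
    """Print per-feature mean/std and return the correlation of the two features."""
    corr = np.corrcoef(x, rowvar=False)[0, 1]
    print(f"{name}: mean={x.mean(axis=0).round(2)}, "
          f"std={x.std(axis=0).round(2)}, corr(f0, f1)={corr:.2f}")
    return corr


corr_normal = summarize("normal ", normal)
corr_anomaly = summarize("anomaly", anomaly)
```

On real data the same idea applies to per-class correlation matrices, pairwise scatter plots, or 3D scatter plots coloured by class.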
As for classification models, I would suggest Decision Tree based models, such as Random Forest or Gradient Tree Boosting.
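A minimal sketch of this suggestion with scikit-learn's `RandomForestClassifier` might look like the following. The data comes from `make_classification` as a hypothetical stand-in for the real dataset, and the use of `class_weight="balanced"` and average precision as the metric are my assumptions for handling the 95% / 5% split, not something stated in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Hypothetical data with roughly the same 95% / 5% class split as in the question.
X, y = make_classification(
    n_samples=4000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

# Stratify so that train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# Accuracy is misleading at 95/5, so score how well the minority class is ranked.
scores = model.predict_proba(X_test)[:, 1]
ap = average_precision_score(y_test, scores)
print(f"average precision: {ap:.3f} (minority share: {y_test.mean():.3f})")

# Impurity-based importances give a first, rough view of what the model relies on.
for i, importance in enumerate(model.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```

A model that ranks at random would score an average precision close to the minority share, so that number is a more useful baseline here than 95% accuracy.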
The idea behind a Decision Tree is to partition the feature space and make the same prediction for every point that falls into a given part. You can plot Decision Trees using some packages, which will help you understand the principles behind the model. You can read more about all these models in this nice book:
http://www-bcf.usc.edu/~gareth/ISL/
Links to packages:
https://lightgbm.readthedocs.io/en/latest/
https://scikit-learn.org/stable/modules/tree.html
https://scikit-learn.org/stable/modules/ensemble.html
You can read about decision tree visualization:
https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176
https://www.kaggle.com/willkoehrsen/visualize-a-decision-tree-w-python-scikit-learn
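For a quick look at the rules a tree has learned without any plotting dependencies, scikit-learn's `export_text` prints the splits as nested conditions. This is only a sketch: the data is synthetic, the feature names `f0`..`f3` are placeholders, and `max_depth=3` is an arbitrary choice to keep the output readable.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

# A shallow tree keeps the partition of the feature space small enough to read.
tree = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=0)
tree.fit(X, y)

# Each printed branch is one region of feature space with a single prediction.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each path from the root to a leaf is one region of the feature space, and the `class:` line at the end of the path is the prediction made for every point in that region. `sklearn.tree.plot_tree` draws the same structure as a figure if matplotlib is available.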
answered Jan 22 at 6:38 by Razmik Melikbekyan