Extract most important keywords from a set of documents
I have a set of 3000 text documents and I want to extract the top 300 keywords (each could be a single word or multiple words).
I have tried the approaches below:
RAKE: It is a Python-based keyword extraction library, and it failed miserably.
Tf-Idf: It has given me good keywords per document, but I am not able to aggregate them and find keywords that represent the whole group of documents.
Also, just selecting the top k words from each document based on Tf-Idf score won't help, right?
Word2vec: I was able to do some cool stuff like finding similar words, but I am not sure how to find important keywords using it.
Can you please suggest a good approach (or elaborate on how to improve any of the above 3) to solve this problem? Thanks :)
nlp rake feature-extraction word2vec tf-idf
asked Aug 24 '17 at 12:07
Vijender
3 Answers
It would be better for you to choose those 300 words manually (it's not that many and you only do it once). Code written in Python 3:

    import os

    # Only regular files; directories cannot be opened for reading
    files = [f for f in os.listdir() if os.path.isfile(f)]
    topWords = ["word1", "word2.... etc"]
    wordsCount = 0
    for file in files:
        # Read the whole file; the with-block closes it afterwards
        with open(file, "r") as file_opened:
            text = file_opened.read()
        for word in topWords:
            # Substring test, so multi-word keywords match too
            if word in text and wordsCount < 300:
                print("I found %s" % word)
                wordsCount += 1
        # Check wordsCount again to leave the outer loop
        if wordsCount >= 300:
            break
answered Aug 24 '17 at 12:21
durduliu2009
The easiest and most effective way is to apply a tf-idf implementation to find the most important words. If you have a stop-word list, you can filter out the stop words before applying this code. Hope this works for you.

    import java.util.List;

    /**
     * Class to calculate TfIdf of term.
     * @author Mubin Shrestha
     */
    public class TfIdf {
        /**
         * Calculates the tf of term termToCheck
         * @param totalterms : Array of all the words under processing document
         * @param termToCheck : term of which tf is to be calculated.
         * @return tf(term frequency) of term termToCheck
         */
        public double tfCalculator(String[] totalterms, String termToCheck) {
            double count = 0; //to count the overall occurrence of the term termToCheck
            for (String s : totalterms) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                }
            }
            return count / totalterms.length;
        }

        /**
         * Calculates idf of term termToCheck
         * @param allTerms : all the terms of all the documents (one String[] per document)
         * @param termToCheck : term of which idf is to be calculated.
         * @return idf(inverse document frequency) score
         */
        public double idfCalculator(List<String[]> allTerms, String termToCheck) {
            double count = 0; //number of documents that contain termToCheck
            for (String[] ss : allTerms) {
                for (String s : ss) {
                    if (s.equalsIgnoreCase(termToCheck)) {
                        count++;
                        break;
                    }
                }
            }
            return 1 + Math.log(allTerms.size() / count);
        }
    }
answered Aug 25 '17 at 18:00
shiv
Thanks @shiv. But I have already implemented Tf-Idf and I did it with Lucene (for faster processing). The problem is Tf-Idf gives you "important terms" per document and not over the whole set of documents.
– Vijender
Aug 28 '17 at 5:03
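One possible way to get from per-document scores to a corpus-level list is to add up each term's Tf-Idf score over all documents and rank terms by the total. The sketch below is only an illustration under several assumptions: whitespace tokenization, single-word terms only, the same `1 + log(N / df)` idf as in the answer above, and summation as the aggregation rule (averaging or taking the maximum are alternatives). The function name `corpus_keywords` and its parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def corpus_keywords(documents, top_k=300, stop_words=frozenset()):
    """Rank terms by their Tf-Idf score summed over all documents."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer.
    tokenized = [
        [w for w in doc.lower().split() if w not in stop_words]
        for doc in documents
    ]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Sum tf * idf of every term over all documents.
    scores = defaultdict(float)
    for tokens in tokenized:
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = 1 + math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf

    # Highest total first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

With this idf variant a term that appears in every document still gets an idf of 1, so very common words can rank high unless they are passed in `stop_words`.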
    import os
    import operator
    from collections import defaultdict

    # Only regular files; directories cannot be opened for reading
    files = [f for f in os.listdir() if os.path.isfile(f)]
    words = defaultdict(int)  # word -> number of occurrences across all files
    for file in files:
        with open(file, "r") as open_file:
            for line in open_file:
                for word in line.split():
                    words[word] += 1
    # Sort by count, most frequent first
    sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the top 300 from sorted_words (sorted_words[:300]); they are the words you want.
answered Aug 24 '17 at 13:13
Awaish Kumar
Thanks @Awaish, but I have tried this also. The results were very poor with this approach because the important terms only appear once or twice. If I try to sort and select Tf-idf terms based on frequency, a lot of common and irrelevant terms come up.
– Vijender
Aug 28 '17 at 5:07
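The problem described in this comment (raw frequency pushes common terms to the top) can be reduced by ignoring terms that occur in too many documents before counting. The following is a rough sketch of that idea, not a tested solution for this corpus. The function name `filtered_term_counts` and the `max_doc_ratio` threshold of 0.5 are assumptions chosen for illustration, and tokenization is again a plain whitespace split.

```python
from collections import Counter


def filtered_term_counts(documents, max_doc_ratio=0.5):
    """Count terms corpus-wide, skipping terms found in too many documents."""
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Terms present in more than this many documents are treated as too common.
    limit = max_doc_ratio * len(tokenized)

    counts = Counter()
    for tokens in tokenized:
        counts.update(t for t in tokens if doc_freq[t] <= limit)
    return counts
```

Calling `filtered_term_counts(docs).most_common(300)` would then give the 300 most frequent terms among those that are not spread over more than half of the documents. The threshold needs tuning per corpus.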