Extract most important keywords from a set of documents

I have a set of 3000 text documents and I want to extract the top 300 keywords (each could be a single word or multiple words).

I have tried the below approaches -

RAKE: It is a Python based keyword extraction library and it failed miserably.

Tf-Idf: It has given me good keywords per document, but I am not able to aggregate them and find keywords that represent the whole group of documents.
Also, just selecting the top k words from each document based on Tf-Idf score won't help, right?

Word2vec: I was able to do some cool stuff like find similar words, but I am not sure how to find important keywords using it.

Can you please suggest a good approach (or elaborate on how to improve any of the above 3) to solve this problem? Thanks :)

asked Aug 24 '17 at 12:07

Vijender

801217

nlp rake feature-extraction word2vec tf-idf


3 Answers

It is better for you to choose those 300 words manually (it is not that many, and you only do it once). Code written in Python 3:

import os

files = os.listdir()
topWords = ["word1", "word2.... etc"]
wordsCount = 0

for file in files:
    # Skip sub-directories; only regular files can be read as documents
    if not os.path.isfile(file):
        continue
    with open(file, "r") as file_opened:
        lines = file_opened.read().split("\n")
    for word in topWords:
        # Report the word if it occurs anywhere in any line of the file
        if wordsCount < 300 and any(word in line for line in lines):
            print("I found %s" % word)
            wordsCount += 1
    # Check wordsCount again to stop the outer loop over files
    if wordsCount >= 300:
        break

answered Aug 24 '17 at 12:21

durduliu2009

1079


The easiest and most effective way is to apply a tf-idf implementation to find the most important words. If you have a stop-word list, filter out the stop words before applying this code. Hope this works for you.

import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0;  // counts the overall occurrences of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : all the terms of all the documents, one array per document
     * @param termToCheck : term of which idf is to be calculated.
     * @return idf(inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0;  // counts the documents that contain termToCheck
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break;  // count each document at most once
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
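
For reference, the two methods in this class compute the quantities below. The symbols are my own shorthand, not names from the code: $|d|$ is the number of words in the document, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing the term. `Math.log` is the natural logarithm. The numbers in the worked example are made up purely for illustration, and multiplying the two values is the usual convention rather than something the class does itself.

```latex
\mathrm{tf}(t, d) = \frac{\text{occurrences of } t \text{ in } d}{|d|},
\qquad
\mathrm{idf}(t) = 1 + \ln\frac{N}{\mathrm{df}(t)}

% Hypothetical example: t occurs 3 times in a 100-word document
% and appears in 30 of 3000 documents.
\mathrm{tf} = \frac{3}{100} = 0.03,
\qquad
\mathrm{idf} = 1 + \ln\frac{3000}{30} = 1 + \ln 100 \approx 5.605,
\qquad
\mathrm{tf}\cdot\mathrm{idf} \approx 0.168
```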

answered Aug 25 '17 at 18:00

shiv

1299

Thanks @shiv. But I have already implemented Tf-Idf and I did it with Lucene (for faster processing). The problem is Tf-Idf gives you "important terms" per document and not over the whole set of documents.

– Vijender
Aug 28 '17 at 5:03
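
One possible way to get from per-document scores to a corpus-level ranking, as raised in the comment above, is to sum each term's tf-idf over all documents and keep the highest totals. The following is only a minimal sketch under my own assumptions, not something proposed in this thread: the function name `corpus_keywords` is hypothetical, tokenization is a plain lowercase whitespace split, and idf is taken as ln(N / df) without the "1 +" used in the Java class, so that a term present in every document contributes nothing.

```python
import math
from collections import Counter, defaultdict


def corpus_keywords(docs, top_k=300):
    """Rank terms over a whole corpus by their summed per-document tf-idf."""
    # Assumption: naive lowercase whitespace tokenization
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = defaultdict(float)
    for tokens in tokenized:
        if not tokens:
            continue  # an empty document contributes no scores
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            # Assumption: plain ln(N / df), which is 0 for terms in every document
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:top_k]]
```

Summing favors terms that score well in several documents; taking the maximum or the mean per term instead would be an equally reasonable choice, and which one works best likely depends on the corpus.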


-1

import os
import operator
from collections import defaultdict

files = os.listdir()
words = defaultdict(int)

for file in files:
    # Count every whitespace-separated token across all files
    with open(file, "r") as open_file:
        for line in open_file:
            for word in line.split():
                words[word] += 1

# Sort by count, most frequent first
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the top 300 entries from sorted_words (that is, sorted_words[:300]); they are the words you want.

answered Aug 24 '17 at 13:13

Awaish Kumar

1099

Thanks @Awaish, but I have tried this also. The results were very poor with this approach because the important terms only appear once or twice. If I try to sort and select Tf-idf terms based on frequency, a lot of common and irrelevant terms come up.

– Vijender
Aug 28 '17 at 5:07
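
The "common and irrelevant terms" part of the comment above can be illustrated with a small sketch: count terms as in the answer, but drop stop words and any term that appears in more than a chosen share of the documents. This is my own illustration, not something suggested in the thread. The function name `frequent_terms`, the `max_doc_ratio` parameter and its 0.5 default are all assumptions, and it does nothing for important terms that occur only once or twice.

```python
from collections import Counter


def frequent_terms(docs, top_k=300, max_doc_ratio=0.5, stop_words=frozenset()):
    """Rank terms by total count, skipping stop words and overly widespread terms."""
    # Assumption: naive lowercase whitespace tokenization
    tokenized = [doc.lower().split() for doc in docs]

    total_counts = Counter()  # occurrences of each term over the whole corpus
    doc_freq = Counter()      # number of documents that contain each term
    for tokens in tokenized:
        total_counts.update(tokens)
        doc_freq.update(set(tokens))

    # Terms found in more than this many documents are treated as too common
    limit = max_doc_ratio * len(tokenized)
    kept = {
        term: count
        for term, count in total_counts.items()
        if term not in stop_words and doc_freq[term] <= limit
    }
    return [term for term, _ in Counter(kept).most_common(top_k)]
```

With `max_doc_ratio=1.0` and no stop words this reduces to plain frequency ranking, which makes it easy to compare the two rankings on the same documents.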
