Extract most important keywords from a set of documents












I have a set of 3000 text documents and I want to extract the top 300 keywords (each could be a single word or a multi-word phrase).

I have tried the following approaches:

RAKE: A Python-based keyword extraction library. It failed miserably.

Tf-Idf: It gave me good keywords per document, but I am not able to aggregate them and find keywords that represent the whole group of documents. Also, just selecting the top k words from each document based on Tf-Idf score won't help, right?

Word2vec: I was able to do some cool stuff like finding similar words, but I am not sure how to find important keywords with it.

Can you please suggest a good approach (or elaborate on how to improve any of the above three) to solve this problem? Thanks :)










      nlp rake feature-extraction word2vec tf-idf






asked Aug 24 '17 at 12:07 by Vijender

          3 Answers

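It is better for you to choose those 300 words manually (it is not that many, and you only do it once). Code written in Python 3:

    import os

    # Only regular files; os.listdir() also returns directories.
    files = [f for f in os.listdir() if os.path.isfile(f)]
    topWords = ["word1", "word2.... etc"]
    wordsCount = 0
    for file in files:
        # Read the whole file and split it into lines.
        with open(file, "r") as file_opened:
            lines = file_opened.read().split("\n")
        for word in topWords:
            # Report the word if it occurs in any line of this file.
            if wordsCount < 300 and any(word in line for line in lines):
                print("I found %s" % word)
                wordsCount += 1
        # Check wordsCount again to leave the outer loop as well.
        if wordsCount == 300:
            break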






answered Aug 24 '17 at 12:21 by durduliu2009


The easiest and most effective way is to apply a tf-idf implementation to find the most important words. If you have stop words, you can filter them out before applying this code. Hope this works for you.

    import java.util.List;

    /**
     * Class to calculate TfIdf of term.
     * @author Mubin Shrestha
     */
    public class TfIdf {

        /**
         * Calculates the tf of term termToCheck
         * @param totalterms : Array of all the words under processing document
         * @param termToCheck : term of which tf is to be calculated.
         * @return tf(term frequency) of term termToCheck
         */
        public double tfCalculator(String[] totalterms, String termToCheck) {
            double count = 0; //to count the overall occurrence of the term termToCheck
            for (String s : totalterms) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                }
            }
            return count / totalterms.length;
        }

        /**
         * Calculates idf of term termToCheck
         * @param allTerms : all the terms of all the documents, one array per document
         * @param termToCheck : term of which idf is to be calculated.
         * @return idf(inverse document frequency) score
         */
        public double idfCalculator(List<String[]> allTerms, String termToCheck) {
            double count = 0; //number of documents that contain termToCheck
            for (String[] ss : allTerms) {
                for (String s : ss) {
                    if (s.equalsIgnoreCase(termToCheck)) {
                        count++;
                        break;
                    }
                }
            }
            //note: evaluates to Infinity when no document contains the term
            return 1 + Math.log(allTerms.size() / count);
        }
    }
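
For readers following the Python answers in this thread, the two formulas in the Java class above can be sketched in Python as follows. This is only an illustration: the function names `tf` and `idf` and the toy `docs` list are made up for the example, and returning 0 for a term that occurs in no document is an assumption (the Java version would yield Infinity there).

```python
import math


def tf(doc_terms, term):
    """Share of the tokens in one document that equal `term` (case-insensitive)."""
    if not doc_terms:
        return 0.0
    term = term.lower()
    hits = sum(1 for t in doc_terms if t.lower() == term)
    return hits / len(doc_terms)


def idf(all_docs, term):
    """1 + ln(N / df), where df is the number of documents containing `term`."""
    term = term.lower()
    df = sum(1 for doc in all_docs if any(t.lower() == term for t in doc))
    if df == 0:
        return 0.0  # assumption: score unseen terms as 0 instead of dividing by zero
    return 1 + math.log(len(all_docs) / df)


# Toy corpus (hypothetical data): each document is a list of tokens.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "date"],
]
tf_apple = tf(docs[0], "apple")      # 2 of the 3 tokens in the first document
idf_apple = idf(docs, "apple")       # "apple" occurs in 1 of the 3 documents
tfidf_apple = tf_apple * idf_apple   # tf-idf of "apple" in the first document
print(tfidf_apple)
```

The product is a score for one term in one document, which is why it does not by itself rank terms across the whole collection.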






answered Aug 25 '17 at 18:00 by shiv

• Thanks @shiv. But I have already implemented Tf-Idf, and I did it with Lucene (for faster processing). The problem is that Tf-Idf gives you "important terms" per document and not over the whole set of documents. – Vijender, Aug 28 '17 at 5:03
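
One possible way to turn per-document Tf-Idf scores into a single ranking for the whole set is to sum each term's score over all documents. The sketch below is a heuristic under stated assumptions, not a method confirmed anywhere in this thread: the function name `corpus_keywords` is hypothetical, documents are assumed to be already tokenized, and idf is taken as plain ln(N / df) (without the "1 +" used in the Java answer) so that terms present in every document score zero.

```python
import math
from collections import Counter, defaultdict


def corpus_keywords(docs, top_n=300):
    """Rank terms by the sum of their per-document tf-idf scores."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    totals = defaultdict(float)
    for doc in docs:
        if not doc:
            continue  # skip empty documents to avoid dividing by zero
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])  # assumption: plain ln(N / df)
            totals[term] += tf * idf
    # Highest total first; ties are broken alphabetically for a stable order.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Toy corpus (hypothetical data): each document is a list of tokens.
docs = [
    ["solar", "panel", "energy"],
    ["solar", "energy", "grid"],
    ["wind", "energy", "grid"],
]
for term, score in corpus_keywords(docs, top_n=5):
    print(term, round(score, 3))
```

Other aggregations (the maximum or the mean over documents, for example) are equally plausible; which one gives useful keywords depends on the data and would need to be checked by hand.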





















    import os
    import operator
    from collections import defaultdict

    # Only regular files; os.listdir() also returns directories.
    files = [f for f in os.listdir() if os.path.isfile(f)]
    words = defaultdict(int)
    for file in files:
        with open(file, "r") as open_file:
            for line in open_file:
                # Count every whitespace-separated token.
                for word in line.split():
                    words[word] += 1
    # Sort by count, most frequent first.
    sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the top 300 from sorted_words; they are the words you want.
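
The final step, taking the most frequent words, can be sketched with `collections.Counter`, whose `most_common(n)` method returns the `n` highest counts directly. The function name `top_words`, the lowercasing, and the in-memory `texts` list are assumptions made for this illustration; the answer above reads files from the current directory instead.

```python
from collections import Counter


def top_words(texts, n=300):
    """Return the n most frequent whitespace-separated tokens with their counts."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase so "The" and "the" count as the same word.
        counts.update(text.lower().split())
    return counts.most_common(n)


# Toy input (hypothetical data) standing in for the contents of the files.
texts = ["The cat sat", "the cat ran", "the dog"]
print(top_words(texts, n=2))
```

As with any raw frequency count, very common words dominate the result unless stop words are removed first.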







answered Aug 24 '17 at 13:13 by Awaish Kumar

• Thanks @Awaish, but I have tried this also. The results were very poor with this approach because the important terms only appear once or twice. If I try to sort and select Tf-Idf terms based on frequency, a lot of common and irrelevant terms come up. – Vijender, Aug 28 '17 at 5:07
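
One common way to keep very frequent, uninformative terms out of a ranking is to discard any term that occurs in too large a share of the documents before scoring. The sketch below only illustrates that idea and has not been tried on the data discussed here: the function name `filter_by_document_frequency` and the 0.5 cut-off are arbitrary assumptions, and rare terms are deliberately kept because the comment above says the important terms appear only once or twice.

```python
from collections import Counter


def filter_by_document_frequency(docs, max_df_ratio=0.5):
    """Map each kept term to its document frequency, dropping terms that
    occur in more than `max_df_ratio` of the documents."""
    n_docs = len(docs)
    # Count each term once per document, however often it repeats inside it.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: count for term, count in df.items()
            if count / n_docs <= max_df_ratio}


# Toy corpus (hypothetical data): each document is a list of tokens.
docs = [
    ["the", "solar", "panel"],
    ["the", "solar", "grid"],
    ["the", "wind", "grid"],
    ["the", "wind", "farm"],
]
kept = filter_by_document_frequency(docs)
print(sorted(kept))
```

The surviving terms can then be scored with Tf-Idf or plain counts; the cut-off would have to be tuned on the real collection.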


















