Assign an ID based on keywords present in Tweets












I extracted tweets by searching for 44 different keywords, and the output is a file of 400k tweets in total. The output file has tweets that contain the relevant keywords. How can I create a separate ID column that holds the keyword present in each tweet?

For example, the tweet is:

Andhra Pradesh is the highest state with crimes against women

The keyword here is "crimes against women".

I would like to create a column that assigns the keyword "crimes against women" to the tweet, a sort of ID column to be precise.

# input: column 1
Tweet <- "Andhra Pradesh is the highest state with crimes against women"

# expected output: column 2, beside the Tweet column
Keyword <- "crimes against women"

Edit: I do not want to extract any part of the tweet. I just want to assign to each tweet, in a new column, the keyword it contains, so that I can segregate the tweets by keyword.







r nlp uniqueidentifier














edited Jan 18 at 15:39 by James Z
asked Jan 18 at 12:10 by Skurup













  • Do you have a list of the keywords that you want to extract from the tweets?
    – A. Stam
    Jan 18 at 12:16

  • Yes, I have the list of keywords, 44 to be exact. I used the keywords to extract the tweets in the first place.
    – Skurup
    Jan 18 at 12:25

  • Oh, sorry. I thought that was what you were looking for; I misread. Let me re-open your question.
    – Sotos
    Jan 18 at 12:30































2 Answers
You can perform this analysis with the stringr package. However, I don't think you need to use sapply, because str_extract is already vectorised over its input.

Consider the following keyword list and table with tweets:

library(stringr)

keyword_list <- c("crimes against women", "downloading tweets", "r analysis")

tweets <- data.frame(
  tweet = c("Andhra Pradesh is the highest state with crimes against women",
            "I am downloading tweets",
            "I love r analysis",
            "downloading tweets helps with my r analysis"),
  stringsAsFactors = FALSE  # keep tweets as character, not factor (the default before R 4.0)
)

First, combine your keywords into one regular expression that matches any of the strings.

# Join the keywords with "|" (regex alternation) and wrap them in a group
keyword_pattern <- paste0(
  "(",
  paste0(keyword_list, collapse = "|"),
  ")"
)

keyword_pattern
#> [1] "(crimes against women|downloading tweets|r analysis)"
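The pattern above treats every keyword as a regular expression and matches case-sensitively. As a minimal sketch of one way to harden it, assuming the stringr package: the keyword "c++ tips" and both tweets below are made-up illustrations, not part of the original data. Each keyword is wrapped in \Q...\E so it is matched literally, and regex(..., ignore_case = TRUE) makes the match case-insensitive.

```r
library(stringr)

# Hypothetical data: one keyword contains regex metacharacters ("c++"),
# and one tweet is capitalised differently from its keyword.
keyword_list <- c("crimes against women", "c++ tips")
tweets <- data.frame(
  tweet = c("Report on Crimes Against Women released",
            "Sharing some c++ tips today"),
  stringsAsFactors = FALSE
)

# \Q...\E quotes each keyword so that characters such as "+" are literal.
keyword_pattern <- paste0("\\Q", keyword_list, "\\E", collapse = "|")

# ignore_case = TRUE matches regardless of capitalisation; str_to_lower()
# maps the matched text back to the lower-case spelling of the keyword.
tweets$keyword <- str_to_lower(
  str_extract(tweets$tweet, regex(keyword_pattern, ignore_case = TRUE))
)

tweets$keyword
```

Note that this is still a substring match: a keyword such as "r analysis" would also be found inside "their analysis". Adding word boundaries (\\b) around each keyword is one possible remedy, if that matters for your keyword list.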


Finally, we can add a column to the data frame that holds the keyword found in each tweet.

tweets$keyword <- str_extract(tweets$tweet, keyword_pattern)

tweets
#>                                                           tweet              keyword
#> 1 Andhra Pradesh is the highest state with crimes against women crimes against women
#> 2                                       I am downloading tweets   downloading tweets
#> 3                                             I love r analysis           r analysis
#> 4                   downloading tweets helps with my r analysis   downloading tweets

As the final example illustrates, you need to think about what you want to do when a tweet contains multiple keywords. In this case, the keyword returned is simply the first one found in the tweet. However, you can also use str_extract_all to return all keywords found in the tweet.
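To illustrate the str_extract_all route, here is a minimal sketch, assuming stringr. The column name keywords, the "; " separator, and the third tweet are illustrative choices, not part of the original answer. str_extract_all returns a list, so each element is collapsed into a single string to fit in one column.

```r
library(stringr)

keyword_list <- c("crimes against women", "downloading tweets", "r analysis")
tweets <- data.frame(
  tweet = c("I love r analysis",
            "downloading tweets helps with my r analysis",
            "nothing relevant here"),  # hypothetical tweet with no keyword
  stringsAsFactors = FALSE
)
keyword_pattern <- paste0(keyword_list, collapse = "|")

# str_extract_all() returns a list with one character vector per tweet.
all_matches <- str_extract_all(tweets$tweet, keyword_pattern)

# Collapse each vector into one string; tweets without any match get NA.
tweets$keywords <- vapply(
  all_matches,
  function(m) {
    if (length(m) == 0) NA_character_ else paste(unique(m), collapse = "; ")
  },
  character(1)
)

tweets$keywords
```

Whether a multi-keyword tweet should appear once with a combined label, or once per keyword, depends on how you plan to segregate the tweets afterwards.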

















answered Jan 18 at 12:50 by A. Stam

























We can use stringr, which is very handy for string operations, and simply use str_extract, i.e.

library(stringr)

# Tweet and Keyword as defined in the question
str_extract(Tweet, Keyword)
#[1] "crimes against women"

For multiple keywords and multiple strings, you need to collapse the keywords into a single pattern and apply str_extract to each string, i.e.

Keyword <- c("crimes against women", "something")
Tweet <- c("Andhra Pradesh is the highest state with crimes against women",
           "another string with something else")

sapply(Tweet, function(i) str_extract(i, paste(Keyword, collapse = '|')))

# Andhra Pradesh is the highest state with crimes against women
#                                        "crimes against women"
#                            another string with something else
#                                                   "something"
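Since the stated goal is to segregate tweets by keyword, the extracted keyword can be used directly as a grouping variable. A minimal sketch under a few assumptions: the third tweet and the object names tweets, keyword_counts and tweets_by_keyword are hypothetical, and str_extract is called on the whole vector at once, which it supports.

```r
library(stringr)

Keyword <- c("crimes against women", "something")
Tweet <- c("Andhra Pradesh is the highest state with crimes against women",
           "another string with something else",
           "something different again")  # hypothetical extra tweet

# Build a two-column data frame: the tweet and the keyword it contains.
tweets <- data.frame(
  Tweet = Tweet,
  Keyword = str_extract(Tweet, paste(Keyword, collapse = "|")),
  stringsAsFactors = FALSE
)

# Count tweets per keyword, then split the tweets into one group per keyword.
keyword_counts <- table(tweets$Keyword)
tweets_by_keyword <- split(tweets$Tweet, tweets$Keyword)

keyword_counts
tweets_by_keyword[["something"]]
```

split() returns a named list, so each keyword's tweets can be processed or written out separately.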
















answered Jan 18 at 12:36 by Sotos





























