Selecting top n groups with dplyr then plotting other variables



























I have a dataset where I want to select only the top n groups by counting one category, and then plot using other variables in the dataset. In other words, the top n is determined at one level of aggregation, but I need to go back to the full data to plot with ggplot.

In the example below, I want the two most common examName values, and then a plot of their counts per year, faceted by examName with facet_wrap().



library(tidyverse)  # provides tribble(), the dplyr verbs, and ggplot2

ap <- 
tribble(
~year, ~examName,
2014, "Statistics",
2015, "Statistics",
2016, "Statistics",
2016, "Statistics",
2016, "Statistics",
2016, "Statistics",
2017, "Statistics",
2017, "Statistics",
2017, "Statistics",
2017, "Statistics",
2017, "Statistics",
2013, "Macroeconomics",
2013, "Macroeconomics",
2014, "Macroeconomics",
2015, "Macroeconomics",
2016, "Macroeconomics",
2016, "Macroeconomics",
2016, "Macroeconomics",
2016, "Macroeconomics",
2016, "Macroeconomics",
2017, "Macroeconomics",
2017, "Macroeconomics",
2017, "Macroeconomics",
2017, "Macroeconomics",
2017, "Macroeconomics",
2017, "Macroeconomics",
2013, "Calculus",
2014, "Calculus",
2015, "Calculus",
2016, "Calculus",
2017, "Calculus",
2017, "Psychology",
2017, "Psychology",
2017, "Psychology",
2017, "Psychology",
2017, "Psychology",
2018, "Psychology",
2018, "Psychology")


ap_top <- ap %>%
count(examName, sort = TRUE) %>%
head(2) %>%
inner_join(ap, by = "examName") %>%
select(-n)

ap_top %>%
count(examName, year) %>%
ggplot(aes(x = year, y = n, group = examName)) +
geom_line() +
facet_wrap(~ examName)


My approach is to compute the top n, then inner_join() that result back onto the original dataset and plot from the joined data, essentially using the inner join as a filter.



I know there's a better way to do this, and I would love a more elegant solution! I'm all ears! Example dataset given (sorry it's so long).
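
The join-as-a-filter idea described above can also be written with a filtering join. The following is only a sketch, assuming dplyr's semi_join(); `ap_small` and `top2` are hypothetical names, and the rows are a small stand-in for the real `ap` data.

```r
library(dplyr)
library(tibble)

# Small stand-in for the `ap` data (hypothetical rows, for illustration only).
ap_small <- tribble(
  ~year, ~examName,
  2016, "Statistics",
  2017, "Statistics",
  2017, "Statistics",
  2016, "Macroeconomics",
  2017, "Macroeconomics",
  2017, "Macroeconomics",
  2017, "Macroeconomics",
  2017, "Calculus"
)

# The two most frequent exam names, with their counts in column `n`.
top2 <- ap_small %>%
  count(examName, sort = TRUE) %>%
  head(2)

# semi_join() keeps the rows of `ap_small` that have a match in `top2`
# and returns only the columns of `ap_small`, so no select(-n) is needed.
ap_top <- ap_small %>%
  semi_join(top2, by = "examName")
```

Because semi_join() never adds columns from the right-hand table, the result has the same columns as the original data and can be passed straight to count(examName, year) for plotting.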










      r ggplot2 dplyr














      asked Jan 18 at 20:09 by talbe009; edited Jan 18 at 20:18 by talbe009
























          2 Answers














                You don't need inner_join(). I would just determine the top two exams in a separate statement and then filter on those.

                # Names of the two most frequent exams
                top_exams <- count(ap, examName) %>% 
                  top_n(2, n) %>%
                  pull(examName)

                # Keep only those exams, count rows per year, and draw one panel per exam
                ap %>%
                  filter(examName %in% top_exams) %>%
                  count(year, examName) %>%
                  ggplot(aes(x = year, y = n, group = examName)) +
                  geom_line() +
                  facet_wrap(~ examName)
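
In dplyr 1.0.0 and later, top_n() is marked as superseded in favour of slice_max(). A minimal sketch of the same selection step with slice_max() follows; `exam_counts` is a hypothetical name holding the per-exam row counts of the question's data, typed in by hand so the block runs on its own.

```r
library(dplyr)
library(tibble)

# Per-exam row counts for the question's `ap` data, typed in by hand.
exam_counts <- tribble(
  ~examName,        ~n,
  "Macroeconomics", 15,
  "Statistics",     11,
  "Psychology",      7,
  "Calculus",        5
)

# Take the two rows with the largest `n`. with_ties = FALSE guarantees
# exactly two rows even if several exams share the second-largest count.
top_exams <- exam_counts %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  pull(examName)
```

With the default with_ties = TRUE, slice_max() behaves like top_n() and may return more than two exams when counts are tied, so the choice depends on whether a strict limit of two panels is wanted.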
















                answered Jan 18 at 20:21 by dylanjm

























                    Another possibility:

                    ap %>% 
                      group_by(examName) %>%
                      mutate(temp = n()) %>%                      # rows per exam, repeated on every row
                      ungroup() %>%
                      mutate(temp = dense_rank(desc(temp))) %>%   # 1 = most frequent exam
                      filter(temp %in% c(1, 2)) %>%
                      select(-temp) %>%
                      count(year, examName) %>%
                      ggplot(aes(x = year, y = n, group = examName)) +
                      geom_line() +
                      facet_wrap(~ examName)

                    This counts the rows per examName and ranks the exams by that count. It then keeps the rows belonging to the exams with the largest and second-largest counts.
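
One detail of ranking with dense_rank() that may matter here is how it treats ties. The sketch below uses hypothetical group sizes (not the question's data) to compare it with min_rank().

```r
library(dplyr)

# Hypothetical group sizes with a tie for second place.
group_sizes <- c(15, 11, 11, 7)

# dense_rank(): tied values share a rank and the next rank follows directly.
dense <- dense_rank(desc(group_sizes))

# min_rank(): tied values share a rank and the following rank is skipped.
minr <- min_rank(desc(group_sizes))

dense  # 1 2 2 3
minr   # 1 2 2 4
```

With tied counts, a filter on ranks 1 and 2 therefore keeps every exam that shares the second-largest count, so the plot can end up with more than two panels.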

















                    answered Jan 18 at 20:43 by tmfmnk













                    • What's nice about this solution is that you could do things with the dense_rank, like use it in fct_reorder for sorting in the plot.

                      – talbe009
                      Jan 18 at 21:01
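
The idea in this comment could look roughly like the sketch below, which keeps the rank column and uses it to order the factor levels. It assumes the forcats package; `ap_small`, `size`, and `exam_rank` are hypothetical names, and the rows are a small stand-in for the real data.

```r
library(dplyr)
library(forcats)
library(tibble)

# Small stand-in for the `ap` data (hypothetical rows, for illustration only).
ap_small <- tribble(
  ~year, ~examName,
  2016, "Statistics",
  2017, "Statistics",
  2017, "Statistics",
  2017, "Statistics",
  2016, "Macroeconomics",
  2017, "Macroeconomics",
  2017, "Macroeconomics",
  2017, "Calculus"
)

ranked <- ap_small %>%
  add_count(examName, name = "size") %>%               # rows per exam, on every row
  mutate(exam_rank = dense_rank(desc(size))) %>%       # 1 = most frequent exam
  filter(exam_rank <= 2) %>%
  mutate(examName = fct_reorder(examName, exam_rank))  # levels follow the rank

levels(ranked$examName)  # "Statistics" "Macroeconomics"
```

facet_wrap(~ examName) lays out panels in factor-level order, so the most frequent exam would come first instead of the alphabetical default.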








































