How to subsample different numbers by ID and bootstrap in R

First, I'm trying to subsample a large dataset with many individuals, but each individual requires a different subsample size. I'm comparing across two time periods, so I want to subsample each individual by the minimum number of data points it has across the two periods. Second, I have multiple metrics (mostly various means) to calculate per individual, per time period (I've provided one example below). Third, I want to bootstrap 1,000 reps for those metrics. I also want to do this for the population (by averaging across individuals). I have an example of what I've tried below, but that may be way off. I'm open to functions or for loops - I can't conceptualize which is better for this question. (I apologize ahead of time if my code is not efficient - I'm self-taught from googling.)



# Packages used throughout
library(dplyr)

# Example dataset
Data <- data.frame(
  ID     = sample(c("A", "B", "C", "D"), 50, replace = TRUE),
  Act    = sample(c("eat", "sleep", "play"), 50, replace = TRUE),
  Period = sample(c("pre", "post"), 50, replace = TRUE)
)

# Separate my data by period
DataPre  <- Data[which(Data$Period == "pre"), ]
DataPost <- Data[which(Data$Period == "post"), ]

# Get the minimum number of observations for each ID across both periods
Num <- Data %>%
  group_by(ID, Period) %>%
  summarise(number = n()) %>%
  group_by(ID) %>%
  summarise(min = min(number))

# Function to get the mean proportion of each activity per ID
meanAct <- function(x){
  x %>%
    group_by(ID, Act) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n))
}
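
(For reference, a quick check of what meanAct returns on one period - the exact numbers depend on the random data above, and this relies on dplyr's default behavior of dropping the last grouping level after summarise():)

meanAct(DataPre)
# Returns one row per ID/Act pair with columns n and freq; because summarise()
# leaves the result grouped by ID, freq is the share of each activity within each ID.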


Below is how I would subsample if there were just ONE ID (not many IDs, each with a different subsampling requirement). I don't know how to specify a different subsample size for each ID and then replicate each.



# See "8888" Here I want to subsample the Num$Min for each ID
DataResults <- function(x, rep){
reps <- replicate(rep, meanAct(x[sample(1:nrow(x), 8888, replace=FALSE),]))
meanfreq <- apply(simplify2array(reps[3, 1:2]), 1, mean)
sd <- apply(simplify2array(reps[3, 1:2]), 1, sd)
lower <- meanfreq - 1.96*(sd/sqrt(8888))
upper <- meanfreq + 1.96*(sd/sqrt(8888))
meanAct <- as.vector(reps[[1]])
output <- data.frame(meanAct, meanfreq, sd, lower, upper)
print(output)
}

# Print results
DataResults(DataPre, 1000)
DataResults(DataPost, 1000)

# Somehow I get the mean for the population by averaging across all IDs
DataMeanGroup <- DataMean %>%
group_by(Period) %>%
summarise (mean = mean(prop))


The results I'm looking for are the means for each activity for each individual, based on subsampling (by the minimum number of data points PER INDIVIDUAL) and bootstrapping 1,000 reps. Also, if possible, the overall mean for the population by averaging across individuals (again from subsampling and bootstrapping).



EDIT: Additional information:
The ultimate result should allow me to compare the proportion of time that each ID spends on each activity across the two time periods (e.g., compare the % of time that A spends eating pre vs. post, etc.), but subsampled in the period with more data so that we're comparing an equal number of observations. The way the code would run in my head: (1) subsample the observations so that we're comparing an equal number of observations for each ID across the two periods, (2) calculate the proportion of each activity for each ID in each time period, and (3) repeat that subsample calculation 1,000 times so that the proportion we end up with is representative of the total observations.
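
For what it's worth, a minimal sketch of steps (1)-(3) in dplyr, assuming the Data and Num objects defined above; one_draw, draws, boot_summary, and pop_summary are illustrative names (not anything from the answer below), and the .groups argument needs dplyr >= 1.0:

# One subsample draw: for each ID and Period, keep that ID's minimum count of rows,
# then compute the proportion of each activity per ID per Period
one_draw <- function(dat, mins) {
  pieces <- lapply(split(dat, list(dat$ID, dat$Period), drop = TRUE), function(sub) {
    k <- mins$min[mins$ID == sub$ID[1]]          # step (1): this ID's minimum count
    sub[sample(nrow(sub), k), ]
  })
  do.call(rbind, pieces) %>%
    group_by(ID, Period, Act) %>%
    summarise(n = n(), .groups = "drop_last") %>%
    mutate(freq = n / sum(n)) %>%                # step (2): proportion per ID per Period
    ungroup()
}

# Step (3): repeat 1,000 times and summarise the bootstrapped proportions
set.seed(1)
draws <- bind_rows(replicate(1000, one_draw(Data, Num), simplify = FALSE), .id = "rep")
boot_summary <- draws %>%
  group_by(ID, Period, Act) %>%
  summarise(mean_freq = mean(freq), sd_freq = sd(freq), .groups = "drop")

# Population-level result: average across individuals
pop_summary <- boot_summary %>%
  group_by(Period, Act) %>%
  summarise(pop_mean_freq = mean(mean_freq), .groups = "drop")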










Tags: r, function, for-loop, bootstrapping, subsampling

asked Jan 19 at 3:18 by user9351962, edited Jan 20 at 2:45
1 Answer
Consider generalizing your subsampling function to receive the subsets of the data frame passed in by by(), which slices the data frame by each unique pairing of ID and Period. But first calculate Min_Num for each ID (the minimum of its per-Period counts) using ave (inline aggregation). All code below uses base R (i.e., no other package):

Data and Functions

# Example dataset (WITH MORE ROWS)
set.seed(11919)
Data <- data.frame(
  ID     = sample(c("A", "B", "C", "D"), 500, replace = TRUE),
  Act    = sample(c("eat", "sleep", "play"), 500, replace = TRUE),
  Period = sample(c("pre", "post"), 500, replace = TRUE)
)

# MIN NUM PER ID AND PERIOD GROUPING (NESTED ave FOR COUNT AND MIN AGGREGATIONS)
Data$Min_Num <- with(Data, ave(ave(1:nrow(Data), ID, Period, FUN = length), ID, FUN = min))

# Function to get the mean proportion per ID
meanAct <- function(x){
  within(x, {
    n <- ave(1:nrow(x), ID, Act, FUN = length)
    freq <- n / sum(n)
  })
}

DataResults <- function(df, rep){
  reps <- replicate(rep, meanAct(df[sample(1:nrow(df), df$Min_Num[1], replace = FALSE), ]))

  mean_freq <- apply(simplify2array(reps["freq", ]), 1, mean)   # ADJUSTED INDEXING
  sd        <- apply(simplify2array(reps["freq", ]), 1, sd)     # ADJUSTED INDEXING
  lower     <- mean_freq - 1.96 * (sd / sqrt(df$Min_Num[1]))
  upper     <- mean_freq + 1.96 * (sd / sqrt(df$Min_Num[1]))
  mean_act  <- as.vector(reps[[2]])                             # ADJUSTED [[#]] NUMBER
  id        <- df$ID[1]                                         # ADD GROUP INDICATOR
  period    <- df$Period[1]                                     # ADD GROUP INDICATOR

  output <- data.frame(id, period, mean_act, mean_freq, sd, lower, upper)
  return(output)
}
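
Note on the design: because by() hands each ID/Period subset to DataResults separately, every row of df within a call shares the same Min_Num, ID, and Period, so df$Min_Num[1], df$ID[1], and df$Period[1] simply read off that group's subsample size and its group labels.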


Processing

# BY CALL
df_list <- by(Data, Data[c("ID", "Period")], function(sub) DataResults(sub, 1000))

# BIND ALL DFs INTO ONE DF
final_df <- do.call(rbind, df_list)
head(final_df, 10)
#    id period mean_act  mean_freq          sd      lower      upper
# 1   A   post    sleep 0.02157354 0.005704140 0.01992512 0.02322196
# 2   A   post      eat 0.02151701 0.005720058 0.01986399 0.02317003
# 3   A   post    sleep 0.02171393 0.005808156 0.02003546 0.02339241
# 4   A   post      eat 0.02164184 0.005716603 0.01998982 0.02329386
# 5   A   post     play 0.02174095 0.005678416 0.02009996 0.02338193
# 6   A   post      eat 0.02181380 0.005716590 0.02016178 0.02346581
# 7   A   post    sleep 0.02172458 0.005691051 0.02007995 0.02336922
# 8   A   post    sleep 0.02174288 0.005666839 0.02010524 0.02338052
# 9   A   post     play 0.02166234 0.005673047 0.02002291 0.02330177
# 10  A   post     play 0.02185057 0.005813680 0.02017050 0.02353065


Summarization

# SUMMARIZE FINAL DF (MEAN PROP BY ID AND ACT)
agg_df <- aggregate(mean_freq ~ id + mean_act, final_df, mean)
agg_df
#    id mean_act  mean_freq
# 1   A      eat 0.02172782
# 2   B      eat 0.01469706
# 3   C      eat 0.01814771
# 4   D      eat 0.01696995
# 5   A     play 0.02178283
# 6   B     play 0.01471497
# 7   C     play 0.01819898
# 8   D     play 0.01688828
# 9   A    sleep 0.02169912
# 10  B    sleep 0.01470978
# 11  C    sleep 0.01818944
# 12  D    sleep 0.01697438

# SUMMARIZE FINAL DF (MEAN PROP BY ID AND PERIOD)
agg_df <- aggregate(mean_freq ~ id + period, final_df, mean)
agg_df
#   id period  mean_freq
# 1  A   post 0.02173913
# 2  B   post 0.01470588
# 3  C   post 0.01818182
# 4  D   post 0.01694915
# 5  A    pre 0.02173913
# 6  B    pre 0.01470588
# 7  C    pre 0.01818182
# 8  D    pre 0.01694915

# SUMMARIZE FINAL DF (MEAN PROP BY ID)
agg_df <- aggregate(mean_freq ~ id, final_df, mean)
agg_df
#   id  mean_freq
# 1  A 0.02173913
# 2  B 0.01470588
# 3  C 0.01818182
# 4  D 0.01694915
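
If you also want the population-level mean by averaging across individuals (as the question asks), one possible extension of the same aggregate approach is the two-step sketch below; per_id and pop_df are illustrative names, and final_df is the object built above:

# Average per ID first, then across IDs, so each individual contributes equally
per_id <- aggregate(mean_freq ~ id + period + mean_act, final_df, mean)
pop_df <- aggregate(mean_freq ~ period + mean_act, per_id, mean)
pop_df   # population mean proportion per Period and Act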





answered Jan 19 at 16:59 by Parfait
• Thank you so much for your thoughtful and thorough response. I'm still working my way through it and trying to make sure I understand everything, but I wanted to acknowledge your response and tell you how appreciative I am! I've added an "EDIT" section above to clarify the question/result - to make sure the question that I'm trying to answer is in fact the question I asked (and the code I provided), and to confirm that I'm understanding the output of the code above. I do have two questions below:

  – user9351962
  Jan 20 at 2:42













• For the meanAct function - should Period also be included? Or is it okay that Period is left out here because Period is handled later in the BY CALL section? freq seems low to me - shouldn't it be the proportion of each activity per ID per Period? Here it looks like it is the proportion out of the entire dataset. Is that right?

  – user9351962
  Jan 20 at 2:43











          • The other piece I’m trying to make sure I understand is in the DataResults function. Here it looks like we’re subsampling the mean proportion per ID from meanAct. Is that right? Or is it actively calculating a mean proportion per ID per activity per period for every subsample (1000 times)? The second is what I thought it would do – but maybe those give you the same result?

            – user9351962
            Jan 20 at 2:43











• I cannot answer all of your questions, as it sounds like you need to formulate your needs beyond programming. Here, by splits your data frame into smaller data frames filtered to every combination of ID and Period (if needed, add Act) and passes the subsets into DataResults. So all operations within the function, including meanAct, work on filtered subsets (never the whole data frame).

  – Parfait
  Jan 21 at 15:15
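
To see the splitting behavior Parfait describes, a minimal check (reusing the Data frame from the answer) is to hand by() a trivial function and inspect the group sizes it passes along:

# Each ID/Period subset is passed to the supplied function; here we just count rows
by(Data, Data[c("ID", "Period")], nrow)

# The bootstrap call works the same way, only with DataResults() as the function:
# by(Data, Data[c("ID", "Period")], function(sub) DataResults(sub, 1000))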










