0

I am trying to perform hierarchical bootstrapping to get some sample means from a large dataset with a nested data structure.

I have a dataset that is analogous to this:

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
     '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)
df <- data.frame(cbind(ball, box, triangles))
df
--
ball box triangles
   1   1         1
   2   1         0
   3   1         1
   4   1         3
   5   2         1
   6   2         1
   7   2         2
   8   3         2
   9   3         0
  10   3         1
  11   3         1
  12   3         0
  13   3         4

And the idea is that there are three boxes, each with a number of balls in them. Each ball has a number of triangles on it, such that it looks something like this: Visual of my data

My goal here is, using bootstrapping, to estimate the mean number of triangles on each ball, while controlling for which box the ball is in.

I want the simulation to sample with replacement 10,000 times from the boxes, each time randomly pulling a box, and then randomly samples the balls n times with replacement where n is the number of balls in the box (i.e. if box 1 is picked, then the simulation will randomly sample those four balls, four times, ending up with any number of responses, ex. ball 1, ball 1, ball 3, ball 4).

I want it to then calculate the mean of the number of triangles on the balls it sampled, store that value, and then sample a new box, thus repeating the process.

So far, I've tried to use an rsample method (described here: ) like this:

#we need to sample groups aka boxes from 
#the dataframe so use list-columns in 
#tibbles
library(tidyverse)
library(tibble)
library(rsample)

Test <- df %>% nest(-box)
head(Test)

#now use bootstraps on this new tibble to 
#sample by ID
set.seed(002)
testbs <- bootstraps(Test, times = 10)
testbs

#let's look at one of the bootstrap 
#samples
as_tibble(testbs$splits[[1]]) %>% head()

#we can unnest the tibble and assess the 
#averages by box 
bs_avgtri<- map(testbs$splits, 
      ~as_tibble(.) %>% unnest() %>% 
                   group_by(box) %>% 
                   summarize(mean_tri = 
                   mean(triangles))) %>% 
                  bind_rows(.id = 'boots')
bs_avgtri

However, I think this is flawed because of how I'm nesting the data. Also the outputs I'm getting don't make sense, often displaying multiple bootstrap levels. So I'm tending to think it's going wrong, but also I'm unsure how to really parse out what the different functions are doing.

I also know the approach I'm borrowing from isn't really meant for what I'm doing, I'm trying to jerry-rig a way of doing it and I don't think it's doing what I need it to do.

The only other way I can think to do this is to write a couple nested for loops, but I am not strong with for loops in R, and I am fairly sure there's a better way.

If anyone has any insight into this I would be very very grateful!!!!

StupidWolf
  • 34,518
  • 14
  • 22
  • 47
colebrookson
  • 131
  • 7

2 Answers2

2

tidyr::crossing is very handy for simulations.

library("tidyverse")

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
         '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)
df <- tibble(ball, box, triangles)

df %>%
  # How many times do you want to run the simulation?
  crossing(rep = seq(3)) %>%
  # Next describe the sampling.
  # For each simulation and for each box...
  group_by(rep, box) %>%
  # randomly sample n() balls with replacement,
  # where n() is the number of balls in the box.
  sample_n(n(), ball, replace = TRUE) %>%
  # Compute the mean number of triangles (for each replicate, for each box)
  summarise(triangles = mean(triangles))
#> # A tibble: 9 x 3
#> # Groups:   rep [3]
#>     rep box   triangles
#>   <int> <chr>     <dbl>
#> 1     1 1          1.5 
#> 2     1 2          1.67
#> 3     1 3          2   
#> 4     2 1          2   
#> 5     2 2          1.33
#> 6     2 3          1.33
#> 7     3 1          2   
#> 8     3 2          1.67
#> 9     3 3          1.5

Created on 2019-03-04 by the reprex package (v0.2.1)

dipetkov
  • 1,816
  • 3
  • 9
  • Thanks! Question: when I run this, I get 'Error: This function should not be called directly', and as best I can tell, there's something about the sample_n() and using n() inside of it that's causing it not to run for me. I'm running the newest version of R and tidyverse, Is this something you encountered? If so, what did you do to fix it? – colebrookson Mar 04 '19 at 05:51
  • Ran the snippet with `reprex` to check whether I get errors or not. What other packages are loaded in your script? How do you modify the example? – dipetkov Mar 04 '19 at 08:30
  • In particular, if you use `plyr`, make sure you load it before `dplyr`/`tidyverse` as explained here: https://stackoverflow.com/questions/22801153/dplyr-error-in-n-function-should-not-be-called-directly – dipetkov Mar 04 '19 at 08:45
  • From a fresh R session, I load only the tidyverse, then run your code exactly how it is above. I get the 'should not be called directly' error, and then if I call 'rlang::last_error()' I get `:> rlang::last_error() message: This function should not be called directly class: rlang_error backtrace: 1. tidyr::crossing(., rep = seq(3)) 19. dplyr::group_by(., rep, box) 1. dplyr::sample_n(., n(), ball, replace = TRUE) Call 'rlang::last_trace()' to see the full backtrace >` – colebrookson Mar 04 '19 at 14:19
  • and using the full backtrace, the trace ends at `\-dplyr::n()` – colebrookson Mar 04 '19 at 14:24
  • We know how many balls there are. What happens if you replace `n()` with `13` or `dplyr::n()`? These are shots in the dark but I can't reproduce your error. I have `dplyr 0.8.0.1` and `tidyr 0.8.3`. – dipetkov Mar 04 '19 at 14:28
  • Figured it out! Uninstalling and reinstalling tidyverse, as well as making sure that df$triangles was numeric not a factor made it work. This is perfect now, thank you so much!!! Sorry for my dumb questions lol – colebrookson Mar 04 '19 at 14:55
1

I don't know much about rsample.

But according to your description, I think the base function sample is enough.

I wrote a simple version to achieve the mean value (based on my understanding). See if that's what you want.

set.seed(100)

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
         '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)

names(ball) = box
names(triangles) = ball

sample_balls = function(input_ball){
  chosen_box = sample(names(input_ball), 1, replace = T)
  chosen_balls = ball[which(names(input_ball) == chosen_box)]
  sampled_balls = sample(chosen_balls, length(chosen_balls), replace = T)
  return(sampled_balls)
}

nTriangles = unlist(lapply(1:100, function(x){
  nTriangle = triangles[sample_balls(ball)]
}))

mean(nTriangles)
#> [1] 1.331237
J.Li
  • 51
  • 6