Hello package maintainers!
I am building bootstrapped confidence intervals for groups, and I'm having trouble creating the multiple resampled datasets to build them from.
Using the palmerpenguins library as an example:
library(tidyverse)
library(infer)
library(palmerpenguins)
There are 344 total observations and each species has a different number of observations:
nrow(penguins)
# [1] 344
penguins %>% group_by(species) %>% count()
# A tibble: 3 × 2
# Groups:   species [3]
#   species       n
#   <fct>     <int>
# 1 Adelie      152
# 2 Chinstrap    68
# 3 Gentoo      124
I want to group by species and, for each species, draw multiple resamples that keep the original number of observations in that group.
set.seed(100)
slices <- penguins %>%
  group_by(species) %>%
  rep_slice_sample(prop = 1, replace = TRUE, reps = 10)
That should give me 344 * 10 = 3440 rows in the full resampled dataset, and it does. But when you look at the data, the number of observations per species differs from replicate to replicate. Every Adelie sample should have n = 152, every Chinstrap sample 68, and every Gentoo sample 124. Instead we find this:
slices %>% group_by(species, replicate) %>% count()
# A tibble: 30 × 3
# Groups:   species, replicate [30]
#    species replicate     n
#    <fct>       <int> <int>
#  1 Adelie          1   148
#  2 Adelie          2   147
#  3 Adelie          3   148
#  4 Adelie          4   151
#  5 Adelie          5   138
#  6 Adelie          6   157
#  7 Adelie          7   161
#  8 Adelie          8   157
#  9 Adelie          9   151
# 10 Adelie         10   138
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows
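For comparison, here is a rough sketch of the result I was expecting, written with plain dplyr instead of infer. It relies on dplyr::slice_sample() sampling within each group of a grouped data frame. The names `resamples` and `expected` are just placeholders for this illustration, and I'm not suggesting this is how rep_slice_sample() is meant to be used.

```r
library(dplyr)
library(palmerpenguins)

set.seed(100)

# Draw one bootstrap resample per species (sampling with replacement
# inside each group, so group sizes are preserved), and repeat 10 times.
resamples <- lapply(1:10, function(i) {
  penguins %>%
    group_by(species) %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    ungroup()
})

# Stack the 10 resamples; `replicate` records which resample a row came from.
expected <- bind_rows(resamples, .id = "replicate")

# Every species should now have the same n in every replicate.
expected %>% count(species, replicate)
```

With this approach each replicate keeps 152 Adelie, 68 Chinstrap, and 124 Gentoo rows, which is the behaviour I was hoping to get from the grouped rep_slice_sample() call.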
What am I missing?
Thanks for your insight.