I have a dataframe that has 20,000 observations and I want to bootstrap 700 of those observations, calculate the mean and repeat for 1,000 runs. I know how to code this myself but I was trying to use the "boot" library because of the great plotting and CI options.
df <- seq(1, 20000, 1)
meanfun <- function(data, ind) {
return(mean(data[ind]))
}
library(boot)
results <- boot(df, statistic=meanfun, R=1000)
I have read the documentation and I haven't seen how to CHOOSE the length of "ind".
If I were going to do it the hard way, I would use this code:
df <- seq(1, 20000, 1) # vector of 20,000 observations
meanfun <- function(data) {
return(mean(data))
} # function to calculate mean
S <- numeric(1000) # Vector to store 1000 values from random sampling
for (i in 1:1000) {
one_sample <- sample(df, 700) # sample 700 random observations
print(one_sample)
one_result <- meanfun(one_sample) # find mean of that sampling
S[i] <- one_result # Store that value
}
S
meanfun(S) # average value of 1000 values
But how do I choose to only randomly sample 700 observations 1000 times using the boot function?
Thanks in advance!
CodePudding user response:
What you are doing is subsampling without replacement rather than bootstrapping. I am not aware that this is possible with boot, since ind is resampled with replacement and I don't see any way to subset it. However, I don't use boot and might be wrong.
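To make the distinction concrete, here is a minimal base R sketch. The names x, sub and boot_draw are only illustrative; x stands in for the vector from the question.

```r
x <- seq(1, 20000, 1)   # same data as in the question

set.seed(1)
# Subsample: 700 values drawn WITHOUT replacement, so no value repeats
sub <- sample(x, 700)
# Classical bootstrap draw: as many values as the data, WITH replacement
boot_draw <- sample(x, length(x), replace = TRUE)

anyDuplicated(sub)          # 0, i.e. no repeated values
length(unique(boot_draw))   # fewer than length(x), because values repeat
```

The loop in the question does the first kind of draw, which is why it is subsampling rather than bootstrapping.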
Actually, you can do this with much less code: just define your subsampling function FUN and replicate it.
FUN <- function() mean(sample(df, 700)) # df is the vector from the question
R <- 2e4
set.seed(49076)
S <- replicate(R, FUN())
That's it. You can easily calculate the mean,
mean(S)
# [1] 10000.16
and percentile confidence intervals.
## 95% CI
quantile(S, probs=c(.025, .975))
# 2.5% 97.5%
# 9584.945 10412.164
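As a rough sanity check on that interval, the spread of the subsample means can be compared with the textbook standard error for sampling without replacement, which includes a finite-population correction. This is only a sketch, assuming a normal approximation; N, n, sigma and se are names introduced here.

```r
x <- seq(1, 20000, 1)
N <- length(x)   # population size
n <- 700         # subsample size

# Population SD (divisor N), then SE of the mean with finite-population correction
sigma <- sqrt(mean((x - mean(x))^2))
se <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))

# Normal-approximation 95% interval around the full-data mean
approx_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * se
approx_ci
```

If the simulation is behaving, approx_ci should land close to the percentile interval above.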
Mimicking the plotting functionality of boot is also straightforward.
op <- par(mfrow=c(1, 2))
hist(S, breaks='FD', freq=FALSE)
qqnorm(S); qqline(S)
par(op)
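If you still want a boot object, one route that may work is the sim = "permutation" option of boot(). As I understand it, this passes a random permutation of the row indices to the statistic, so keeping only the first 700 of them gives a subsample without replacement. Treat this as a sketch under that assumption and check it against the boot documentation; n_sub and sub_mean are hypothetical names.

```r
library(boot)

x <- seq(1, 20000, 1)
n_sub <- 700   # subsample size (kept as a global; boot() has its own argument named m)

# Statistic: mean of the first n_sub entries of the index vector.
# With sim = "permutation" the indices should be a random permutation of 1:n,
# so the first n_sub of them form a subsample without replacement.
sub_mean <- function(data, ind) mean(data[ind[seq_len(n_sub)]])

set.seed(49076)
res <- boot(x, statistic = sub_mean, R = 1000, sim = "permutation")

# boot() computes t0 by calling the statistic on the original index order,
# which here is just the mean of the first n_sub values. Overwrite it with
# the full-data mean so plots and summaries are centred sensibly.
res$t0 <- mean(x)

mean(res$t)                               # average of the 1000 subsample means
quantile(res$t, probs = c(.025, .975))    # percentile interval, as before
```

After this, plot(res) and boot.ci(res, type = "perc") should accept the object. Note that the intervals describe subsample means of size 700, not a classical bootstrap of the full data.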