How do I perform a bootstrap in R and estimate the variance?-CodePudding

The dataset, "male.wt" is a collection of 100 weights of male taxi patrons. Use bootstrap sampling to estimate the variance for the population of males who use taxis.

I am trying to use the boot() function in R and I am just completely confused. Here is the data set that was given to me to do this problem.

malewt = structure(list(x = c(184.291514203183, 238.183299307855, 217.544606414151, 
233.931926116624, 229.12042611005, 243.881689583996, 259.230802242781, 
217.939619221934, 137.636923032685, 170.379447345948, 195.852641733122, 
185.832690963969, 186.676714564328, 215.711426139253, 186.413495533494, 
237.83223009147, 180.124153998503, 215.393108191779, 188.846039074142, 
142.373198101437, 233.234630310378, 186.141325709762, 220.062112044187, 
213.851199681057, 148.622198219149, 197.438771523918, 206.920961557603, 
190.874857845699, 217.889075914836, 152.318099234166, 218.089620221194, 
196.736930479919, 235.122424359223, 217.446826955801, 201.352404389309, 
216.290374765672, 173.85609629461, 215.961826427613, 213.87732008193, 
177.952521505061, 132.734879010504, 221.707886490889, 224.336488758995, 
218.604034088911, 228.157844234374, 196.544661577149, 228.787736646279, 
237.009125179319, 194.73342863066, 190.569523115323, 192.198491573128, 
204.589742888237, 198.662802876867, 195.238634847898, 201.834508205684, 
220.989134791548, 180.006492709174, 168.199898332071, 250.705048451896, 
209.824701073225, 212.36145906497, 205.250728119598, 196.572466206237, 
186.818746613236, 138.493748904934, 193.572713536688, 171.605082170236, 
243.803356964054, 188.768040728907, 201.408088256783, 196.23847341016, 
202.686141019735, 167.25735383257, 171.907526464761, 224.396425425799, 
183.494470842407, 220.15969728649, 143.164453849305, 152.539942653094, 
198.52004650272, 185.145815429412, 206.741840856439, 259.866591064748, 
135.212011256414, 164.2297511973, 200.623731663392, 199.599177980586, 
175.970651370212, 197.304554981825, 189.116019204125, 198.630618004183, 
185.096675814379, 203.780160863916, 174.584831373708, 150.483001599829, 
223.78078870159, 170.772181294322, 218.770812392057, 151.645084212409, 
210.350813872005)), class = "data.frame", row.names = c(NA, -100L
))

CodePudding user response：

Very ambiguous question. Here is how to plot a histogram of the bootstrap estimator of the variance:

library(purrr)
boots <- 100
data <- structure(list(x = c(184.291514203183, 238.183299307855, 217.544606414151, 233.931926116624, 229.12042611005, 243.881689583996, 259.230802242781, 217.939619221934, 137.636923032685, 170.379447345948, 195.852641733122, 185.832690963969, 186.676714564328, 215.711426139253, 186.413495533494, 237.83223009147, 180.124153998503, 215.393108191779, 188.846039074142, 142.373198101437, 233.234630310378, 186.141325709762, 220.062112044187, 213.851199681057, 148.622198219149, 197.438771523918, 206.920961557603, 190.874857845699, 217.889075914836, 152.318099234166, 218.089620221194, 196.736930479919, 235.122424359223, 217.446826955801, 201.352404389309, 216.290374765672, 173.85609629461, 215.961826427613, 213.87732008193, 177.952521505061, 132.734879010504, 221.707886490889, 224.336488758995, 218.604034088911, 228.157844234374, 196.544661577149, 228.787736646279, 237.009125179319, 194.73342863066, 190.569523115323, 192.198491573128, 204.589742888237, 198.662802876867, 195.238634847898, 201.834508205684, 220.989134791548, 180.006492709174, 168.199898332071, 250.705048451896, 209.824701073225, 212.36145906497, 205.250728119598, 196.572466206237, 186.818746613236, 138.493748904934, 193.572713536688, 171.605082170236, 243.803356964054, 188.768040728907, 201.408088256783, 196.23847341016, 202.686141019735, 167.25735383257, 171.907526464761, 224.396425425799, 183.494470842407, 220.15969728649, 143.164453849305, 152.539942653094, 198.52004650272, 185.145815429412, 206.741840856439, 259.866591064748, 135.212011256414, 164.2297511973, 200.623731663392, 199.599177980586, 175.970651370212, 197.304554981825, 189.116019204125, 198.630618004183, 185.096675814379, 203.780160863916, 174.584831373708, 150.483001599829, 223.78078870159, 170.772181294322, 218.770812392057, 151.645084212409, 210.350813872005)), class = "data.frame", row.names = c(NA, -100L ))
map(seq_len(boots),
    ~ data$x[sample.int(length(data$x), length(data$x), T)]
) %>% 
    map_dbl(var) %>% 
    hist()

^{Created on 2022-12-09 with reprex v2.0.2}

CodePudding user response：

Here is solution with base package boot. In the case of the variance it's very easy to bootstrap it.

Create a function bootvar to compute each resampled variance;
the function must have the data and indices as 1st and 2nd arguments. The arguments must be these and by this order, the indices vector is what gives the resampling from the data. It's a vector created automatically for the user by boot;
in the function extract the sample given by the indices vector (i below);
compute and return the statistic of interest, var.

I wrote the function in two code lines in order to make it more clear.

library(boot)

set.seed(2022)

bootvar <- function(data, i) {
  y <- data$x[i]
  var(y)
}
b <- boot(malewt, bootvar, R = 5000)
b
#> 
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#> 
#> 
#> Call:
#> boot(data = malewt, statistic = bootvar, R = 5000)
#> 
#> 
#> Bootstrap Statistics :
#>     original    bias    std. error
#> t1* 792.2551 -7.657891    106.0901

# bootstrapped variance
mean(b$t)
#> [1] 784.5972

hist(b$t)

^{Created on 2022-12-09 with reprex v2.0.2}

It is also possible to write the bootstrap function so that it takes a vector as its first argument, not a data.frame like above. The call to boot is then adapted accordingly. The results are exactly the same as long as the pseudo-RNG seed is set to the same value (in this case 2022).

library(boot)

set.seed(2022)

bootvar_x <- function(x, i) {
  y <- x[i]
  var(y)
}
bx <- boot(malewt$x, bootvar_x, R = 5000)
bx
#> 
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#> 
#> 
#> Call:
#> boot(data = malewt$x, statistic = bootvar_x, R = 5000)
#> 
#> 
#> Bootstrap Statistics :
#>     original    bias    std. error
#> t1* 792.2551 -7.657891    106.0901

mean(bx$t)
#> [1] 784.5972

hist(bx$t)

^{Created on 2022-12-09 with reprex v2.0.2}