I have a matrix with zero's and one's. ~30% of the sample are 1s, I want to estimate a confidence-interval around this percentage (e.g., "if I sampled the whole population there would be likely 28-32% "1"s). For doing so you can bootstrap from the sample, (redraw the sample N times from itself with replacement and analyze the distribution of the percentage of 1s over all redrawn samples). However my data is nested (highly correlated) within rows and within columns. I tried out whether this nestedness makes a difference (since I have dichotmous variables I can use rflip() which simulates a biased coinflip), it does:
library("mosaic")
#### data example ####
c1<-c(1,1,1,1,1,0,0,0,0,0) # high probability for "1"
c2<-c(1,0,0,0,0,0,0,0,0,0) # low probability for "1"
d<-cbind.data.frame(c1,c2)
#### a) resample over entire data ####
b<-vector()
for (i in 1:10000){
b[i] <- rflip(20, # Flip 20 times,
6/20)/ # Probability for "1": 6/20, i.e., probability for "0": 14/20
20 # divide by 20 to return relative frequency
}
mean(b)# returns 0.3007955 # mean over 10000 replications: close to 6/20
sd(b) # returns 0.1024339 # standard deviation important to compute confidence interval
#### b) resample per column ####
b1 <- vector()
b2 <- vector()
bt <- vector()
for (i in 1:10000){
b1[i] <- rflip(10,(5/10)) # Flip 10 times with probablility for c1
b2[i] <- rflip(10,(1/10)) # Flip 10 times with probablility for c2
bt[i] <- (b1[i] b2[i])/20 # sum up all 20 flips and divide by 20 to return relative frequency
}
mean(bt)# returns 0.3001475 # mean similar to a)
sd(bt) # returns 0.09214384 # standard deviation smaller than a)
When I redraw 10 times from column c1 and 10 times from column c2 and replicate this process 10,000 times the distribution of the observed probabilities is more narrow than when I sample 20 times from the entire data. If the probability for "1" is identical in both columns approaches a) and b) lead to the same standard deviation.
I now want to consider not only the columns but also the rows, e.g. I want to draw 10 times from column 1 and 10 times from column 2 and I want to constrain that among these 20 draws there must be two draws per row. My first idea would be:
forloop{
- Randomize order of columns
- draw 10 times from column 1 but constrain that there are maximum 2 redraws per row
- draw 10 times from column 2 but constrain that the redraws from column 1 plus the redraws from column 2 are maximum 2 redraws per row (if we had 2 redraws from row 1 for column 1, no redraws from row 1 for column 2)
}
Does anybody have an idea about how to do that or has anybody got a better idea? Must probably be a different function than rflip(). Would help me a lot!
Thanks, ajj
CodePudding user response:
Take a look at r2dtable
.
nrows <- 10L
ncols <- 6L
nr <- rep(ncols, nrows)
nc <- rep(nrows, ncols)
m <- r2dtable(1, nr, nc)[[1]]
m
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 2 0 1 1 1 1
#> [2,] 2 1 1 0 1 1
#> [3,] 0 1 1 2 2 0
#> [4,] 0 3 1 1 0 1
#> [5,] 1 2 1 0 1 1
#> [6,] 0 0 1 2 0 3
#> [7,] 3 0 1 1 0 1
#> [8,] 2 0 0 1 3 0
#> [9,] 0 0 1 2 1 2
#> [10,] 0 3 2 0 1 0
rowSums(m)
#> [1] 6 6 6 6 6 6 6 6 6 6
colSums(m)
#> [1] 10 10 10 10 10 10
I want to draw 10 times from column 1 and 10 times from column 2 and I want to constrain that among these 20 draws there must be two draws per row.
That would be:
nrows <- 10L
ncols <- 2L
nr <- rep(ncols, nrows)
nc <- rep(nrows, ncols)
m <- r2dtable(1, nr, nc)[[1]]
m
#> [,1] [,2]
#> [1,] 2 0
#> [2,] 0 2
#> [3,] 1 1
#> [4,] 2 0
#> [5,] 0 2
#> [6,] 2 0
#> [7,] 1 1
#> [8,] 1 1
#> [9,] 0 2
#> [10,] 1 1
CodePudding user response:
sample(rep(1:10,2),size=10,replace=FALSE)