Resample from data keeping factor distribution of specific variables

Time:12-25

I would like to resample my data with replacement while keeping the proportions of my two variables (I and O) constant, that is, the same number of 1s and 0s in the resampled sample as in the original. This is my data:

dat[,c(2,4,7)]
   I O SIDI.F
1  0 0     50
2  1 0     13
3  1 0     13
4  0 1     12
5  0 0     13
6  0 0     15
7  0 1     23
8  0 1     34

Since I could not find a way to do this, I simplified the problem and split the data set, trying to keep the proportion constant for at least one of O or I:

dat3
  O SIDI.F
1 0     50
2 0     13
3 0     13
4 1     12
5 0     13

dat2
  I SIDI.F
1 0     50
2 1     13
3 1     13
4 0     12
5 0     13

datBoot2 <- dat2[sample(1:nrow(dat2), 8, replace=TRUE), ]
datBoot3 <- dat3[sample(1:nrow(dat3), 8, replace=TRUE), ]

However, I still can't find a way to keep the proportions (the same number of 1s and 0s in the resampled data set). Can anyone help, please?

CodePudding user response:

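Sampling should involve some randomness, so I believe the rbinom() function can be used here. The probability of success (x == 1), passed as the prob argument, is calculated from the original input. Note that this reproduces the original proportion only on average; the number of 1s in any single draw will vary.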

mysample <- function(x) rbinom(length(x), 1, sum(x == 1)/length(x))
mysample(dat$O)
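
To see how closely rbinom() tracks the original proportion, a quick simulation can help. The sketch below is only an illustration: the vector O is retyped from the data in the question, and the seed and the 1000 repetitions are arbitrary choices, not part of the original answer.

```r
# Assumption: O is copied by hand from the question's data (three 1s out of eight).
O <- c(0, 0, 0, 1, 0, 0, 1, 1)

mysample <- function(x) rbinom(length(x), 1, sum(x == 1) / length(x))

set.seed(1)  # arbitrary seed, only for reproducibility

# Count the number of 1s in each of 1000 simulated samples
counts <- replicate(1000, sum(mysample(O)))

# Distribution of the number of 1s across the simulated samples
table(counts)
```

If the table shows several different counts, the proportion is preserved in expectation rather than exactly in every sample, which may or may not be acceptable depending on the goal.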

CodePudding user response:

Thank you all for your answers! I seem to have found a solution. I found some code in a different post on Stack Overflow and changed it to sample with replacement. Although I do not fully understand the code, it seems to work:

sampFreq <- function(cdf, col, ns) {
  # Ratio of the requested sample size to the original number of rows
  rat <- ns / nrow(cdf)
  sLevels <- levels(as.factor(cdf[, col]))
  rdata <- NULL
  for (lev in sLevels) {
    # Rows belonging to the current level of the stratifying column
    ldata <- cdf[cdf[, col] == lev, ]
    ndata <- nrow(ldata)
    # Rows to draw for this level, rounded to a whole number (at least 1);
    # without round(), sample() truncates a fractional size
    nsdata <- max(round(ndata * rat), 1)
    # Resample row positions within the level, with replacement
    srows <- sample(seq_len(ndata), nsdata, replace = TRUE)
    rdata <- rbind(rdata, ldata[srows, ])
  }
  rdata
}

datSample <- sampFreq(dat,"I",19)

Checking the proportions in the new sample with the following code suggests that they match the original:

freq_x<-table(datSample$I)
freq_x/sum(freq_x)
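
The function above stratifies on a single column. To keep the counts of I and O fixed at the same time, as the question originally asked, one option is to resample within each combination of the two variables. The following is a minimal sketch under that assumption; stratBoot is a hypothetical helper name, and the data frame is retyped from the question rather than taken from the original dat object.

```r
# Assumption: the three columns shown in the question, retyped by hand.
dat <- data.frame(
  I      = c(0, 1, 1, 0, 0, 0, 0, 0),
  O      = c(0, 0, 0, 1, 0, 0, 1, 1),
  SIDI.F = c(50, 13, 13, 12, 13, 15, 23, 34)
)

# Hypothetical helper: bootstrap rows separately within every observed
# combination of the columns named in `cols`, keeping each group's size.
stratBoot <- function(df, cols) {
  # split() on a list of factors groups by their interaction;
  # drop = TRUE discards combinations that never occur
  groups <- split(seq_len(nrow(df)), df[cols], drop = TRUE)
  idx <- unlist(lapply(groups, function(rows) {
    # Index via sample.int() so a single-row group is handled correctly
    rows[sample.int(length(rows), replace = TRUE)]
  }), use.names = FALSE)
  df[idx, , drop = FALSE]
}

boot <- stratBoot(dat, c("I", "O"))

# Cross-tabulations of the original and the resampled data, for comparison
table(dat$I, dat$O)
table(boot$I, boot$O)
```

Because every group is resampled to its original size, the joint counts of I and O stay fixed while the SIDI.F values within each group vary. The result has the same number of rows as the input; a different total size would need per-group sizes chosen explicitly.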