R: Custom randomization test function to test variables in a data frame

Time:05-26

For an assignment in R, I need to take a data frame with several variables and write a function that computes resampled absolute mean differences between the two categories in the data frame.

For the sake of my question I'll add an example data frame:

Variable 1   Variable 2   Variable 3   Category
1            2            3            1
4            5            6            1
7            8            9            2
10           11           12           2

The function needs to accept three arguments: a numeric vector, the two categories within the data frame, and nsim (number of times to resample randomly). The output should be a vector of length nsim with the resampled absolute mean differences.

This is the function I've tried, but when I test it the output is always NaN.

set.seed(12345)
test<-function(x, category1, category2, nsim){
 resampled<-sample(df, size=length(nrow(df)), replace=F)
 category1.mean<-sum(df$x[resampled=="category1"])/length(df$x[resampled=="category1"])
 category2.mean<-sum(df$x[resampled=="category2"])/length(df$x[resampled=="category2"])
 return(abs(category1.mean-category2.mean))}

I'm not sure whether I'm misunderstanding how function() works, the question itself, or the data, but I've tried a few things to fix the NaN output without success.

Can anyone help me out?

Answer:

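The code below uses replicate() to call the inner function f nsim times. Each call to f resamples the rows of the data with replacement and returns the absolute difference between the two category means.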

x<-'Variable1   Variable2   Variable3   Category
1   2   3   1
4   5   6   1
7   8   9   2
10  11  12  2'
df1 <- read.table(textConnection(x), header = TRUE)

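test <- function(data, x, category1, category2, nsim){
  # f performs one resample and returns one absolute mean difference
  f <- function(data, x, category1, category2) {
    # draw row indices with replacement
    i <- sample(nrow(data), replace = TRUE)
    d <- data[i, ]
    # rows of the resample that fall in each category
    j1 <- which(d[["Category"]] == category1)
    j2 <- which(d[["Category"]] == category2)
    # x is a column name, so index by name rather than with `$`
    v1 <- d[j1, x, drop = TRUE]
    v2 <- d[j2, x, drop = TRUE]
    # a resample can miss one category (or both) entirely
    diff_means <- if(length(v1) == 0 && length(v2) == 0) {
      NaN
    } else if(length(v1) == 0) {
      mean(v2)
    } else if(length(v2) == 0) {
      mean(v1)
    } else mean(v1) - mean(v2)
    abs(diff_means)
  }
  # returns a numeric vector of length nsim
  replicate(nsim, f(data, x, category1, category2))
}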

set.seed(2022)

# amd: absolute mean differences
amd <- test(df1, "Variable1", 1, 2, nsim = 1e3)
hist(amd)

Created on 2022-05-26 by the reprex package (v2.0.1)
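
As for the NaN in the question, one likely cause is that df$x looks for a column literally named "x" instead of using the value of the argument x. The sketch below shows this on a toy data frame that mirrors the example (the names by_dollar, by_bracket and nan_result are only for illustration). The function in the question has other problems too: length(nrow(df)) is always 1, "category1" in quotes is a literal string rather than the argument, and nsim is never used.

```r
# Toy data frame mirroring the example in the question.
df <- data.frame(Variable1 = c(1, 4, 7, 10), Category = c(1, 1, 2, 2))

x <- "Variable1"          # column name held in a variable

# `$` looks for a column literally named "x", which does not exist here.
by_dollar <- df$x         # NULL
# `[[` evaluates x first, so it finds the Variable1 column.
by_bracket <- df[[x]]

# sum(NULL) is 0 and length(NULL) is 0, so the hand-rolled mean is 0 / 0.
nan_result <- sum(by_dollar) / length(by_dollar)

print(is.null(by_dollar))
print(by_bracket)
print(nan_result)
```

Indexing with df[[x]] (or passing the numeric vector itself, as the assignment asks) avoids the NULL.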

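
Note that the answer above resamples rows with replacement (replace = TRUE), which is closer to a bootstrap. If the assignment wants a randomization (permutation) test, a common alternative is to shuffle the category labels without replacement and recompute the difference each time. Here is a minimal sketch of that idea. The function name perm_abs_mean_diff and the extra group argument are my own assumptions, not part of the question or the answer, so adapt the signature to whatever the assignment requires.

```r
# Hypothetical helper (not from the original answer): label-shuffling
# randomization test. Assumes `x` is a numeric vector and `group` is a vector
# of the same length that contains both `category1` and `category2`.
perm_abs_mean_diff <- function(x, group, category1, category2, nsim) {
  stopifnot(length(x) == length(group))
  one_shuffle <- function() {
    # permute the labels without replacement; x stays in place
    shuffled <- group[sample.int(length(group))]
    abs(mean(x[shuffled == category1]) - mean(x[shuffled == category2]))
  }
  replicate(nsim, one_shuffle())
}

df1 <- data.frame(
  Variable1 = c(1, 4, 7, 10),
  Category  = c(1, 1, 2, 2)
)

set.seed(2022)
null_diffs <- perm_abs_mean_diff(df1$Variable1, df1$Category, 1, 2, nsim = 1000)

# Observed statistic on the unshuffled labels.
observed <- abs(mean(df1$Variable1[df1$Category == 1]) -
                mean(df1$Variable1[df1$Category == 2]))

# Share of shuffles at least as extreme as the observed difference.
p_value <- mean(null_diffs >= observed)
```

Because the labels are only permuted, both categories keep their original sizes in every shuffle, so the empty-group cases handled in the answer cannot occur here.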