How to randomize or shuffle only one-axis and estimate estimated correlation in R?-CodePudding

I am trying to create a null expectation in R. The dataset has four species and associated sample values of x.real and y.real. I want to create a null expectation by only shuffling the x.real for each species. Estimate slope. Repeat 1000 times, say. And then calculate the mean of the slope.

Here is my attempt. I get an error that says Error in filter(., sp == "A") : object '*tmp*' not found Any suggestions on how to correct this to get the desired output (attached).

set.seed(111)
library(truncnorm)
x.real <- rtruncnorm(n = 288,a = 0,b = 10,mean = 5,sd = 2)
y.real <- rnorm(288,0,4)
sp <- rep(c("A","B","C","D"), each = 72)

df <- data.frame(x.real, y.real, sp)

output <- tibble(y.intercept = numeric(),
                 slope = numeric(),
                 sp = character(),
                 set = numeric())

set.seed(42)                    
for(i in 1:1000){
  samp1 <- df %>% filter(sp == 'A') %>% mod1 <- lm(y.real ~ sample(x.real, length(x.real), replace = TRUE)) %>% summarise(y.intercept = mean(mod1$coefficients[1]), slope =  mean(mod1$coefficients[2])) %>% mutate(set = i)
  samp2 <- df %>% filter(sp == 'B') %>% mod2 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod2$coefficients[1]), slope =  mean(mod2$coefficients[2])) %>% mutate(set = i)
  samp3 <- df %>% filter(sp == 'C') %>% mod3 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod3$coefficients[1]), slope =  mean(mod3$coefficients[2])) %>% mutate(set = i)
  samp4 <- df %>% filter(sp == 'D') %>% mod4 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod4$coefficients[1]), slope =  mean(mod4$coefficients[2])) %>% mutate(set = i)
  output %>% add_row(bind_rows(samp1, samp2, samp3, samp4))  -> output
}

 y.intercept slope sp
     2         2    A
     3         1    B
     1        -4    C
     2        -1    D

CodePudding user response：

You can do this as follows:

set df as data.table
create small function f that estimates the model for a given x,y, and returns the coefficients
run the function 1000 times by group (i.e. by species), and return the result using rbindlist (this will return a data.table of length 1000x4x2 = 8000 rows)
add a column (term) indicating which rows are intercept and which are slope
by sp and term, get the mean of the coefficient

# set df as data.table
setDT(df)

# function to estimate model coefficients
f <- function(x,y) lm(y~sample(x, length(x), replace=T))$coef

# 1000 models for each species
result = rbindlist(
  lapply(1:1000, \(i) df[,.(est = f(x.real,y.real)), sp][, i:=i])
)

# add a columns for intercept/slope
result[, terms:=rep(c("intercept", "slope"), length.out=nrow(result))]

# get mean of intercept/slope by species, and use `dcast` to turn to wide format

dcast(
  result[, .(mean_coef = mean(est)), .(sp,terms)],
  sp~terms, value.var = "mean_coef"
)

Output:

   sp  intercept        slope
1:  A -0.6161870  0.001203426
2:  B  0.2314204  0.001683623
3:  C -0.7333392  0.008367585
4:  D  0.1584692 -0.005298080

If you want to follow the tidyverse/for loop approach you were using, here are some edits:

library(tidyverse)
output <- tibble(est = numeric(),
                 sp = character(),
                 set = numeric())
set.seed(42)                    
for(i in 1:1000){
  sampa = df %>% filter(sp=="A") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="A", set=i)
  sampb = df %>% filter(sp=="B") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="B", set=i)
  sampc = df %>% filter(sp=="C") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="C", set=i)
  sampd = df %>% filter(sp=="D") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="D", set=i)
  output %>% add_row(bind_rows(sampa, sampb, sampc, sampd))  -> output
}
output %>% 
  mutate(term = rep(c("intercept", "slope"), length.out=nrow(output))) %>% 
  group_by(sp, term) %>% 
  summarize(mean_coef = mean(est)) %>% 
  pivot_wider(sp, names_from=term, values_from=mean_coef)

Similarly, here, output has 8000 rows (one row for each of 1000 iterations x 2 coefficients x 4 species). I similarly use mutate to add a term column, and then, grouping by species and term, get the mean of the estimate, and pivot wider to get 4 rows:

  sp    intercept     slope
  <chr>     <dbl>     <dbl>
1 A        -0.603 -0.000827
2 B         0.272 -0.00587 
3 C        -0.692  0.000627
4 D         0.146 -0.00351

I found the approach with data.table and lapply() to be about 10 times faster.