Use index variable in a for loop as a column name within dplyr's summarize function in R

Issue: I'm having trouble getting dplyr's summarize function to recognize that the index variable holds a column name.

for (i in colnames(df)){
    temp_frame <- df%>%
      select(group, i)%>%
      group_by(group)%>%
      summarize(yes = sum(i == "1"), no = sum(i=="0"))
}

Full project: I have a data frame of variables I'm attempting to run stats on. Because it's so long, I'm trying to write a for loop that pairs the independent variable column (group) with each dependent variable column and builds a table for each pair, formatted so that I can run Fisher's exact test on it. Without the for loop, the Fisher's exact test code runs perfectly, but once I build the tables inside the loop, the summarize function doesn't seem to understand that i is a column name within temp_frame.

Here's the full code with some stand-in data:

#create starting data frame
df <- data.frame(
  group= c("control", "control", "control", "experimental", "experimental", "experimental"),
  a = c(1,0,1,1,1,0),
  b = c(0,1,0,0,1,1)
)

#create empty stats data frame
stats <- data.frame(name = "", fish="")

for (i in colnames(df)){
  if (i == group){ #skip over the "group" column to avoid trying to make a group vs group test
    next
  } else{
    temp_frame <- df%>%
      select(group, i)%>%
      group_by(group)%>%
      summarize(yes = sum(i == "1"), no = sum(i=="0"))%>% 
      select(-group)
    
    stats[nrow(stats) + 1,] = c(i, fisher.test(temp_frame)$p.value) # add each p value to the stats data frame

  }
}

Outside of the for loop, were I to just specify the a column in the summarize function, the temp_frame would first look like this, which is exactly what I want:

yes	no
2	1
2	1

but instead, I'm just getting

yes	no
0	0
0	0

I think it's because it's not recognizing that a is a column name and is instead giving me the output as if a were a string.

How do I tell it that i represents a column name within the temp_frame df?

CodePudding user response：

I figured it out!

Within the for loop, I added a temporary column to the original df that matched the loop variable's column and then used that temporary column instead of i in the dplyr functions.

for (i in colnames(df)){
  if (i == "group"){ # "group" needs quotes here: i holds each column name as a string
    next
  } else{
    df$temp_column <- df[[i]] # copy the current column under a fixed name
    temp_frame <- df%>%
      select(group, temp_column)%>%
      group_by(group)%>%
      summarize(yes = sum(temp_column == "1"), no = sum(temp_column=="0"))%>% 
      select(-group)
    
    stats[nrow(stats) + 1,] = c(i, fisher.test(temp_frame)$p.value)
  }
}
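
As a possible variation on the temporary-column approach, dplyr's .data pronoun can look a column up by a string, so no extra column has to be added to df. The snippet below is only a minimal sketch for a single value of i, reusing the stand-in data from the question; it assumes a dplyr version that supports .data[[...]] inside summarise.

```r
library(dplyr)

# Stand-in data from the question
df <- data.frame(
  group = c("control", "control", "control", "experimental", "experimental", "experimental"),
  a = c(1, 0, 1, 1, 1, 0),
  b = c(0, 1, 0, 0, 1, 1)
)

i <- "a" # the loop variable holds the column name as a string

# .data[[i]] pulls the column named by i out of the (grouped) data
temp_frame <- df %>%
  group_by(group) %>%
  summarise(yes = sum(.data[[i]] == 1), no = sum(.data[[i]] == 0)) %>%
  select(-group)

print(temp_frame)
```

Inside the loop, the same summarise call would replace the temp_column lines, leaving df itself unchanged.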

CodePudding user response：

There may well be a better solution that does not involve a for loop, and other users will probably find a way to do this with dplyr verbs like across.

But I appreciate that you found a solution that works, and I want to focus on your code.

The key point was to use eval(as.symbol(i)). I got it from a post here called Getting strings recognized as variable names, which describes your problem accurately.
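
To make the difference concrete, here is a small sketch (with a cut-down, made-up data frame) that compares the bare string with the evaluated symbol inside summarise. The column names as_string and as_symbol are just illustrative labels.

```r
library(dplyr)

# Cut-down illustrative data
df <- data.frame(
  group = c("control", "control", "experimental"),
  a = c(1, 0, 1)
)

i <- "a"

result <- df %>%
  summarise(
    # i is the string "a", so this compares "a" with "1": one FALSE, sum 0
    as_string = sum(i == "1"),
    # as.symbol(i) builds the name a, and eval() looks it up as a column
    as_symbol = sum(eval(as.symbol(i)) == 1)
  )

print(result)
```

The first column shows why the original loop returned zeros: the comparison never touched the data, only the one-element string held in i.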

Mixing base R with the tidyverse is sometimes tricky. For example, I had to put quotes around group in the if statement, because i holds each column name as a string.

I also replaced the == operators with %in%. Both compare element-wise and give the same counts for this data, but == returns NA for missing values, which makes the whole sum NA, while %in% returns FALSE for them.
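
One concrete difference between == and %in% shows up with missing values. The vector below is made up for illustration and is not part of the original data.

```r
# Illustrative vector with one missing value
x <- c(1, 0, NA, 1)

with_equals <- sum(x == 1)               # NA: the comparison with NA is NA
with_in     <- sum(x %in% 1)             # 2: %in% yields FALSE for the NA
with_na_rm  <- sum(x == 1, na.rm = TRUE) # 2: == also works if NAs are dropped

print(c(with_equals, with_in, with_na_rm))
```

On columns without missing values, such as a and b in the stand-in data, both operators give the same counts.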

I also removed the select before the group_by. It is redundant, because summarise keeps only the grouping column and the new summary columns anyway.

#create starting data frame
df <- data.frame(
  group= c("control", "control", "control", "experimental", "experimental", "experimental"),
  a = c(1,0,1,1,1,0),
  b = c(0,1,0,0,1,1)
)

#create empty stats data frame
stats <- data.frame(name = "", fish="")

for (i in colnames(df)){
  if (i == "group"){ 
    #skip over the "group" column to avoid trying to make a group vs group test
    next
  } else{
    temp_frame <- df %>%
      group_by(group) %>% 
      summarise(yes = sum(eval(as.symbol(i)) %in% 1),
                no  = sum(eval(as.symbol(i)) %in% 0)) %>%
      select(-group)
    
    stats[nrow(stats) + 1,] = c(i, fisher.test(temp_frame)$p.value) 
    # add each p value to the stats data frame
    
  }
}

Output for i = "b"

> temp_frame
# A tibble: 2 × 2
    yes    no
  <int> <int>
1     1     2
2     2     1
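
For comparison, here is one possible loop-free sketch. It uses base R's vapply and table rather than across, so it is a swapped-in technique and not the approach used above. The name outcome_cols is just an illustrative helper. The contingency table is built directly from group and each outcome column; its columns are ordered 0 then 1 instead of yes then no, which should not change the Fisher's exact test p-value.

```r
# Stand-in data from the question
df <- data.frame(
  group = c("control", "control", "control", "experimental", "experimental", "experimental"),
  a = c(1, 0, 1, 1, 1, 0),
  b = c(0, 1, 0, 0, 1, 1)
)

# Every column except the grouping column is treated as an outcome
outcome_cols <- setdiff(colnames(df), "group")

# One p-value per outcome column; df[[col]] looks the column up by its string name
p_values <- vapply(
  outcome_cols,
  function(col) fisher.test(table(df$group, df[[col]]))$p.value,
  numeric(1)
)

# Collect the results without the empty placeholder row used in the loop version
stats <- data.frame(name = outcome_cols, fish = p_values, row.names = NULL)
print(stats)
```

Here fish stays numeric, whereas in the loop version c(i, p.value) turns the p-value into a character string.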

Hope this helps!