Issue: I'm having trouble getting dplyr's summarize function to recognize that the loop variable holds a column name.
for (i in colnames(df)){
  temp_frame <- df %>%
    select(group, i) %>%
    group_by(group) %>%
    summarize(yes = sum(i == "1"), no = sum(i == "0"))
}
Full project: I have a data frame of variables I'm attempting to run stats on. Because it's so long, I'm trying to create a for loop that takes the independent variable column (group) and each dependent variable column and creates a set of tables that are properly formatted for Fisher's exact test. Without a for loop, the code for the Fisher's exact test runs perfectly, but once I try to make the tables in the for loop, the summarize function doesn't seem to understand that i is a column name within the temp_frame.
Here's the full code with some stand-in data:
#create starting data frame
df <- data.frame(
  group = c("control", "control", "control", "experimental", "experimental", "experimental"),
  a = c(1,0,1,1,1,0),
  b = c(0,1,0,0,1,1)
)
#create empty stats data frame
stats <- data.frame(name = "", fish = "")
for (i in colnames(df)){
  if (i == group){ #skip over the "group" column to avoid trying to make a group vs group test
    next
  } else{
    temp_frame <- df %>%
      select(group, i) %>%
      group_by(group) %>%
      summarize(yes = sum(i == "1"), no = sum(i == "0")) %>%
      select(-group)
    stats[nrow(stats) + 1,] = c(i, fisher.test(temp_frame)$p.value) # add each p value to the stats data frame
  }
}
Outside of the for loop, were I to just specify the a column in the summarize function, the temp_frame would first look like this, which is exactly what I want:
yes | no |
---|---|
2 | 1 |
2 | 1 |
but instead, I'm just getting
yes | no |
---|---|
0 | 0 |
0 | 0 |
I think it's because it's not recognizing that a is a column name and instead, just giving me the output as if a was a string.
How do I tell it that i represents a column name within the temp_frame df?
CodePudding user response:
I figured it out!
Within the for loop, I added a temporary column to the original df that matched the loop variable's column and then used that temporary column instead of i in the dplyr functions.
for (i in colnames(df)){
  if (i == "group"){ # compare against the string "group", not an object named group
    next
  } else{
    # copy the current column into a fixed-name column that dplyr can refer to directly
    df$temp_column <- df[[i]]
    temp_frame <- df %>%
      select(group, temp_column) %>%
      group_by(group) %>%
      summarize(yes = sum(temp_column == "1"), no = sum(temp_column == "0")) %>%
      select(-group)
    stats[nrow(stats) + 1,] = c(i, fisher.test(temp_frame)$p.value)
  }
}
CodePudding user response:
For sure there might be a better solution that does not involve a for loop, and other users will certainly find a way to do this using dplyr verbs like across.
But I appreciate that you found a solution that works, and I want to focus on your code.
The key point is to use eval(as.symbol(i)). I got it from a post here called Getting strings recognized as variable names, which describes your problem accurately.
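As an aside, dplyr also offers the .data pronoun for this situation, so .data[[i]] can stand in for eval(as.symbol(i)). Below is a minimal sketch on the same stand-in data, with i fixed to "a" for illustration rather than taken from the loop:

```r
library(dplyr)

# same stand-in data as in the question
df <- data.frame(
  group = c("control", "control", "control",
            "experimental", "experimental", "experimental"),
  a = c(1, 0, 1, 1, 1, 0),
  b = c(0, 1, 0, 0, 1, 1)
)

i <- "a"  # column name held as a string, as it would be inside the loop

temp_frame <- df %>%
  group_by(group) %>%
  # .data[[i]] looks up the column whose name is stored in i
  summarise(yes = sum(.data[[i]] == 1),
            no  = sum(.data[[i]] == 0)) %>%
  select(-group)

print(temp_frame)
```

Inside the loop, the same summarise() call should work unchanged, since i already holds each column name in turn.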
The combination of base R with tidyverse is sometimes tricky. For example, I had to put quotes around group in the if statement, so that i is compared with the string "group" rather than with an object called group.
Something else is that I replaced the == operators with %in%. Both compare element-wise and give the same counts on this data, but == returns NA for missing values, which makes sum() return NA, whereas %in% always returns TRUE or FALSE.
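To see how == and %in% behave when counting, here is a small sketch on a made-up vector x (not part of your data) that contains one missing value:

```r
# hypothetical vector with one missing value
x <- c(1, NA, 0, 1)

eq_count  <- sum(x == 1)                # NA, because NA == 1 is NA
in_count  <- sum(x %in% 1)              # %in% yields FALSE for the NA, so the sum is a number
eq_narm   <- sum(x == 1, na.rm = TRUE)  # dropping the NA gives the same count as %in%

print(c(eq_count, in_count, eq_narm))
```

With no missing values, as in the stand-in data, either operator should give the same yes and no counts.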
Also, I removed the redundant select before the group_by.
#create starting data frame
df <- data.frame(
  group = c("control", "control", "control", "experimental", "experimental", "experimental"),
  a = c(1,0,1,1,1,0),
  b = c(0,1,0,0,1,1)
)
#create empty stats data frame
stats <- data.frame(name = "", fish = "")
for (i in colnames(df)){
  if (i == "group"){
    #skip over the "group" column to avoid trying to make a group vs group test
    next
  } else{
    temp_frame <- df %>%
      group_by(group) %>%
      summarise(yes = sum(eval(as.symbol(i)) %in% 1),
                no = sum(eval(as.symbol(i)) %in% 0)) %>%
      select(-group)
    stats[nrow(stats) + 1,] = c(i, fisher.test(temp_frame)$p.value)
    # add each p value to the stats data frame
  }
}
Output for i = "b":
> temp_frame
# A tibble: 2 × 2
    yes    no
  <int> <int>
1     1     2
2     2     1
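For the loop-free route mentioned at the start of this answer, one possible sketch is below. Note that it swaps in base R's sapply and table rather than dplyr's across, and the names dep_vars and p_values are made up for illustration. It also assumes every dependent column contains both 0 and 1, since fisher.test needs at least a 2 x 2 table.

```r
# same stand-in data as above
df <- data.frame(
  group = c("control", "control", "control",
            "experimental", "experimental", "experimental"),
  a = c(1, 0, 1, 1, 1, 0),
  b = c(0, 1, 0, 0, 1, 1)
)

# every column except the grouping column (hypothetical helper name)
dep_vars <- setdiff(colnames(df), "group")

# one Fisher's exact test per dependent column, on the group-by-value table
p_values <- sapply(dep_vars, function(v) {
  fisher.test(table(df$group, df[[v]]))$p.value
})

# collect the results; fish stays numeric here instead of being coerced to text
stats <- data.frame(name = dep_vars, fish = unname(p_values))
print(stats)
```

Here table(df$group, df[[v]]) builds the group-by-value contingency table directly, so the yes/no summarise step is not needed.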
Hope this helps!