I would like to get the values from multiple columns and rows of a data frame, do something with those values (e.g., calculate a mean), then add the results to a new column in a modified version of the original data frame. The columns and rows are selected based on values in other columns. I've gotten this working in dplyr, but only when I "hard code" the column names that are used by the function. Example code:
'''
# create a test data frame
tdf <- data.frame(A=c('a','a','b','b','b'), B=c('d','d','e','e','f'),
L1=as.numeric(c(1,2,3,4,5)), L2=as.numeric(c(11,12,13,'na',15)),
L3=as.numeric(c('na',22,23,'na',25)), stringsAsFactors=FALSE)
'''
which gives
A B L1 L2 L3
1 a d 1 11 NA
2 a d 2 12 22
3 b e 3 13 23
4 b e 4 NA NA
5 b f 5 15 25
In this case, I would like to calculate the mean of the values in the L* columns (L1,L2,L3) that have the same values in columns A and B. For example, using the values in A and B, I select rows 1 & 2 and calculate a mean using (1, 11, 2, 12, 22), then rows 3 & 4 (3, 13, 23, 4), and finally row 5 (5, 15, 25).
I can do this using dplyr using either (both work):
'''
ddply(tdf, .(A, B), summarize, mean_L=mean(c(L1, L2, L3), na.rm=TRUE))
or
tdf %>% group_by(A,B) %>% summarize(mean_L=mean(c(L1,L2,L3), na.rm=TRUE))
'''
which gives what I want:
A B mean_L
1 a d 9.60
2 b e 10.75
3 b f 15.00
However, my issue is that the number of "L" columns is dynamic among different data sets. In some cases I may have 10 total columns (L1, L2, ... L10) or 100 columns. The columns I use for the selection criteria (in this case A and B), will always be the same, so I can "hard code" those, but I'm having difficulty specifying the columns in the "mean" function.
dplyr has a way of dynamically generating the "group by" variables, but that does not seem to work within the function component of the summarize. For example, I can do this:
'''
b <- names(tdf)[1:2]
dots <- lapply(b, as.symbol)
tdf %>% group_by(.dots=dots) %>% summarize(mean_L=mean(c(L1,L2,L3), na.rm=TRUE))
'''
but I can't do the same inside the mean function. The closest I have come to working is:
'''
b='L1'
tdf %>% group_by(A,B) %>% summarize(mean_L=mean(.data[[b]], na.rm=TRUE))
'''
but this only works for specifying a single column. If I try b='L1,L2,L3', it seems dplyr uses the literal "L1,L2,L3" as a column name and not as a list.
This doesn't seem to be a complicated problem, but I would like help finding the solution, either in dplyr or some other way.
Many thanks!
CodePudding user response:
tdf %>%
group_by_at(1:2) %>%
summarise(mean_L=mean(c_across(starts_with("L")),
na.rm=TRUE)) %>%
ungroup()
CodePudding user response:
No matter how many L
columns you have you can always think of transforming your data set into long format and group them based on your variables:
library(tidyverse)
df %>%
pivot_longer(!c(A, B)) %>%
group_by(A, B) %>%
summarise(L_mean = mean(value, na.rm = TRUE))
# A tibble: 3 × 3
# Groups: A [2]
A B L_mean
<chr> <chr> <dbl>
1 a d 9.6
2 b e 10.8
3 b f 15