Home > database >  Combine dynamic number of variables within dplyr function
Combine dynamic number of variables within dplyr function

Time:09-28

I would like to get the values from multiple columns and rows of a data frame, do something with those values (e.g., calculate a mean), then add the results to a new column in a modified version of the original data frame. The columns and rows are selected based on values in other columns. I've gotten this working in dplyr, but only when I "hard code" the column names that are used by the function. Example code:

    ''' 
        # create a test data frame 
        tdf <- data.frame(A=c('a','a','b','b','b'), B=c('d','d','e','e','f'),
               L1=as.numeric(c(1,2,3,4,5)), L2=as.numeric(c(11,12,13,'na',15)),
               L3=as.numeric(c('na',22,23,'na',25)), stringsAsFactors=FALSE)
    '''

which gives

      A B L1 L2 L3
    1 a d  1 11 NA
    2 a d  2 12 22
    3 b e  3 13 23
    4 b e  4 NA NA
    5 b f  5 15 25

In this case, I would like to calculate the mean of the values in the L* columns (L1,L2,L3) that have the same values in columns A and B. For example, using the values in A and B, I select rows 1 & 2 and calculate a mean using (1, 11, 2, 12, 22), then rows 3 & 4 (3, 13, 23, 4), and finally row 5 (5, 15, 25).

I can do this using dplyr using either (both work):

    '''
    ddply(tdf, .(A, B), summarize, mean_L=mean(c(L1, L2, L3), na.rm=TRUE))
    or
    tdf %>% group_by(A,B) %>% summarize(mean_L=mean(c(L1,L2,L3), na.rm=TRUE))
    '''

which gives what I want:

      A B mean_L
    1 a d   9.60
    2 b e  10.75
    3 b f  15.00

However, my issue is that the number of "L" columns is dynamic among different data sets. In some cases I may have 10 total columns (L1, L2, ... L10) or 100 columns. The columns I use for the selection criteria (in this case A and B), will always be the same, so I can "hard code" those, but I'm having difficulty specifying the columns in the "mean" function.

dplyr has a way of dynamically generating the "group by" variables, but that does not seem to work within the function component of the summarize. For example, I can do this:

    '''
    b <- names(tdf)[1:2]
    dots <- lapply(b, as.symbol)
    tdf %>% group_by(.dots=dots) %>% summarize(mean_L=mean(c(L1,L2,L3), na.rm=TRUE))
    '''

but I can't do the same inside the mean function. The closest I have come to working is:

    '''
    b='L1'
    tdf %>% group_by(A,B) %>% summarize(mean_L=mean(.data[[b]], na.rm=TRUE))
    '''

but this only works for specifying a single column. If I try b='L1,L2,L3', it seems dplyr uses the literal "L1,L2,L3" as a column name and not as a list.

This doesn't seem to be a complicated problem, but I would like help finding the solution, either in dplyr or some other way.

Many thanks!

CodePudding user response:

tdf %>% 
group_by_at(1:2) %>% 
summarise(mean_L=mean(c_across(starts_with("L")),
                      na.rm=TRUE)) %>% 
ungroup()

CodePudding user response:

No matter how many L columns you have you can always think of transforming your data set into long format and group them based on your variables:

library(tidyverse)

df %>%
  pivot_longer(!c(A, B)) %>%
  group_by(A, B) %>%
  summarise(L_mean = mean(value, na.rm = TRUE))

# A tibble: 3 × 3
# Groups:   A [2]
  A     B     L_mean
  <chr> <chr>  <dbl>
1 a     d        9.6
2 b     e       10.8
3 b     f       15  
  • Related