Home > database >  Using summarize(across(..., .fns = ...)) with a multi-variate function
Using summarize(across(..., .fns = ...)) with a multi-variate function

Time:12-26

My problem requires me to summarize data across multiple columns, but each column must be summarized by a multi-variate function of three other columns.

I have a data frame with hundreds of columns carrying different statistical information about a dataset. Here is a similarly structured, smaller dataframe.

df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%

The data needs to condensed into rows summarizing one-hour chunks. Summarizing the averages is easy enough: I can simply average them, since the number of measurements is consistent within an hour.

  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),

But combining standard deviations is trickier, since I need information from three separate columns to compute it. Manually, I would do it like this:

            combined_a1_Std = sqrt((1/n())*sum(a1_Std^2   (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std^2   (a2_Avg - combined_a2_Avg)^2)))

But this isn't feasible for hundreds of columns.

Is there a simple way to do this?

Here's the full code above, and the desired output:

set.seed(1)
df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%
  mutate(Hour = floor(Hour)) %>%
  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
            combined_a1_Std = sqrt((1/n())*sum(a1_Std^2   (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std^2   (a2_Avg - combined_a2_Avg)^2)))

df

   Hour combined_a1_Avg combined_a2_Avg combined_a1_Std combined_a2_Std
  <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
1     1         -0.221          -0.0306           0.859           0.859
2     2          0.0672          0.819            1.17            1.17 
3     3          0.487           0.782            0.116           0.116
4     4          0.657          -0.957            0.795           0.795
5     5         -0.305           0.620            0.583           0.583

CodePudding user response:

An option is to loop over one set of columns, then get the other set by replacing the substring in the column names

library(dplyr)
library(stringr)
out2 <- df %>% 
   mutate(Hour = floor(Hour)) %>%
   group_by(Hour) %>%
   summarize(across(matches("a\\d _Avg"), ~ mean(.x),
    .names = "combined_{col}"), 
         across(matches('^a\\d _Avg$'),
     ~ sqrt((1/n())*sum(get(str_replace(cur_column(), "Avg", "Std"))  
                   (. - get(str_c( "combined_", cur_column() )))^2)), 
      .names = "combined_{str_replace(.col, 'Avg', 'Std')}"))

-checking with OP's manual approach

out1 <- df %>%
   mutate(Hour = floor(Hour)) %>%
  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
            combined_a1_Std = sqrt((1/n())*sum(a1_Std   (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std   (a2_Avg - combined_a2_Avg)^2)))
identical(out1, out2)
[1] TRUE

data

set.seed(1)
df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2))
  • Related