My problem requires me to summarize data across multiple columns, but each column must be summarized by a multi-variate function of three other columns.
I have a data frame with hundreds of columns carrying different statistical information about a dataset. Here is a similarly structured, smaller dataframe.
df <- data.frame(a1_Avg = rnorm(10),
a1_Std = runif(10),
a2_Avg = rnorm(10),
a2_Std = runif(10),
Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%
The data needs to condensed into rows summarizing one-hour chunks. Summarizing the averages is easy enough: I can simply average them, since the number of measurements is consistent within an hour.
group_by(Hour) %>%
summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
But combining standard deviations is trickier, since I need information from three separate columns to compute it. Manually, I would do it like this:
combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 (a1_Avg - combined_a1_Avg)^2)),
combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 (a2_Avg - combined_a2_Avg)^2)))
But this isn't feasible for hundreds of columns.
Is there a simple way to do this?
Here's the full code above, and the desired output:
set.seed(1)
df <- data.frame(a1_Avg = rnorm(10),
a1_Std = runif(10),
a2_Avg = rnorm(10),
a2_Std = runif(10),
Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%
mutate(Hour = floor(Hour)) %>%
group_by(Hour) %>%
summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 (a1_Avg - combined_a1_Avg)^2)),
combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 (a2_Avg - combined_a2_Avg)^2)))
df
Hour combined_a1_Avg combined_a2_Avg combined_a1_Std combined_a2_Std
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.221 -0.0306 0.859 0.859
2 2 0.0672 0.819 1.17 1.17
3 3 0.487 0.782 0.116 0.116
4 4 0.657 -0.957 0.795 0.795
5 5 -0.305 0.620 0.583 0.583
CodePudding user response:
An option is to loop over one set of columns, then get
the other set by replacing the substring in the column names
library(dplyr)
library(stringr)
out2 <- df %>%
mutate(Hour = floor(Hour)) %>%
group_by(Hour) %>%
summarize(across(matches("a\\d _Avg"), ~ mean(.x),
.names = "combined_{col}"),
across(matches('^a\\d _Avg$'),
~ sqrt((1/n())*sum(get(str_replace(cur_column(), "Avg", "Std"))
(. - get(str_c( "combined_", cur_column() )))^2)),
.names = "combined_{str_replace(.col, 'Avg', 'Std')}"))
-checking with OP's manual approach
out1 <- df %>%
mutate(Hour = floor(Hour)) %>%
group_by(Hour) %>%
summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
combined_a1_Std = sqrt((1/n())*sum(a1_Std (a1_Avg - combined_a1_Avg)^2)),
combined_a2_Std = sqrt((1/n())*sum(a2_Std (a2_Avg - combined_a2_Avg)^2)))
identical(out1, out2)
[1] TRUE
data
set.seed(1)
df <- data.frame(a1_Avg = rnorm(10),
a1_Std = runif(10),
a2_Avg = rnorm(10),
a2_Std = runif(10),
Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2))