Calculate cumulative mean row-wise on groups of columns with a common names pattern

Time:09-19

Context

I want to calculate a cumulative mean row-wise, on different sets of columns defined by a pattern in the column names.

Example data, with two sets of columns starting with a and b, respectively:

a1 = c(1, 2, 3)
a2 = c(4, 5, 6)
a3 = c(7, 8, 9)
a4 = c(10, 11, 12)
b1 = c(10, 20, 30)
b2 = c(40, 50, 60)
b3 = c(70, 80, 90)
b4 = c(100, 110, 120)

df = data.frame(a1, a2, a3, a4, b1, b2, b3, b4)

> df
  a1 a2 a3 a4 b1 b2 b3  b4
1  1  4  7 10 10 40 70 100
2  2  5  8 11 20 50 80 110
3  3  6  9 12 30 60 90 120

The first set of calculations is performed among the columns where the names start with an a:

a1_2 is the mean value of a1 and a2.

a1_3 is the mean value of a1, a2 and a3.

a1_4 is the mean value of a1, a2, a3 and a4.

Similarly, I want to perform the same calculations on the "b columns": b1_2, b1_3 and b1_4 are calculated in exactly the same way as a1_2, a1_3 and a1_4.

I can generate a1_2 to b1_4 with the following code, but in my real data there are too many such variables to write out by hand.

library(dplyr)
df %>% 
  rowwise() %>% 
  mutate(a1_2 = mean(c(a1, a2)),
         a1_3 = mean(c(a1, a2, a3)),
         a1_4 = mean(c(a1, a2, a3, a4)),
         b1_2 = mean(c(b1, b2)),
         b1_3 = mean(c(b1, b2, b3)),
         b1_4 = mean(c(b1, b2, b3, b4))) %>% 
  ungroup()

  a1 a2 a3 a4 b1 b2 b3  b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
1  1  4  7 10 10 40 70 100  2.5    4  5.5   25   40   55
2  2  5  8 11 20 50 80 110  3.5    5  6.5   35   50   65
3  3  6  9 12 30 60 90 120  4.5    6  7.5   45   60   75

Question

How can I perform these calculations more efficiently, without having to generate them one by one manually? The generated variables follow a pattern: each one is the mean of the first k columns in its group.

What I've done

I checked a question related to mine (Need to create multiple new variables simultaneously using across() in R). But in that question, the new variables are not derived from other variables in the data frame, so it does not match the problem I have.

CodePudding user response:

I don't know how to solve this specifically using mutate but I can show you an approach using data.table.

First, I'll explain my way of approaching this:

  1. To start, for columns that start with a, you're looking for mean of (a1, a2), mean of (a1, a2, a3), etc. I'll simplify to just call these mean of (1, 2), mean of (1, 2, 3), etc. From the looks of it, we can re-phrase this as you needing a cumulative mean across groupings of columns.

  2. We can try to use cummean by group, but it accumulates down the rows of a single column, not across columns.

  3. So we just re-shape our dataset to long format, do a cummean and then re-shape it to its original wide format.
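
As a quick check of the "cumulative mean" idea in step 1, here is a minimal base R sketch using the a1 to a4 values from row 1 of the example data. The variable names x and running_mean are only illustrative; dplyr's cummean() computes the same quantity.

```r
# Values of a1..a4 in row 1 of the example data
x <- c(1, 4, 7, 10)

# Running mean: cumulative sum divided by the number of values seen so far
running_mean <- cumsum(x) / seq_along(x)

print(running_mean)
# 1.0 2.5 4.0 5.5
```

The last three values are the a1_2, a1_3 and a1_4 the question asks for in row 1; the first value is just a1 itself.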

Import packages:

   library(data.table)
   library(dplyr)

Convert your data frame to data.table format:

   setDT(df)

Add a row number to the data frame:

   df[, row_id := .I]

Re-shape to long format:

   df2 = melt.data.table(df, id.vars = "row_id")

At this point, your data looks as follows:

    row_id variable value
 1:      1       a1     1
 2:      2       a1     2
 3:      3       a1     3
 4:      1       a2     4
 5:      2       a2     5
 6:      3       a2     6
 7:      1       a3     7
 8:      2       a3     8
 9:      3       a3     9
10:      1       a4    10
11:      2       a4    11
12:      3       a4    12
13:      1       b1    10
14:      2       b1    20
15:      3       b1    30
16:      1       b2    40
17:      2       b2    50
18:      3       b2    60
19:      1       b3    70
20:      2       b3    80
21:      3       b3    90
22:      1       b4   100
23:      2       b4   110
24:      3       b4   120

Let's split out the letter from the number so that we can create groups by the letter:

   df2[, group := substr(variable, 1, 1)]
   df2[, number := as.numeric(gsub("[[:alpha:]]", "", variable))]

At this point, we have:

    row_id variable value group number
 1:      1       a1     1     a      1
 2:      2       a1     2     a      1
 3:      3       a1     3     a      1
 4:      1       a2     4     a      2
 5:      2       a2     5     a      2
 6:      3       a2     6     a      2
 7:      1       a3     7     a      3
 8:      2       a3     8     a      3
 9:      3       a3     9     a      3
10:      1       a4    10     a      4
11:      2       a4    11     a      4
12:      3       a4    12     a      4
13:      1       b1    10     b      1
14:      2       b1    20     b      1
15:      3       b1    30     b      1
16:      1       b2    40     b      2
17:      2       b2    50     b      2
18:      3       b2    60     b      2
19:      1       b3    70     b      3
20:      2       b3    80     b      3
21:      3       b3    90     b      3
22:      1       b4   100     b      4
23:      2       b4   110     b      4
24:      3       b4   120     b      4
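
Note that substr(variable, 1, 1) assumes single-letter prefixes. If your real column names have longer prefixes, a regex-based split may be more robust. The following is only a sketch, assuming every name is a prefix followed by trailing digits; the names score10 and score3 are hypothetical and not part of the example data.

```r
# Column names from the example, plus two hypothetical multi-letter names
nm <- c("a1", "a4", "b2", "score10", "score3")

# Prefix: drop the trailing digits
prefix <- sub("[0-9]+$", "", nm)

# Index: drop everything before the first digit, then convert to integer
index <- as.integer(sub("^[^0-9]*", "", nm))

print(data.frame(nm, prefix, index))
```

In the data.table code above, these two expressions could stand in for the substr() and gsub() calls if needed.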

Now, we can take the cumulative mean of value by row_id and group:

   df2[, avg_val := cummean(value), by = c("row_id", "group")]

We then create your column naming convention as follows:

   df2[, col_name := paste0(group, min(number), "_", number), by = "group"]

At this point, we have:

    row_id variable value group number avg_val col_name
 1:      1       a1     1     a      1     1.0     a1_1
 2:      2       a1     2     a      1     2.0     a1_1
 3:      3       a1     3     a      1     3.0     a1_1
 4:      1       a2     4     a      2     2.5     a1_2
 5:      2       a2     5     a      2     3.5     a1_2
 6:      3       a2     6     a      2     4.5     a1_2
 7:      1       a3     7     a      3     4.0     a1_3
 8:      2       a3     8     a      3     5.0     a1_3
 9:      3       a3     9     a      3     6.0     a1_3
10:      1       a4    10     a      4     5.5     a1_4
11:      2       a4    11     a      4     6.5     a1_4
12:      3       a4    12     a      4     7.5     a1_4
13:      1       b1    10     b      1    10.0     b1_1
14:      2       b1    20     b      1    20.0     b1_1
15:      3       b1    30     b      1    30.0     b1_1
16:      1       b2    40     b      2    25.0     b1_2
17:      2       b2    50     b      2    35.0     b1_2
18:      3       b2    60     b      2    45.0     b1_2
19:      1       b3    70     b      3    40.0     b1_3
20:      2       b3    80     b      3    50.0     b1_3
21:      3       b3    90     b      3    60.0     b1_3
22:      1       b4   100     b      4    55.0     b1_4
23:      2       b4   110     b      4    65.0     b1_4
24:      3       b4   120     b      4    75.0     b1_4

We can get rid of the rows where number = 1 as these represent the original data. From there, we can cast it to wide format, and merge back the original data:

   df2 = df2[number != 1]
   result = dcast.data.table(df2, row_id ~ col_name, value.var = "avg_val")

   result = merge(df, result, by = "row_id")

   result[, row_id := NULL]

The final result is:

   a1 a2 a3 a4 b1 b2 b3  b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
1:  1  4  7 10 10 40 70 100  2.5    4  5.5   25   40   55
2:  2  5  8 11 20 50 80 110  3.5    5  6.5   35   50   65
3:  3  6  9 12 30 60 90 120  4.5    6  7.5   45   60   75

CodePudding user response:

Here is a dplyr option in one pipe (thanks to @jav for nice approach):

library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = -id) %>%
  mutate(group = sub("^([[:alpha:]]*).*", "\\1", name),
         number = gsub(".*?([0-9]+).*", "\\1", name)) %>%
  group_by(id, group) %>%
  mutate(avg_value = cummean(value),
         col_name := paste0(group, min(number), "_", number)) %>%
  filter(number != 1) %>%
  pivot_wider(id_cols = id, names_from = col_name, values_from = avg_value) %>%
  group_by(id) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1) %>%
  merge(df %>% mutate(id = row_number()), ., by = 'id') %>%
  select(-id)
#>   a1 a2 a3 a4 b1 b2 b3  b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
#> 1  1  4  7 10 10 40 70 100  2.5    4  5.5   25   40   55
#> 2  2  5  8 11 20 50 80 110  3.5    5  6.5   35   50   65
#> 3  3  6  9 12 30 60 90 120  4.5    6  7.5   45   60   75

Created on 2022-09-18 with reprex v2.0.2

CodePudding user response:

Here is another approach in base R that:

  1. defines a small wrapper function applying rowMeans to the sets of columns starting with variable v.
  2. column binds the original data.frame with the output of the wrapper based on the names "a" and "b" (this can be generalized in case of many different variable names).

rowMeansAcc <- function(df, v) {
  m <- as.matrix(df[, grep(paste0("^", v), colnames(df))])  # match v only at the start of the name
  m_mean <- sapply(1:ncol(m), \(i) rowMeans(m[, 1:i, drop = FALSE]))[, -1, drop = FALSE]
  colnames(m_mean) <- sprintf("%s1_%d", v, 2:ncol(m))
  m_mean  
}

cbind(df, rowMeansAcc(df, "a"), rowMeansAcc(df, "b"))

#>   a1 a2 a3 a4 b1 b2 b3  b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
#> 1  1  4  7 10 10 40 70 100  2.5    4  5.5   25   40   55
#> 2  2  5  8 11 20 50 80 110  3.5    5  6.5   35   50   65
#> 3  3  6  9 12 30 60 90 120  4.5    6  7.5   45   60   75
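
One way the generalization to many variable names might look is sketched below. It assumes that every column name is a prefix followed by a number, that numbering starts at 1 within each prefix, and that the columns of each group already appear in order. The prefixes vector and the result name are illustrative additions; the wrapper is restated (with the prefix matched at the start of the name) so that the block runs on its own.

```r
df <- data.frame(
  a1 = c(1, 2, 3), a2 = c(4, 5, 6), a3 = c(7, 8, 9), a4 = c(10, 11, 12),
  b1 = c(10, 20, 30), b2 = c(40, 50, 60), b3 = c(70, 80, 90), b4 = c(100, 110, 120)
)

# Cumulative row means over the columns whose names start with v
rowMeansAcc <- function(df, v) {
  m <- as.matrix(df[, startsWith(colnames(df), v), drop = FALSE])
  m_mean <- sapply(seq_len(ncol(m)),
                   function(i) rowMeans(m[, 1:i, drop = FALSE]))[, -1, drop = FALSE]
  colnames(m_mean) <- sprintf("%s1_%d", v, 2:ncol(m))
  m_mean
}

# Derive the prefixes from the column names instead of typing "a", "b", ...
prefixes <- unique(sub("[0-9]+$", "", colnames(df)))

# Apply the wrapper to every prefix and bind everything to the original data
result <- do.call(cbind, c(list(df), lapply(prefixes, function(v) rowMeansAcc(df, v))))
print(result)
```

On the example data this should give the same columns as the explicit cbind() call above, without listing the prefixes by hand.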


Note: we could slightly modify the wrapper function so that it can be used directly in, e.g., a mutate() call (with dplyr loaded):


rowMeansAcc2 <- function(...) {
  m <- cbind(...)
  m_mean <- sapply(1:ncol(m), \(i) rowMeans(m[, 1:i, drop = FALSE]))[, -1, drop = FALSE]
  colnames(m_mean) <- sprintf("1_%d", 2:ncol(m))
  m_mean  
}

mutate(df, "a" = rowMeansAcc2(a1, a2, a3, a4), "b" = rowMeansAcc2(b1, b2, b3, b4))

#>   a1 a2 a3 a4 b1 b2 b3  b4 a.1_2 a.1_3 a.1_4 b.1_2 b.1_3 b.1_4
#> 1  1  4  7 10 10 40 70 100   2.5   4.0   5.5    25    40    55
#> 2  2  5  8 11 20 50 80 110   3.5   5.0   6.5    35    50    65
#> 3  3  6  9 12 30 60 90 120   4.5   6.0   7.5    45    60    75