Context
I want to calculate a cumulative mean row-wise, on different sets of columns defined by a pattern in the column names.
Example data, with two sets of columns whose names start with a and b, respectively:
a1 = c(1, 2, 3)
a2 = c(4, 5, 6)
a3 = c(7, 8, 9)
a4 = c(10, 11, 12)
b1 = c(10, 20, 30)
b2 = c(40, 50, 60)
b3 = c(70, 80, 90)
b4 = c(100, 110, 120)
df = data.frame(a1, a2, a3, a4, b1, b2, b3, b4)
> df
a1 a2 a3 a4 b1 b2 b3 b4
1 1 4 7 10 10 40 70 100
2 2 5 8 11 20 50 80 110
3 3 6 9 12 30 60 90 120
The first set of calculations is performed among the columns whose names start with an a:
a1_2 is the mean value of a1 and a2.
a1_3 is the mean value of a1, a2 and a3.
a1_4 is the mean value of a1, a2, a3 and a4.
Similarly, I want to perform the same calculations on the "b columns": b1_2, b1_3 and b1_4 are calculated in exactly the same way as a1_2, a1_3 and a1_4.
I can generate a1_2 to b1_4 with the following code, but in my real data there are too many similar variables to write out by hand.
library(dplyr)
df %>%
  rowwise() %>%
  mutate(a1_2 = mean(c(a1, a2)),
         a1_3 = mean(c(a1, a2, a3)),
         a1_4 = mean(c(a1, a2, a3, a4)),
         b1_2 = mean(c(b1, b2)),
         b1_3 = mean(c(b1, b2, b3)),
         b1_4 = mean(c(b1, b2, b3, b4))) %>%
  ungroup()
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
1 1 4 7 10 10 40 70 100 2.5 4 5.5 25 40 55
2 2 5 8 11 20 50 80 110 3.5 5 6.5 35 50 65
3 3 6 9 12 30 60 90 120 4.5 6 7.5 45 60 75
Question
How can I perform these calculations more efficiently, without having to generate them one by one manually? These generated variables have a pattern, and the pattern is to calculate the average of multiple variables.
What I've done
I checked a question related to mine (Need to create multiple new variables simultaneously using across() in R). But in this question, the new variables generated by the author are not related to the other variables in the data frame, which is not the same as the problem I encountered.
CodePudding user response:
I don't know how to solve this specifically using mutate, but I can show you an approach using data.table.
First, I'll explain my way of approaching this:
To start, for columns that start with a, you're looking for the mean of (a1, a2), the mean of (a1, a2, a3), etc. I'll simplify and just call these mean of (1, 2), mean of (1, 2, 3), etc. From the looks of it, we can rephrase this as needing a cumulative mean across groupings of columns.
We can try to use cummean by group, but that only works down rows, not across columns.
So we reshape the dataset to long format, apply cummean, and then reshape it back to its original wide format.
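As a quick sanity check of the idea, a cumulative mean is just a running sum divided by a running count. The sketch below uses base R only (so it does not depend on dplyr's cummean) and the values of the first row of the a columns from the question. The helper name cum_mean is made up for illustration.

```r
# Hypothetical helper: cumulative mean of a numeric vector, base R only
cum_mean <- function(x) cumsum(x) / seq_along(x)

# First row of the "a" columns in the example data: a1 = 1, a2 = 4, a3 = 7, a4 = 10
row1_a <- c(1, 4, 7, 10)
res <- cum_mean(row1_a)
print(res)
# [1] 1.0 2.5 4.0 5.5
# The last three values are a1_2, a1_3 and a1_4 for row 1.
```

In long format each (row_id, group) pair holds exactly such a vector, which is why a grouped cummean gives the desired columns.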
Import packages:
library(data.table)
library(dplyr)
Convert your data frame to data.table format:
setDT(df)
Add a row number to the data frame:
df[, row_id := .I]
Re-shape to long format:
df2 = melt.data.table(df, id.vars = "row_id")
At this point, your data looks as follows:
row_id variable value
1: 1 a1 1
2: 2 a1 2
3: 3 a1 3
4: 1 a2 4
5: 2 a2 5
6: 3 a2 6
7: 1 a3 7
8: 2 a3 8
9: 3 a3 9
10: 1 a4 10
11: 2 a4 11
12: 3 a4 12
13: 1 b1 10
14: 2 b1 20
15: 3 b1 30
16: 1 b2 40
17: 2 b2 50
18: 3 b2 60
19: 1 b3 70
20: 2 b3 80
21: 3 b3 90
22: 1 b4 100
23: 2 b4 110
24: 3 b4 120
Let's split out the letter from the number so that we can create groups by the letter:
df2[, group := substr(variable, 1, 1)]
df2[, number := as.numeric(gsub("[[:alpha:]]", "", variable))]
At this point, we have:
row_id variable value group number
1: 1 a1 1 a 1
2: 2 a1 2 a 1
3: 3 a1 3 a 1
4: 1 a2 4 a 2
5: 2 a2 5 a 2
6: 3 a2 6 a 2
7: 1 a3 7 a 3
8: 2 a3 8 a 3
9: 3 a3 9 a 3
10: 1 a4 10 a 4
11: 2 a4 11 a 4
12: 3 a4 12 a 4
13: 1 b1 10 b 1
14: 2 b1 20 b 1
15: 3 b1 30 b 1
16: 1 b2 40 b 2
17: 2 b2 50 b 2
18: 3 b2 60 b 2
19: 1 b3 70 b 3
20: 2 b3 80 b 3
21: 3 b3 90 b 3
22: 1 b4 100 b 4
23: 2 b4 110 b 4
24: 3 b4 120 b 4
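Note that substr(variable, 1, 1) assumes every prefix is a single character. If your real column names have longer prefixes, the split could be done with regular expressions instead. This is only a sketch, assuming each name is a non-numeric prefix followed by a trailing number; the example names below are invented for illustration.

```r
# Invented example names, including a multi-letter prefix and a two-digit suffix
vars <- c("a1", "a2", "a10", "b3", "ab12")

# Prefix: everything left after removing the trailing digits
group <- sub("[0-9]+$", "", vars)

# Number: everything left after removing the leading non-digits
number <- as.integer(sub("^[^0-9]*", "", vars))

print(data.frame(vars, group, number))
```

Converting number to an integer keeps the ordering numeric, so 10 sorts after 2 rather than before it.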
Now, we can take the cumulative mean of value by row_id and group:
df2[, avg_val := cummean(value), by = c("row_id", "group")]
We then create your column naming convention as follows:
df2[, col_name := paste0(group, min(number), "_", number), by = group]
At this point, we have:
row_id variable value group number avg_val col_name
1: 1 a1 1 a 1 1.0 a1_1
2: 2 a1 2 a 1 2.0 a1_1
3: 3 a1 3 a 1 3.0 a1_1
4: 1 a2 4 a 2 2.5 a1_2
5: 2 a2 5 a 2 3.5 a1_2
6: 3 a2 6 a 2 4.5 a1_2
7: 1 a3 7 a 3 4.0 a1_3
8: 2 a3 8 a 3 5.0 a1_3
9: 3 a3 9 a 3 6.0 a1_3
10: 1 a4 10 a 4 5.5 a1_4
11: 2 a4 11 a 4 6.5 a1_4
12: 3 a4 12 a 4 7.5 a1_4
13: 1 b1 10 b 1 10.0 b1_1
14: 2 b1 20 b 1 20.0 b1_1
15: 3 b1 30 b 1 30.0 b1_1
16: 1 b2 40 b 2 25.0 b1_2
17: 2 b2 50 b 2 35.0 b1_2
18: 3 b2 60 b 2 45.0 b1_2
19: 1 b3 70 b 3 40.0 b1_3
20: 2 b3 80 b 3 50.0 b1_3
21: 3 b3 90 b 3 60.0 b1_3
22: 1 b4 100 b 4 55.0 b1_4
23: 2 b4 110 b 4 65.0 b1_4
24: 3 b4 120 b 4 75.0 b1_4
We can get rid of the rows where number = 1, as these represent the original data. From there, we can cast it to wide format, and merge back the original data:
df2 = df2[number != 1]
result = dcast.data.table(df2, row_id ~ col_name, value.var = "avg_val")
result = merge(df, result, by = "row_id")
result[, row_id := NULL]
The final result is:
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
1: 1 4 7 10 10 40 70 100 2.5 4 5.5 25 40 55
2: 2 5 8 11 20 50 80 110 3.5 5 6.5 35 50 65
3: 3 6 9 12 30 60 90 120 4.5 6 7.5 45 60 75
CodePudding user response:
Here is a dplyr option in one pipe (thanks to @jav for the nice approach):
library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = -id) %>%
  mutate(group = sub("^([[:alpha:]]*).*", "\\1", name),
         number = gsub(".*?([0-9]+).*", "\\1", name)) %>%
  group_by(id, group) %>%
  mutate(avg_value = cummean(value),
         col_name := paste0(group, min(number), "_", number)) %>%
  filter(number != 1) %>%
  pivot_wider(id_cols = id, names_from = col_name, values_from = avg_value) %>%
  group_by(id) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1) %>%
  merge(df %>% mutate(id = row_number()), ., by = 'id') %>%
  select(-id)
#> a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
#> 1 1 4 7 10 10 40 70 100 2.5 4 5.5 25 40 55
#> 2 2 5 8 11 20 50 80 110 3.5 5 6.5 35 50 65
#> 3 3 6 9 12 30 60 90 120 4.5 6 7.5 45 60 75
Created on 2022-09-18 with reprex v2.0.2
CodePudding user response:
Here is another approach in base R that:
- defines a small wrapper function applying rowMeans to the sets of columns whose names start with the prefix v.
- column-binds the original data.frame with the output of the wrapper for the prefixes "a" and "b" (this can be generalized in case of many different variable names).
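rowMeansAcc <- function(df, v) {
  # select the columns whose names start with the prefix v
  m <- as.matrix(df[, grep(paste0("^", v), colnames(df))])
  # row means over the first i columns for each i; drop the trivial i = 1 case
  m_mean <- sapply(1:ncol(m), \(i) rowMeans(m[, 1:i, drop = FALSE]))[, -1, drop = FALSE]
  colnames(m_mean) <- sprintf("%s1_%d", v, 2:ncol(m))
  m_mean
}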
cbind(df, rowMeansAcc(df, "a"), rowMeansAcc(df, "b"))
#> a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
#> 1 1 4 7 10 10 40 70 100 2.5 4 5.5 25 40 55
#> 2 2 5 8 11 20 50 80 110 3.5 5 6.5 35 50 65
#> 3 3 6 9 12 30 60 90 120 4.5 6 7.5 45 60 75
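One way the generalization to many prefixes might look is sketched below. It derives the prefixes from the column names instead of listing them by hand. The helper cum_row_means is a hypothetical stand-in written for this sketch, and it assumes that every column is named as a prefix followed by a number, that the columns of each prefix already appear in increasing order, and that each prefix has at least two columns.

```r
# Example data from the question
df <- data.frame(
  a1 = c(1, 2, 3), a2 = c(4, 5, 6), a3 = c(7, 8, 9), a4 = c(10, 11, 12),
  b1 = c(10, 20, 30), b2 = c(40, 50, 60), b3 = c(70, 80, 90), b4 = c(100, 110, 120)
)

# Hypothetical helper: cumulative row means over the columns named <prefix><number>
cum_row_means <- function(df, prefix) {
  cols <- grep(paste0("^", prefix, "[0-9]+$"), colnames(df), value = TRUE)
  m <- as.matrix(df[, cols, drop = FALSE])
  out <- matrix(NA_real_, nrow = nrow(m), ncol = ncol(m) - 1)
  for (i in 2:ncol(m)) {
    # mean of the first i columns, row by row
    out[, i - 1] <- rowMeans(m[, 1:i, drop = FALSE])
  }
  colnames(out) <- sprintf("%s1_%d", prefix, 2:ncol(m))
  out
}

# Derive the prefixes ("a", "b") by stripping the trailing digits from the names
prefixes <- unique(sub("[0-9]+$", "", colnames(df)))

# Compute one block of new columns per prefix and bind them to the original data
new_cols <- lapply(prefixes, function(p) cum_row_means(df, p))
result <- cbind(df, do.call(cbind, new_cols))
print(result)
```

The explicit loop with a preallocated matrix is used here only because it keeps the result a matrix even when the data has a single row; the sapply version above is equivalent for the example data.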
Note: we could slightly modify the wrapper function so that it can be used directly in, e.g., a mutate() call:
rowMeansAcc2 <- function(...) {
  m <- cbind(...)
  m_mean <- sapply(1:ncol(m), \(i) rowMeans(m[, 1:i, drop = FALSE]))[, -1, drop = FALSE]
  colnames(m_mean) <- sprintf("1_%d", 2:ncol(m))
  m_mean
}
mutate(df, "a" = rowMeansAcc2(a1, a2, a3, a4), "b" = rowMeansAcc2(b1, b2, b3, b4))
#> a1 a2 a3 a4 b1 b2 b3 b4 a.1_2 a.1_3 a.1_4 b.1_2 b.1_3 b.1_4
#> 1 1 4 7 10 10 40 70 100 2.5 4.0 5.5 25 40 55
#> 2 2 5 8 11 20 50 80 110 3.5 5.0 6.5 35 50 65
#> 3 3 6 9 12 30 60 90 120 4.5 6.0 7.5 45 60 75