I didn't see another question quite like this on SO. Let's say I have a dataframe of consisting of 1...m = M variables I want to aggregate over, plus two other grouping variables I plan to aggregate by (m 2 columns total). One of the variables in M needs to be averaged, and the other m-1 should be summed.
I'm familiar with various ways to aggregate data in R, including performing different functions on separate columns. For example, I like this data.table
method:
df <- df[, list(w = sum(w), x = mean(x)), by = c("y", "z")]
which sums w and averages y by two groups, y,z.
After some research, it also looks like there's a dplyr
function across(-x, sum)
that can subject all variables meeting a condition to an aggregation function -- in this case, sum all variables other than x.
I'm having trouble figuring out how to use these together. In my case, I have > 1,000,000 rows and 200 columns, so it's not feasible to cbind()
everything on the LHS of an aggregate()
-type function.
- How might I apply multiple functions (
mean
,sum
) with anacross()
technique? (My instinct is it might involve some merging.) - Is there a
data.table
solution? (Not necessary, but helpful for speed.)
MWE: Consider the iris dataset.
library(datasets)
data(iris)
summary(iris)
which has four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one factor (Species).
I want to group by species, and within that group, aggregate each of the other four columns. However, Sepal.Length should be averaged (mean
), while Sepal.Width, Petal.Length, Petal.Width should be summed (summed
).
I can do this by naming each variable as follows:
library(data.table)
iris <- data.table(iris)
iris[, list(Sepal.Length = mean(Sepal.Length),
Sepal.Width = sum(Sepal.Width),
Petal.Length = sum(Petal.Length),
Petal.Width = sum(Petal.Width)), by = c("Species")]
How can I generalize the procedure so that it's only necessary to name the exceptions (mean(Sepal.Length)
)?
As an alternative to naming only the exceptions (in the manner of -Sepal.Length
), I could also exhaustively name all the indices; eg:
iris[, list(iris[,1] = mean(iris[,1]),
iris[,2:4] = sum(iris[,2:4])), by = c(Species)]
# doesn't work of course
CodePudding user response:
One of the great things about dplyr::across
is that it doesn't have to be used in isolation.
your_data %>%
group_by(group1, group2) %>%
summarize(across(-col_to_mean, sum), mean_col = mean(col_to_mean))
That works perfectly in my head, but if you want it tested please do provide a MWE.