Home > Blockchain >  Aggregate by averaging m-1 columns and summing remainder
Aggregate by averaging m-1 columns and summing remainder

Time:10-18

I didn't see another question quite like this on SO. Let's say I have a dataframe of consisting of 1...m = M variables I want to aggregate over, plus two other grouping variables I plan to aggregate by (m 2 columns total). One of the variables in M needs to be averaged, and the other m-1 should be summed.

I'm familiar with various ways to aggregate data in R, including performing different functions on separate columns. For example, I like this data.table method:

df <- df[, list(w = sum(w), x = mean(x)), by = c("y", "z")]

which sums w and averages y by two groups, y,z.

After some research, it also looks like there's a dplyr function across(-x, sum) that can subject all variables meeting a condition to an aggregation function -- in this case, sum all variables other than x.

I'm having trouble figuring out how to use these together. In my case, I have > 1,000,000 rows and 200 columns, so it's not feasible to cbind() everything on the LHS of an aggregate()-type function.

  • How might I apply multiple functions (mean, sum) with an across() technique? (My instinct is it might involve some merging.)
  • Is there a data.table solution? (Not necessary, but helpful for speed.)

MWE: Consider the iris dataset.

library(datasets)
data(iris)
summary(iris)

which has four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one factor (Species).

I want to group by species, and within that group, aggregate each of the other four columns. However, Sepal.Length should be averaged (mean), while Sepal.Width, Petal.Length, Petal.Width should be summed (summed).

I can do this by naming each variable as follows:

library(data.table)
iris <- data.table(iris)
iris[, list(Sepal.Length = mean(Sepal.Length), 
            Sepal.Width = sum(Sepal.Width),
            Petal.Length = sum(Petal.Length),
            Petal.Width = sum(Petal.Width)), by = c("Species")]

How can I generalize the procedure so that it's only necessary to name the exceptions (mean(Sepal.Length))?

As an alternative to naming only the exceptions (in the manner of -Sepal.Length), I could also exhaustively name all the indices; eg:

iris[, list(iris[,1] = mean(iris[,1]), 
            iris[,2:4] = sum(iris[,2:4])), by = c(Species)] 
# doesn't work of course

CodePudding user response:

One of the great things about dplyr::across is that it doesn't have to be used in isolation.

your_data %>% 
  group_by(group1, group2) %>%
  summarize(across(-col_to_mean, sum), mean_col = mean(col_to_mean))

That works perfectly in my head, but if you want it tested please do provide a MWE.

  • Related