R groupby or aggregate on big df


I'm not understanding how to group by / aggregate on a large df in R.

Columns 0-12 are unique identifiers, and I would like to leave them as-is.

I've tried a number of variations of this:

aggregate(cbind(names(preferences[-c(0,12)])) ~ cbind(names(preferences[c(0,12)])), data = preferences, FUN = sum)

I'm getting:

Error in model.frame.default(formula = cbind(names(preferences[-c(0, 12)])) ~ : variable lengths differ (found for 'cbind(names(preferences[c(0, 12)]))')
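(For context: the formula fails because cbind(names(...)) produces a character matrix of column names rather than the columns themselves, so the model frame sees variables of different lengths. A minimal base-R sketch of the intended aggregation, assuming the identifiers are the first 12 columns of preferences, not the original attempt:

grp  <- names(preferences)[1:12]          # identifier columns
vals <- setdiff(names(preferences), grp)  # columns to sum
agg  <- aggregate(preferences[vals], by = preferences[grp], FUN = sum))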

Input:

a  b     c   d   e
1  f(1)  11  2   15
1  f(1)  12  2   15
2  f(2)  13  4   3
2  f(2)  14  6   4
3  f(3)  15  5   6

Desired output:

a  b     c   d   e
1  f(1)  23  4   30
2  f(2)  27  10  7
3  f(3)  15  5   6
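For reproducibility, a hypothetical construction of the toy input above (reusing the preferences name from the attempt):

preferences <- data.frame(
  a = c(1, 1, 2, 2, 3),
  b = c("f(1)", "f(1)", "f(2)", "f(2)", "f(3)"),
  c = c(11, 12, 13, 14, 15),
  d = c(2, 2, 4, 6, 5),
  e = c(15, 15, 3, 4, 6)
)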

Python equivalent: df[11:624].groupby(by=col11)

The df is 48 GB, so speed matters (Python crashes due to a lack of memory, even with 250 GB available).

After receiving an answer I went and looked at some benchmarks, and this is fast as heck!

CodePudding user response:

library(data.table)

# Convert to a data.table in place (no copy, which matters for a 48 GB df)
setDT(df)

# y: the first 12 identifier columns; x: every remaining column to be summed
y <- names(df)[1:12]
x <- names(df)[13:ncol(df)]

# Sum each value column within every unique combination of the identifiers
df_2 <- df[, lapply(.SD, sum), .SDcols = x, by = y]
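Applied to the toy data from the question (treating a and b as the identifier columns), the same pattern reproduces the desired output; a small sketch, assuming the preferences data frame constructed above:

dt <- as.data.table(preferences)
dt[, lapply(.SD, sum), by = .(a, b)]
#    a    b  c  d  e
# 1: 1 f(1) 23  4 30
# 2: 2 f(2) 27 10  7
# 3: 3 f(3) 15  5  6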

Though be aware of indexing in R vs. Python: R starts counting from 1, whereas Python uses zero-based indexing.
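For example, the answer's column selection in R versus its Python counterpart (an illustrative comparison, not from the original answer):

names(df)[1:12]   # R: first 12 columns (1-based, end-inclusive)
# Python: df.columns[0:12]  (0-based, end-exclusive)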
