Home > Software engineering >  data.table sum of all colums by group
data.table sum of all colums by group

Time:12-16

I have a dataframe consisting of 515 integer columns and 2 643 246 rows, from which I would like to subset an unkown number of columns and aggregate the data to a single column showing the sum, by two group-columns.

To do the first part I've used the selection-function from data.table like this, TestData[,c(Kattegori_Henter("Medicine"), "id", "year"), with = FALSE] Where Kattegori_Henter is a function returning the name of the columns I would like to select, from a different dataset. From this selection I then want to do the aggregation.

I have attempted a couple different solutions in data.table to perform this aggregation, without getting a result. Given the intro-data.table vignette I believed the solution would be to add

TestData[,c(Kattegori_Henter("Medicine"), "id", "year"), with = FALSE, lapply(.SD,sum, na.rm = 
         TRUE), by = c(id, year)]

However, this returns the error Provide either by= or keyby= but not both, which I do not understand the meaning of, and without google giving any good results.

I then attempted:

TestData[,c(Kattegori_Henter("Medicine"), "id", "year"), with = FALSE, a := sum(1.ncol), by = c(id, year)]    

Which didn't result in anything at all, other than returning the subsetted dataframe.

The reasoning behind doing this is that I would like to do use lapply on the kattegori_henterfunction, aggregating the 525 collumns into a set of cattegories.

Thanks in advance for all help!

Edit: Attempted

   TestData[,c(Kattegori_Henter("Medicine"), "id", "year"), with =  
   FALSE][, lapply(.SD, sum, na.rm = TRUE), by = c("id", "year")]

As mentioned in the comments. The results was the same as the 2nd code above, returning an unchanged dataframe.

Edit 2: Removed this from the question, due to a comment on it not producing the wanted results: ", which would be equal to the tidyverse-code:

Test2 %>% 
group_by(id, year) %>% 
summarise(a = sum(1:ncol(.), na.rm = TRUE)) "

CodePudding user response:

I think the code you're looking for is likely:

TestData[, .(a = sum(.SD)), by = .(id, year), .SDcols = Kattegori_Henter("Medicine")]
  • Related