Sum shares/number of rows according to date by groups in R data.table/frame

Time:11-12

I would like to calculate the sum of the squared occurrence counts (i.e. the number of rows) of each unique value of group A (industry) within group B (country) over the previous year.

Calculation example for row 5: 2x A, 1x B, 1x C = 2^2 + 1^2 + 1^2 = 6 (this does not include the A from row 1 because it is older than a year, nor the A from row 6 because it is in another country).
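To make the row 5 arithmetic concrete, here is a minimal base R sketch that recomputes it from the example data. The names `d`, `ref`, `window_start`, `in_window` and `row5_value` are only illustrative, and the window is assumed to be inclusive at both ends, as in the `between()` call used in this post.

```r
# Example rows from the question, with dates parsed as Date
d <- data.frame(
  date     = as.Date(c("2016-01-02", "2017-01-01", "2017-01-03",
                       "2017-01-03", "2017-01-04", "2017-01-03")),
  industry = c("A", "A", "B", "C", "A", "A"),
  country  = c("UK", "UK", "UK", "UK", "UK", "US")
)

ref <- d[5, ]  # the row being evaluated (2017-01-04, UK)

# Start of the one-year look-back window: 2016-01-04
window_start <- seq(ref$date, by = "-1 year", length.out = 2)[2]

# Same country, and date inside [window_start, ref$date]
in_window <- d$country == ref$country &
  d$date >= window_start & d$date <= ref$date

counts <- table(d$industry[in_window])  # A = 2, B = 1, C = 1
row5_value <- sum(counts^2)             # 2^2 + 1^2 + 1^2 = 6
print(row5_value)
```

Row 1 drops out because 2016-01-02 falls before the window start, and row 6 drops out on the country condition.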

I can calculate the counts row by row within each country and industry, but I am failing to move this to the aggregated date level:

dt[, count_by_industry:= sapply(date, function(x) length(industry[between(date, x - lubridate::years(1), x)])), 
    by = c("country", "industry")]

Ideally the solution scales to real data with ~2 million rows and around 10k dates and group elements (hence the data.table tag).


Example Data

ID    <- c("1","2","3","4","5","6")
Date <- c("2016-01-02","2017-01-01", "2017-01-03", "2017-01-03", "2017-01-04","2017-01-03")
Industry <- c("A","A","B","C","A","A")
Country <- c("UK","UK","UK","UK","UK","US")
Desired <- c(1,4,3,3,6,1)

library(data.table)
dt <- data.frame(id=ID, date=Date, industry=Industry, country=Country, desired_output=Desired)
setDT(dt)[, date := as.Date(date)]

Answer:

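Adapting from your start: group by country only, and use table() to count each industry inside the one-year window before squaring and summing: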

dt[, output:= sapply(date, function(x) sum(table(industry[between(date, x - lubridate::years(1), x)]) ^ 2)), 
   by = c("country")]
dt
   id       date industry country desired_output output
1:  1 2016-01-02        A      UK              1      1
2:  2 2017-01-01        A      UK              4      4
3:  3 2017-01-03        B      UK              3      3
4:  4 2017-01-03        C      UK              3      3
5:  5 2017-01-04        A      UK              6      6
6:  6 2017-01-03        A      US              1      1
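Since sapply() scans the whole country group once per row, it may get slow at ~2 million rows. One possible alternative, sketched here under assumptions rather than benchmarked, is a data.table non-equi self-join. The column names `start` and `output_join` and the intermediate tables `pairs` and `counts` are made up for this sketch, and `id` is assumed to be unique per row.

```r
library(data.table)

dt <- data.table(
  id       = as.character(1:6),
  date     = as.Date(c("2016-01-02", "2017-01-01", "2017-01-03",
                       "2017-01-03", "2017-01-04", "2017-01-03")),
  industry = c("A", "A", "B", "C", "A", "A"),
  country  = c("UK", "UK", "UK", "UK", "UK", "US")
)

# Start of each row's one-year look-back window (hypothetical helper column)
dt[, start := date - lubridate::years(1)]

# Self-join: for every row (i), collect all rows (x) from the same country
# whose date lies in [i.start, i.date]
pairs <- dt[dt,
            on = .(country, date >= start, date <= date),
            .(id = i.id, industry = x.industry),
            allow.cartesian = TRUE]

# Count rows per industry within each window, then sum the squared counts
counts <- pairs[, .N, by = .(id, industry)][, .(output_join = sum(N^2)), by = id]

# Write the result back onto the original rows and drop the helper column
dt[counts, on = "id", output_join := i.output_join]
dt[, start := NULL]
print(dt)
```

On the example data this reproduces the desired values (1, 4, 3, 3, 6, 1). Note that `pairs` holds one row per (row, row-in-window) combination, so it can become very large when windows contain many rows; in that case it may be better to run the join on unique country/date combinations or in chunks.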