Home > Enterprise >  Sum ratios of previous rows values over current row value in a different column by groups in R data.
Sum ratios of previous rows values over current row value in a different column by groups in R data.

Time:11-07

My problem is hopefully easy to understand but nontrivial to put into coding terminology. I will adjust the title for more accurate/general terminology once we figure out what others would look for. I want to calculate a HHI (Herfindahl index) on a per day basis, that is, take the sum of the squared shares of every industry by country. Value_country is the number of ids by country in the previous year and value_country_industry is the number of ids per country and industry in the previous year.

Calculation example: The HHI on 2017/01/03 calculates as (3/5)^2 (industry A) (1/5)^2 (industry B) (1/5)^2 (industry C) = 0.44 where, the denominator equals value_country at the respective date and the numerator is the most recent value_country_industry for every industry (i.e. the share, but at the current date. This implies that I cannot just work with the share column.)

A solution scaling to larger data (and potentially working with NAs) would be ideal (hence, the data.table tag).


Example Data

library(data.table)
library(dplyr)
ID    <- c("1","2","3","4","5","6")
Date <- c("2017-01-01","2017-01-02", "2017-01-02", "2017-01-02", "2017-01-03","2017-01-02")
Industry <- c("A","A","B","C","A","A")
Country <- c("UK","UK","UK","UK","UK","US")
Value_country<- c(1,4,4,4,5,1)
Value_country_industry<- c(1,2,1,1,3,1)
Share <- c(1,0.5,0.25,0.25,0.6,1)
Desired <- c(1,0.375,0.375,0.375,0.44,1)

dt <- data.frame(id=ID, date=Date, industry=Industry, country=Country, value_country=Value_country, value_country_industry=Value_country_industry, desired_output=Desired)
setDT(dt)[, date := as.Date(date)]

CodePudding user response:

One approach is to cast the data wider in order to more easily account for the "most recent" value_country_industry" rule.
Then fill down and replace any NA's with 0's.
HHI can then be calculated accross the columns, being sure that it will work regardless of how many industries there are.

library(magrittr)
dt_wide <- dt[, -c('desired_output')] %>% 
  setnames(c('value_country', 'value_country_industry'), c('vc', 'vci')) %>% 
  dcast(country   date ~ industry, fun.aggregate = last, fill = NA, 
        value.var = c('vc', 'vci'))
vci_cols <- names(dt_wide) %>% .[grepl('vci', .)]
dt_wide[, (vci_cols) := lapply(.SD, nafill, type = 'locf'), by = 'country', 
        .SDcols = vci_cols] %>% 
  setnafill(fill = 0L, cols = 3:length(.))
dt_wide[, num := Reduce(' ', .SD ^ 2), .SDcols = patterns('vci_')]
dt_wide[, den := Reduce('pmax', .SD ^ 2), .SDcols = patterns('vc_')]
dt_wide[, hhi := num / den]
dt_wide[, c('num', 'den') := NULL]
dt[dt_wide, hhi := hhi, on = c('country', 'date')]
dt
   id       date industry country value_country value_country_industry desired_output   hhi
1:  1 2017-01-01        A      UK             1                      1          1.000 1.000
2:  2 2017-01-02        A      UK             4                      2          0.375 0.375
3:  3 2017-01-02        B      UK             4                      1          0.375 0.375
4:  4 2017-01-02        C      UK             4                      1          0.375 0.375
5:  5 2017-01-03        A      UK             5                      3          0.440 0.440
6:  6 2017-01-02        A      US             1                      1          1.000 1.000
  • Related