My problem is hopefully easy to understand but nontrivial to put into coding terminology. I will adjust the title for more accurate/general terminology once we figure out what others would look for.
I want to calculate a HHI (Herfindahl index) on a per day basis, that is, take the sum of the squared shares of every industry
by country. Value_country
is the number of id
s by country in the previous year and value_country_industry
is the number of id
s per country and industry in the previous year.
Calculation example: The HHI on 2017/01/03
calculates as (3/5)^2 (industry A) (1/5)^2 (industry B) (1/5)^2 (industry C) = 0.44
where, the denominator equals value_country
at the respective date and the numerator is the most recent value_country_industry
for every industry (i.e. the share, but at the current date. This implies that I cannot just work with the share
column.)
A solution scaling to larger data (and potentially working with NA
s) would be ideal (hence, the data.table
tag).
Example Data
library(data.table)
library(dplyr)
ID <- c("1","2","3","4","5","6")
Date <- c("2017-01-01","2017-01-02", "2017-01-02", "2017-01-02", "2017-01-03","2017-01-02")
Industry <- c("A","A","B","C","A","A")
Country <- c("UK","UK","UK","UK","UK","US")
Value_country<- c(1,4,4,4,5,1)
Value_country_industry<- c(1,2,1,1,3,1)
Share <- c(1,0.5,0.25,0.25,0.6,1)
Desired <- c(1,0.375,0.375,0.375,0.44,1)
dt <- data.frame(id=ID, date=Date, industry=Industry, country=Country, value_country=Value_country, value_country_industry=Value_country_industry, desired_output=Desired)
setDT(dt)[, date := as.Date(date)]
CodePudding user response:
One approach is to cast the data wider in order to more easily account for the "most recent" value_country_industry" rule.
Then fill down and replace any NA's with 0's.
HHI can then be calculated accross the columns, being sure that it will work regardless of how many industries there are.
library(magrittr)
dt_wide <- dt[, -c('desired_output')] %>%
setnames(c('value_country', 'value_country_industry'), c('vc', 'vci')) %>%
dcast(country date ~ industry, fun.aggregate = last, fill = NA,
value.var = c('vc', 'vci'))
vci_cols <- names(dt_wide) %>% .[grepl('vci', .)]
dt_wide[, (vci_cols) := lapply(.SD, nafill, type = 'locf'), by = 'country',
.SDcols = vci_cols] %>%
setnafill(fill = 0L, cols = 3:length(.))
dt_wide[, num := Reduce(' ', .SD ^ 2), .SDcols = patterns('vci_')]
dt_wide[, den := Reduce('pmax', .SD ^ 2), .SDcols = patterns('vc_')]
dt_wide[, hhi := num / den]
dt_wide[, c('num', 'den') := NULL]
dt[dt_wide, hhi := hhi, on = c('country', 'date')]
dt
id date industry country value_country value_country_industry desired_output hhi
1: 1 2017-01-01 A UK 1 1 1.000 1.000
2: 2 2017-01-02 A UK 4 2 0.375 0.375
3: 3 2017-01-02 B UK 4 1 0.375 0.375
4: 4 2017-01-02 C UK 4 1 0.375 0.375
5: 5 2017-01-03 A UK 5 3 0.440 0.440
6: 6 2017-01-02 A US 1 1 1.000 1.000