Creating dummy variables for pairs of observations-CodePudding

My dput is at the end of this prompt.

What I’m trying to do: I want to make two dummy variables. One is countrypair_dummy that is defined as follows. For each group_number, for every unique AB-CD pair in countrypairlist (each row already lists unique pairs, seperated by , , by construction) at each Date, set countrypair_dummy == 1 if countrybirthdate <= Date and countrydeathdate > Date for BOTH countries in the pair.

So, for example:

Country CZ has countrybirthdate == 1990-10-08 and countrydeathdate == 1992-12-31 for group_number == 773. EP has countrybirthdate == 1990-10-08 and countrydeathdate == 2004-05-01 for group_number == 773. Hence, for the country pair CZ-EP and group_number == 773, we have countrydummy == 1 at all Dates observed in the data from 1990-10-08 until 1992-12-31, and countrydummy == 0 otherwise.

The second variable---call it country_dummy is just like countrypair_dummy, except with one difference: country_dummy == 1 if countrybirthdate <= Date and countrydeathdate > Date for EITHER countries in the pair. Hence, for the country pair CZ-EP and group_number == 773, we have country_dummy == 1 for all Date observed between 1990-10-08 until 2004-05-01.

Any help would be greatly appreciated!

PS. I realize the first step is to do df_1 <- separate_rows(df_1, countrypairlist, sep = ", "). After that, I have a hard time trying to reference the pair in R/dplyr.

structure(list(Date = structure(c(7585L, 8338L, 7585L, 8338L, 
7645L, 7585L, 8400L, 12539L, 8338L, 8338L, 7991L, 8595L, 8012L, 
8625L, 8547L, 8012L, 8625L, 8012L, 8533L, 12539L, 8625L, 8124L, 
8124L, 8124L, 8124L), class = c("IDate", "Date")), group_number = c(773L, 
773L, 773L, 773L, 773L, 773L, 773L, 773L, 773L, 773L, 773L, 817L, 
817L, 817L, 817L, 817L, 817L, 817L, 817L, 817L, 817L, 822L, 822L, 
822L, 822L), countrybirthdate = structure(c(7585L, 7585L, 7585L, 
7585L, 7585L, 7585L, 7585L, 7585L, 7585L, 7585L, 7585L, 8012L, 
8012L, 8012L, 8012L, 8012L, 8012L, 8012L, 8012L, 8012L, 8012L, 
8124L, 8124L, 8124L, 8124L), class = c("IDate", "Date")), countrydeathdate = structure(c(12539L, 
8400L, 8400L, 19722L, 8400L, 19722L, 8400L, 12539L, 12539L, 19722L, 
12539L, 19722L, 19722L, 12539L, 12539L, 12539L, 19722L, 19722L, 
19722L, 12539L, 19722L, 19722L, 19722L, 19722L, 19722L), class = c("IDate", 
"Date")), country = c("EP", "CZ", "CZ", "DE", "CZ", "DE", 
"CZ", "EP", "EP", "DE", "EP", "DE", "DE", "EP", "EP", "EP", "DE", 
"CZ", "CZ", "EP", "CZ", "SE", "LT", "LT", "SE"), countrypairlist = c("CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", 
"LT-SE", "LT-SE", "LT-SE", "LT-SE")), row.names = c(NA, -25L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x000001b8b0a4b6b0>

CodePudding user response：

I think this will get you close to where you need to be. My goal was to generate a data frame with the information for both countries on the same line, which greatly facilitates analysis in dplyr. Note that not all of the countries are represented in the list of birth/death dates.

# make a lookup table of unique group/country birth/death dates
dates <- 
  df %>% 
  select(group_number, country, countrybirthdate, countrydeathdate) %>% 
  unique()

# left_join twice to based on the countrypairlist values
df %>% 
  separate_rows(countrypairlist, sep = ", ") %>% 
  select(Date, group_number, countrypairlist) %>% 
  separate(countrypairlist, c("country", "country.x"), "-") %>% 
  left_join(dates, by = c("country", "group_number")) %>% 
  left_join(dates, by = c("country.x" = "country", "group_number")) %>% 
  mutate(
    countrypair_dummy = if_else(
      countrybirthdate.x <= Date & countrybirthdate.y <= Date & countrydeathdate.x > Date & countrydeathdate.y > Date, 1, 0),
    country_dummy = if_else(
      (countrybirthdate.x <= Date & countrydeathdate.x > Date) | (countrybirthdate.y <= Date & countrydeathdate.y > Date), 1, 0)
    )
  )