My dput is at the end of this prompt.
What I’m trying to do: I want to make two dummy variables. One is countrypair_dummy
that is defined as follows. For each group_number
, for every unique AB-CD
pair in countrypairlist
(each row already lists unique pairs, seperated by ,
, by construction) at each Date
, set countrypair_dummy == 1
if countrybirthdate <= Date
and countrydeathdate > Date
for BOTH countries in the pair.
So, for example:
Country CZ
has countrybirthdate == 1990-10-08
and countrydeathdate == 1992-12-31
for group_number == 773
. EP
has countrybirthdate == 1990-10-08
and countrydeathdate == 2004-05-01
for group_number == 773
. Hence, for the country pair CZ-EP
and group_number == 773
, we have countrydummy == 1
at all Date
s observed in the data from 1990-10-08
until 1992-12-31
, and countrydummy == 0
otherwise.
The second variable---call it country_dummy
is just like countrypair_dummy
, except with one difference: country_dummy == 1
if countrybirthdate <= Date
and countrydeathdate > Date
for EITHER countries in the pair. Hence, for the country pair CZ-EP
and group_number == 773
, we have country_dummy == 1
for all Date
observed between 1990-10-08
until 2004-05-01
.
Any help would be greatly appreciated!
PS. I realize the first step is to do df_1 <- separate_rows(df_1, countrypairlist, sep = ", ")
. After that, I have a hard time trying to reference the pair in R/dplyr.
structure(list(Date = structure(c(7585L, 8338L, 7585L, 8338L,
7645L, 7585L, 8400L, 12539L, 8338L, 8338L, 7991L, 8595L, 8012L,
8625L, 8547L, 8012L, 8625L, 8012L, 8533L, 12539L, 8625L, 8124L,
8124L, 8124L, 8124L), class = c("IDate", "Date")), group_number = c(773L,
773L, 773L, 773L, 773L, 773L, 773L, 773L, 773L, 773L, 773L, 817L,
817L, 817L, 817L, 817L, 817L, 817L, 817L, 817L, 817L, 822L, 822L,
822L, 822L), countrybirthdate = structure(c(7585L, 7585L, 7585L,
7585L, 7585L, 7585L, 7585L, 7585L, 7585L, 7585L, 7585L, 8012L,
8012L, 8012L, 8012L, 8012L, 8012L, 8012L, 8012L, 8012L, 8012L,
8124L, 8124L, 8124L, 8124L), class = c("IDate", "Date")), countrydeathdate = structure(c(12539L,
8400L, 8400L, 19722L, 8400L, 19722L, 8400L, 12539L, 12539L, 19722L,
12539L, 19722L, 19722L, 12539L, 12539L, 12539L, 19722L, 19722L,
19722L, 12539L, 19722L, 19722L, 19722L, 19722L, 19722L), class = c("IDate",
"Date")), country = c("EP", "CZ", "CZ", "DE", "CZ", "DE",
"CZ", "EP", "EP", "DE", "EP", "DE", "DE", "EP", "EP", "EP", "DE",
"CZ", "CZ", "EP", "CZ", "SE", "LT", "LT", "SE"), countrypairlist = c("CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK", "CZ-EP, CZ-DE, CZ-SK, EP-DE, EP-SK, DE-SK",
"LT-SE", "LT-SE", "LT-SE", "LT-SE")), row.names = c(NA, -25L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x000001b8b0a4b6b0>
CodePudding user response:
I think this will get you close to where you need to be. My goal was to generate a data frame with the information for both countries on the same line, which greatly facilitates analysis in dplyr
. Note that not all of the countries are represented in the list of birth/death dates.
# make a lookup table of unique group/country birth/death dates
dates <-
df %>%
select(group_number, country, countrybirthdate, countrydeathdate) %>%
unique()
# left_join twice to based on the countrypairlist values
df %>%
separate_rows(countrypairlist, sep = ", ") %>%
select(Date, group_number, countrypairlist) %>%
separate(countrypairlist, c("country", "country.x"), "-") %>%
left_join(dates, by = c("country", "group_number")) %>%
left_join(dates, by = c("country.x" = "country", "group_number")) %>%
mutate(
countrypair_dummy = if_else(
countrybirthdate.x <= Date & countrybirthdate.y <= Date & countrydeathdate.x > Date & countrydeathdate.y > Date, 1, 0),
country_dummy = if_else(
(countrybirthdate.x <= Date & countrydeathdate.x > Date) | (countrybirthdate.y <= Date & countrydeathdate.y > Date), 1, 0)
)
)