I have two dataframes. df1 with several hundred thousand rows each indicating a parliamentary speech. These have columns that indicate the date of the speech, the party, speaker etc.
df1 looks like this:
date speechnumber speaker party party.facts.id chair terms parliament iso3country year pyear
1 1997-10-07 3 Poul Nyrup Rasmussen S 379 FALSE 18662 DK-Folketing DNK 1997 1
2 1997-10-09 3 Torben Lund S 379 FALSE 2865 DK-Folketing DNK 1997 1
3 1997-10-09 5 Frank Aaen EL 1527 FALSE 159 DK-Folketing DNK 1997 1
4 1997-10-09 6 Pia Kjærsgaard DF 1022 FALSE 195 DK-Folketing DNK 1997 1
5 1997-10-09 7 Kristian Thulesen Dahl DF 1022 FALSE 202 DK-Folketing DNK 1997 1
6 1997-10-09 10 Jette Gottlieb EL 1527 FALSE 179 DK-Folketing DNK 1997 1
Then I have created another one (df2) where I have specified which parties formed government at what times. I have made this into long format (not sure whether this is best), where 1 indicates that party x was in government in the period from "start_date" to "end_date".
df2 looks like this:
start_date end_date party values
1 1996-12-30 1998-03-23 RV 1
2 1998-03-23 2001-11-27 RV 1
3 2001-11-27 2005-02-18 RV 0
4 2005-02-18 2007-11-23 RV 0
5 2007-11-23 2009-04-05 RV 0
6 2009-04-05 2011-10-03 RV 0
7 2011-10-03 2014-02-03 RV 1
8 2014-02-03 2015-06-28 RV 1
9 2015-06-28 2016-11-28 RV 0
10 2016-11-28 2019-06-27 RV 0
11 2019-06-27 2020-10-27 RV 0
12 1996-12-30 1998-03-23 S 1
13 1998-03-23 2001-11-27 S 1
14 2001-11-27 2005-02-18 S 0
15 2005-02-18 2007-11-23 S 0
16 2007-11-23 2009-04-05 S 0
17 2009-04-05 2011-10-03 S 0
18 2011-10-03 2014-02-03 S 1
19 2014-02-03 2015-06-28 S 1
20 2015-06-28 2016-11-28 S 0
21 2016-11-28 2019-06-27 S 0
22 2019-06-27 2020-10-27 S 1
23 1996-12-30 1998-03-23 V 0
24 1998-03-23 2001-11-27 V 0
25 2001-11-27 2005-02-18 V 1
26 2005-02-18 2007-11-23 V 1
27 2007-11-23 2009-04-05 V 1
28 2009-04-05 2011-10-03 V 1
29 2011-10-03 2014-02-03 V 0
30 2014-02-03 2015-06-28 V 0
31 2015-06-28 2016-11-28 V 1
32 2016-11-28 2019-06-27 V 1
33 2019-06-27 2020-10-27 V 0
34 1996-12-30 1998-03-23 KF 0
35 1998-03-23 2001-11-27 KF 0
36 2001-11-27 2005-02-18 KF 1
37 2005-02-18 2007-11-23 KF 1
38 2007-11-23 2009-04-05 KF 1
39 2009-04-05 2011-10-03 KF 1
40 2011-10-03 2014-02-03 KF 0
41 2014-02-03 2015-06-28 KF 0
42 2015-06-28 2016-11-28 KF 0
43 2016-11-28 2019-06-27 KF 1
44 2019-06-27 2020-10-27 KF 0
45 1996-12-30 1998-03-23 SF 0
46 1998-03-23 2001-11-27 SF 0
47 2001-11-27 2005-02-18 SF 0
48 2005-02-18 2007-11-23 SF 0
49 2007-11-23 2009-04-05 SF 0
50 2009-04-05 2011-10-03 SF 0
51 2011-10-03 2014-02-03 SF 1
52 2014-02-03 2015-06-28 SF 0
53 2015-06-28 2016-11-28 SF 0
54 2016-11-28 2019-06-27 SF 0
55 2019-06-27 2020-10-27 SF 0
56 1996-12-30 1998-03-23 LA 0
57 1998-03-23 2001-11-27 LA 0
58 2001-11-27 2005-02-18 LA 0
59 2005-02-18 2007-11-23 LA 0
60 2007-11-23 2009-04-05 LA 0
61 2009-04-05 2011-10-03 LA 0
62 2011-10-03 2014-02-03 LA 0
63 2014-02-03 2015-06-28 LA 0
64 2015-06-28 2016-11-28 LA 0
65 2016-11-28 2019-06-27 LA 1
66 2019-06-27 2020-10-27 LA 0
What I want is to create a dummy variable in df1 indicating whether party x was in government at the time of the speech. I have tried so many different ways but I cannot seem to figure it out. Can anyone help?
CodePudding user response:
library(data.table)
setDT(df1); setDT(df2)
# set dates to actual date-format (if not already)
df1[, date := as.Date(date)]
df2[, start_date := as.Date(start_date)]
df2[, end_date := as.Date(end_date)]
# initialise the 'ingov' column to "no"
df1[, ingov := "no"]
df1[df2[values == 1,],
ingov := "yes",
on = .(party, date >= start_date, date < end_date)]
# date speaker party year pyear ingov
# 1: 1997-10-07 Poul Nyrup Rasmussen S 1997 1 yes
# 2: 1997-10-09 Torben Lund S 1997 1 yes
# 3: 1997-10-09 Frank Aaen EL 1997 1 no
# 4: 1997-10-09 Pia Kjærsgaard DF 1997 1 no
# 5: 1997-10-09 Kristian Thulesen Dahl DF 1997 1 no
# 6: 1997-10-09 Jette Gottlieb EL 1997 1 no