Two datasets, each with a unit variable [firmID], time variable [year)] and unique variable each (a numeric [revenue] and a binary [Director]) are merged using inner_join
on both unit and time variables. See below code:
library(PanelMatch)
library(dplyr)
Example1 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
3, 3, 3, 3),
year = c(2000, 2001, 2001, 2002, 2003, 2004, 2005,
2006, 2000, 2001, 2003, 2004, 2003, 2004, 2005, 2008),
revenue = c(1024, 874, 874, 638, 606, 500, 1025, 1400, 2056, 1500, 1300, 2301, 1578, 560, 1204, 2300)),
row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"))
Example1
Example2 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 3, 3, 3, 3), year = c(2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2001, 2002, 2003, 2004, 2005, 2006, 2000, 2001, 2002,
2003),
`Director(binary)` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1)),
row.names = c(NA, -18L), class = c("tbl_df", "tbl", "data.frame"))
Example2
innerJoined <- inner_join(Example1,Example2, by = c("firmID", "year"))
innerJoined
To calculate ATT, I wish to panel match this merged dataset with the below code:
innerJoined <- as.data.frame(innerJoined)
innerJoined$year <- as.integer(innerJoined$year)
innerJoined$gvkey <- as.integer(innerJoined$firmID)
innerJoined$log_revt <- log(innerJoined$revenue)
innerJoined <- as.data.frame(innerJoined)
PM <- PanelMatch(lag = 5,
time.id = "year", unit.id = "firmID",
treatment = "Directory(binary)",
refinement.method = "none",
data = innerJoined,
qoi = "att" ,
outcome.var = "log_revenue",
lead = 0:4)
This produces an error reading: "Error in panel_match(lag, time.id, unit.id, treatment, refinement.method,: Time, unit combinations should uniquely identify rows. Please remove duplicates"
What are these duplicates referring to? And what can I do to remove the duplicates?
PS. The other joining methods (left_join, right_join, full_join) remove/add data incorrectly and so these are not useful for me.
Thanks for any suggestions!
CodePudding user response:
Here’s a simple strategy to identify duplicate panel identifiers, using the paste()
and the duplicated()
functions:
library(PanelMatch)
library(dplyr)
Example1 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), year = c(2000, 2001, 2001, 2002, 2003, 2004, 2005, 2006, 2000, 2001, 2003, 2004, 2003, 2004, 2005, 2008), revenue = c(1024, 874, 874, 638, 606, 500, 1025, 1400, 2056, 1500, 1300, 2301, 1578, 560, 1204, 2300)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"))
Example2 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3), year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2001, 2002, 2003, 2004, 2005, 2006, 2000, 2001, 2002, 2003), `Director(binary)` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1)), row.names = c(NA, -18L), class = c("tbl_df", "tbl", "data.frame"))
innerJoined <- inner_join(Example1,Example2, by = c("firmID", "year"))
innerJoined <- as.data.frame(innerJoined)
innerJoined$year <- as.integer(innerJoined$year)
innerJoined$gvkey <- as.integer(innerJoined$firmID)
innerJoined$log_revt <- log(innerJoined$revenue)
innerJoined <- as.data.frame(innerJoined)
PM <- PanelMatch(lag = 5,
time.id = "year", unit.id = "firmID",
treatment = "Directory(binary)",
refinement.method = "none",
data = innerJoined,
qoi = "att" ,
outcome.var = "log_revenue",
lead = 0:4)
#> Error in panel_match(lag, time.id, unit.id, treatment, refinement.method, : Time, unit combinations should uniquely identify rows. Please remove duplicates
idx <- paste(innerJoined$firmID, innerJoined$year)
idx[duplicated(idx)]
#> [1] "1 2001"
In this example, there are two rows with firmId==1 and year==2001.
CodePudding user response:
Here is a data.table solution for identifying duplicates.
library(data.table)
setDT(innerJoined)[, .N, by=.(firmID, year)][order(-N)]
## firmID year N
## 1: 1 2001 2
## 2: 1 2000 1
## 3: 1 2002 1
## 4: 1 2003 1
## ...
The first clause:setDT(innerJoined)
converts your data.frame to a data.table. The second clause: [, .N, by=.(firmID, year)]
calculates the number of rows in every combination of firmID and year. The third clause: [order(-N)]
sorts the result by decreasing N, so the replicates are first.