When I panel match two merged datasets (inner_join), an error appears telling me to remove duplicate-CodePudding

Two datasets, each with a unit variable [firmID], time variable [year)] and unique variable each (a numeric [revenue] and a binary [Director]) are merged using inner_join on both unit and time variables. See below code:

library(PanelMatch)
library(dplyr)


Example1 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
                          3, 3, 3, 3), 
               year = c(2000, 2001, 2001, 2002, 2003, 2004, 2005, 
                                                2006, 2000, 2001, 2003, 2004, 2003, 2004, 2005, 2008), 
               revenue = c(1024, 874, 874, 638, 606, 500, 1025, 1400, 2056, 1500, 1300, 2301, 1578, 560, 1204, 2300)), 
          row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"))
Example1

Example2 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
                          2, 2, 3, 3, 3, 3), year = c(2000, 2001, 2002, 2003, 2004, 2005, 
                                                      2006, 2007, 2001, 2002, 2003, 2004, 2005, 2006, 2000, 2001, 2002, 
                                                      2003), 
               `Director(binary)` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1)), 
          row.names = c(NA, -18L), class = c("tbl_df", "tbl", "data.frame"))
Example2

innerJoined <- inner_join(Example1,Example2, by = c("firmID", "year"))
innerJoined

To calculate ATT, I wish to panel match this merged dataset with the below code:

innerJoined <- as.data.frame(innerJoined)

innerJoined$year <- as.integer(innerJoined$year)

innerJoined$gvkey <- as.integer(innerJoined$firmID)

innerJoined$log_revt <- log(innerJoined$revenue)

innerJoined <- as.data.frame(innerJoined)

PM <- PanelMatch(lag = 5,
                   time.id = "year", unit.id = "firmID", 
                   treatment = "Directory(binary)", 
                   refinement.method = "none",
                   data = innerJoined, 
                   qoi = "att" , 
                   outcome.var = "log_revenue", 
                   lead = 0:4)

This produces an error reading: "Error in panel_match(lag, time.id, unit.id, treatment, refinement.method,: Time, unit combinations should uniquely identify rows. Please remove duplicates"

What are these duplicates referring to? And what can I do to remove the duplicates?

PS. The other joining methods (left_join, right_join, full_join) remove/add data incorrectly and so these are not useful for me.

Thanks for any suggestions!

CodePudding user response：

Here’s a simple strategy to identify duplicate panel identifiers, using the paste() and the duplicated() functions:

library(PanelMatch)
library(dplyr)

Example1 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), year = c(2000, 2001, 2001, 2002, 2003, 2004, 2005, 2006, 2000, 2001, 2003, 2004, 2003, 2004, 2005, 2008), revenue = c(1024, 874, 874, 638, 606, 500, 1025, 1400, 2056, 1500, 1300, 2301, 1578, 560, 1204, 2300)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"))
Example2 <- structure(list(firmID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3), year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2001, 2002, 2003, 2004, 2005, 2006, 2000, 2001, 2002, 2003), `Director(binary)` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1)), row.names = c(NA, -18L), class = c("tbl_df", "tbl", "data.frame"))
innerJoined <- inner_join(Example1,Example2, by = c("firmID", "year"))
innerJoined <- as.data.frame(innerJoined)
innerJoined$year <- as.integer(innerJoined$year)
innerJoined$gvkey <- as.integer(innerJoined$firmID)
innerJoined$log_revt <- log(innerJoined$revenue)
innerJoined <- as.data.frame(innerJoined)

PM <- PanelMatch(lag = 5,
                  time.id = "year", unit.id = "firmID", 
                  treatment = "Directory(binary)", 
                   refinement.method = "none",
                   data = innerJoined, 
                   qoi = "att" , 
                   outcome.var = "log_revenue", 
                   lead = 0:4)
#> Error in panel_match(lag, time.id, unit.id, treatment, refinement.method, : Time, unit combinations should uniquely identify rows. Please remove duplicates

idx <- paste(innerJoined$firmID, innerJoined$year)
idx[duplicated(idx)]
#> [1] "1 2001"

In this example, there are two rows with firmId==1 and year==2001.

CodePudding user response：

Here is a data.table solution for identifying duplicates.

library(data.table)
setDT(innerJoined)[, .N, by=.(firmID, year)][order(-N)]

##    firmID year N
## 1:      1 2001 2
## 2:      1 2000 1
## 3:      1 2002 1
## 4:      1 2003 1
## ...

The first clause:setDT(innerJoined) converts your data.frame to a data.table. The second clause: [, .N, by=.(firmID, year)] calculates the number of rows in every combination of firmID and year. The third clause: [order(-N)] sorts the result by decreasing N, so the replicates are first.