I can't imagine this question wasn't asked before, but I spend 2 hours of searching and didn't found anything.
Let's suppose I have 5 separate data frames that contains the same four variables for different years. There is one common variable called 'ID' in each data frame. All data frames are already in the long format (so the observations are listed underneath each other) for further analysis. Each data frame consists of a various amount of observations, so not all IDs are included in every data frame. The aim is to merge these data frames into panel data. Thereby, only complete cases should be considered. If one ID is missing for one year, it should be dropped out.
Here is some example code for better understanding:
#2010
df1=data.frame(ID=c(111,112,113,114),"year"=c(2010, 2010, 2010, 2010),"income"=c(3800, 2200, 1500, 2700),"state"=c("NI", "SH", "BY", "NI"))
df1
ID year income state
1 111 2010 3800 NI
2 112 2010 2200 SH
3 113 2010 1500 BY
4 114 2010 2700 NI
#2011
df2=data.frame(ID=c(112,113,114,115,116),"year"=c(2011, 2011, 2011, 2011, 2011),"income"=c(2300,1500,2500,4200,6000),"state"=c("BY", "BY", "SH", "BY", "HH"))
df2
ID year income state
1 112 2011 2300 BY
2 113 2011 1500 BY
3 114 2011 2500 SH
4 115 2011 4200 BY
5 116 2011 6000 HH
#2012
df3=data.frame(ID=c(109,112,113,114),"year"=c(2012,2012,2012,2012),"income"=c(1200,2500,1500,3000),"state"=c("BW", "BY", "NI", "SH"))
df3
ID year income state
1 109 2012 1200 BW
2 112 2012 2500 BY
3 113 2012 1500 NI
4 114 2012 3000 SH
#Desired result
df_final=data.frame(ID=c(112,112,112,113,113,113,114,114,114),"year"=c(2010,2011,2012,2010,2011,2012,2010,2011,2012),"income"=c(2200,2300,2500,1500,1500,1500,2700,2500,300),"state"=c("SH", "BY", "BY", "BY", "BY", "NI", "NI", "SH", "SH"))
df_final
ID year income state
1 112 2010 2200 SH
2 112 2011 2300 BY
3 112 2012 2500 BY
4 113 2010 1500 BY
5 113 2011 1500 BY
6 113 2012 1500 NI
7 114 2010 2700 NI
8 114 2011 2500 SH
9 114 2012 300 SH
I found a similar question where the panel data function from the reshape2 package was recommended. Although it worked fine, unfortunately it did not exclude cases.
Does anybody know a solution? I appreciate any kind of help.
Thanks in advance!
CodePudding user response:
We can get the datasets in a list
, find the common ids with intersect
, subset
the list
elements and rbind
in base R
lst1 <- list(df1, df2, df3)
commonIds <- Reduce(intersect, lapply(lst1, `[[`, "ID"))
out <- do.call(rbind, lapply(lst1, subset, subset = ID %in% commonIds))
out <- out[order(out$ID),]
row.names(out) <- NULL
-output
> out
ID year income state
1 112 2010 2200 SH
2 112 2011 2300 BY
3 112 2012 2500 BY
4 113 2010 1500 BY
5 113 2011 1500 BY
6 113 2012 1500 NI
7 114 2010 2700 NI
8 114 2011 2500 SH
9 114 2012 3000 SH