Home > OS >  How to merge multiple data frames into panel data frame?
How to merge multiple data frames into panel data frame?

Time:09-02

I can't imagine this question wasn't asked before, but I spend 2 hours of searching and didn't found anything.

Let's suppose I have 5 separate data frames that contains the same four variables for different years. There is one common variable called 'ID' in each data frame. All data frames are already in the long format (so the observations are listed underneath each other) for further analysis. Each data frame consists of a various amount of observations, so not all IDs are included in every data frame. The aim is to merge these data frames into panel data. Thereby, only complete cases should be considered. If one ID is missing for one year, it should be dropped out.

Here is some example code for better understanding:

#2010
df1=data.frame(ID=c(111,112,113,114),"year"=c(2010, 2010, 2010, 2010),"income"=c(3800, 2200, 1500, 2700),"state"=c("NI", "SH", "BY", "NI"))
df1

ID year income state
1 111 2010   3800    NI
2 112 2010   2200    SH
3 113 2010   1500    BY
4 114 2010   2700    NI

#2011
df2=data.frame(ID=c(112,113,114,115,116),"year"=c(2011, 2011, 2011, 2011, 2011),"income"=c(2300,1500,2500,4200,6000),"state"=c("BY", "BY", "SH", "BY", "HH"))
df2

  ID year income state
1 112 2011   2300    BY
2 113 2011   1500    BY
3 114 2011   2500    SH
4 115 2011   4200    BY
5 116 2011   6000    HH

#2012
df3=data.frame(ID=c(109,112,113,114),"year"=c(2012,2012,2012,2012),"income"=c(1200,2500,1500,3000),"state"=c("BW", "BY", "NI", "SH"))
df3

   ID year income state
1 109 2012   1200    BW
2 112 2012   2500    BY
3 113 2012   1500    NI
4 114 2012   3000    SH

#Desired result
df_final=data.frame(ID=c(112,112,112,113,113,113,114,114,114),"year"=c(2010,2011,2012,2010,2011,2012,2010,2011,2012),"income"=c(2200,2300,2500,1500,1500,1500,2700,2500,300),"state"=c("SH", "BY", "BY", "BY", "BY", "NI", "NI", "SH", "SH"))
df_final

 ID year income state
1 112 2010   2200    SH
2 112 2011   2300    BY
3 112 2012   2500    BY
4 113 2010   1500    BY
5 113 2011   1500    BY
6 113 2012   1500    NI
7 114 2010   2700    NI
8 114 2011   2500    SH
9 114 2012    300    SH

I found a similar question where the panel data function from the reshape2 package was recommended. Although it worked fine, unfortunately it did not exclude cases.

Does anybody know a solution? I appreciate any kind of help.

Thanks in advance!

CodePudding user response:

We can get the datasets in a list, find the common ids with intersect, subset the list elements and rbind in base R

lst1 <- list(df1, df2, df3)
commonIds <- Reduce(intersect, lapply(lst1, `[[`, "ID"))
out <- do.call(rbind, lapply(lst1, subset, subset = ID %in% commonIds))
out <- out[order(out$ID),]
row.names(out) <- NULL

-output

> out
   ID year income state
1 112 2010   2200    SH
2 112 2011   2300    BY
3 112 2012   2500    BY
4 113 2010   1500    BY
5 113 2011   1500    BY
6 113 2012   1500    NI
7 114 2010   2700    NI
8 114 2011   2500    SH
9 114 2012   3000    SH
  • Related