Home > Blockchain >  Merging two dataframes by multiple columns without losing data
Merging two dataframes by multiple columns without losing data

Time:10-08

I am new to R and have two very large datasets I want to merge. They look as follows

      ID     year   val1  val3
 1     1     2001      2   34 
 2     2     2004      1   25
 3     3     2003      3   36
 4     4     2003      2   46
 5     5     1999      1   55
 6     6     2005      3   44

The second dataframe is as follows

      ID     year   val2 
 1     1     2001      2
 2     2     2004      1
 3     3     2003      3
 4     4     2002      5
 5     5     1998      4
 6     6     2004      6

I want the final merged set to look like this

      ID     year   val1  val3 val2
 1     1     2001      2   34     2
 2     2     2004      1   25     1
 3     3     2003      3   36     3
 4     4     2002     NA   NA     5
 5     4     2003      2   46    NA
 6     5     1998     NA   NA     4
 7     5     1999      1   55    NA
 8     6     2004     NA   NA     6
 9     6     2005      3   44    NA

I tried merging by ID and year using the following

 total <- merge(df1,df2,by=c("id","year"))

But this results in only merging if ID and year BOTH match. I want to it to happen so that if the ID matches but year doesn't match, a new row will add in the same ID the entry for year and val2 while leaving val1 and val3 as NA.

I then tried merging only by ID and then removing rows if year.x != year.y, but since the datasets were too large it wasn't very efficient.

CodePudding user response:

merge has an argument all that specifies if you want to keep all rows from left and right side (i.e. all rows from x and all rows from y)

 total <- merge(df1,df2,by=c("id","year"), all=TRUE)

CodePudding user response:

You must specify all.x=TRUE and all.y=TRUE so you keep all unique rows from both datasets

 total <- merge(df1,df2,by=c("id","year"),all.x=TRUE,all.y=TRUE)
  • Related