I have two datasets:
df1:
ID score
1 1 30
2 1 10
3 1 22
4 2 44
5 2 6
6 3 5
7 3 20
8 4 35
9 4 2
10 4 60
11 5 14
12 5 5
df2:
ID para1 para2
1 1 10 5
2 1 10 5
3 2 20 10
4 2 20 10
5 3 30 15
6 4 40 20
7 4 40 20
8 4 40 20
9 4 40 20
10 5 50 25
11 5 50 25
12 5 50 25
13 6 60 30
14 6 60 30
I would like to combine df1 and df2 by ID and get df3 below. Tried merge and left_join but they don't work well as I probably missed something. Any simple way to get this?
df3:
ID score para1 para2
1 1 30 10 5
2 1 10 10 5
3 1 22 10 5
4 2 44 20 10
5 2 6 20 10
6 3 5 30 15
7 3 20 30 15
8 4 35 40 20
9 4 2 40 20
10 4 60 40 20
11 5 14 50 25
12 5 5 50 25
CodePudding user response:
One option to achieve your desired result would be to first get rid of the duplicated rows in your df2
using e.g. dplyr::distinct
:
library(dplyr)
df1 %>%
left_join(distinct(df2, ID, para1, para2))
#> Joining, by = "ID"
#> ID score para1 para2
#> 1 1 30 10 5
#> 2 1 10 10 5
#> 3 1 22 10 5
#> 4 2 44 20 10
#> 5 2 6 20 10
#> 6 3 5 30 15
#> 7 3 20 30 15
#> 8 4 35 40 20
#> 9 4 2 40 20
#> 10 4 60 40 20
#> 11 5 14 50 25
#> 12 5 5 50 25
DATA
df1 <- structure(list(ID = c(
1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
5L, 5L
), score = c(
30L, 10L, 22L, 44L, 6L, 5L, 20L, 35L, 2L,
60L, 14L, 5L
)), class = "data.frame", row.names = c(
"1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
))
df2 <- structure(list(ID = c(
1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 6L, 6L
), para1 = c(
10L, 10L, 20L, 20L, 30L, 40L, 40L,
40L, 40L, 50L, 50L, 50L, 60L, 60L
), para2 = c(
5L, 5L, 10L, 10L,
15L, 20L, 20L, 20L, 20L, 25L, 25L, 25L, 30L, 30L
)), class = "data.frame", row.names = c(
"1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"
))
CodePudding user response:
Another approach using data.table
would be to do a join and keep the first matched row in the second data.frame.
In this case, take the subset of df2
, where that key column's values match df1
's key column's values based on ID
key.
You can also include a nomatch
argument to fill in a value if there's no match. See ?data.table
for more details.
library(data.table)
setDT(df1)
setDT(df2)
df2[df1, mult = "first", on = "ID"]
Output
ID para1 para2 score
1: 1 10 5 30
2: 1 10 5 10
3: 1 10 5 22
4: 2 20 10 44
5: 2 20 10 6
6: 3 30 15 5
7: 3 30 15 20
8: 4 40 20 35
9: 4 40 20 2
10: 4 40 20 60
11: 5 50 25 14
12: 5 50 25 5