Home > Back-end >  combine two datasets with different dimension by ID
combine two datasets with different dimension by ID

Time:03-14

I have two datasets:

df1:
    ID score
1   1    30
2   1    10
3   1    22
4   2    44
5   2     6
6   3     5
7   3    20
8   4    35
9   4     2
10  4    60
11  5    14
12  5     5

df2:
    ID para1 para2
1   1    10     5
2   1    10     5
3   2    20    10
4   2    20    10
5   3    30    15
6   4    40    20
7   4    40    20
8   4    40    20
9   4    40    20
10  5    50    25
11  5    50    25
12  5    50    25
13  6    60    30
14  6    60    30

I would like to combine df1 and df2 by ID and get df3 below. Tried merge and left_join but they don't work well as I probably missed something. Any simple way to get this?

df3:
    ID score para1 para2
1   1    30    10     5
2   1    10    10     5
3   1    22    10     5
4   2    44    20    10
5   2     6    20    10
6   3     5    30    15
7   3    20    30    15
8   4    35    40    20
9   4     2    40    20
10  4    60    40    20
11  5    14    50    25
12  5     5    50    25

CodePudding user response:

One option to achieve your desired result would be to first get rid of the duplicated rows in your df2 using e.g. dplyr::distinct:

library(dplyr)

df1 %>% 
  left_join(distinct(df2, ID, para1, para2))
#> Joining, by = "ID"
#>    ID score para1 para2
#> 1   1    30    10     5
#> 2   1    10    10     5
#> 3   1    22    10     5
#> 4   2    44    20    10
#> 5   2     6    20    10
#> 6   3     5    30    15
#> 7   3    20    30    15
#> 8   4    35    40    20
#> 9   4     2    40    20
#> 10  4    60    40    20
#> 11  5    14    50    25
#> 12  5     5    50    25

DATA

df1 <- structure(list(ID = c(
  1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
  5L, 5L
), score = c(
  30L, 10L, 22L, 44L, 6L, 5L, 20L, 35L, 2L,
  60L, 14L, 5L
)), class = "data.frame", row.names = c(
  "1", "2",
  "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
))

df2 <- structure(list(ID = c(
  1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 5L,
  5L, 5L, 6L, 6L
), para1 = c(
  10L, 10L, 20L, 20L, 30L, 40L, 40L,
  40L, 40L, 50L, 50L, 50L, 60L, 60L
), para2 = c(
  5L, 5L, 10L, 10L,
  15L, 20L, 20L, 20L, 20L, 25L, 25L, 25L, 30L, 30L
)), class = "data.frame", row.names = c(
  "1",
  "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
  "14"
))

CodePudding user response:

Another approach using data.table would be to do a join and keep the first matched row in the second data.frame.

In this case, take the subset of df2, where that key column's values match df1's key column's values based on ID key.

You can also include a nomatch argument to fill in a value if there's no match. See ?data.table for more details.

library(data.table)

setDT(df1)
setDT(df2)

df2[df1, mult = "first", on = "ID"]

Output

    ID para1 para2 score
 1:  1    10     5    30
 2:  1    10     5    10
 3:  1    10     5    22
 4:  2    20    10    44
 5:  2    20    10     6
 6:  3    30    15     5
 7:  3    30    15    20
 8:  4    40    20    35
 9:  4    40    20     2
10:  4    40    20    60
11:  5    50    25    14
12:  5    50    25     5
  • Related