Home > Net >  Concatenating matches in a merge with multiple matches
Concatenating matches in a merge with multiple matches

Time:06-07

I have data as follows:

input_A <- data.frame(ID = c(1,2), some_var = c("bla","more bla"))

input_B <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
2, 2), year = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 
2003, 2001, 2002, 2003, 2001, 2002, 2003), Type = c("A", "A", 
"A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "C", "C", "C"
), Subtype = c(2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2), 
    Value = c(0.480513615083894, 0.909788893002047, 0.685141970365005, 
    0.138835747632889, 0.899508237239289, 0.535632890739584, 
    0.0712054637209442, 0.655905506366812, 0.694753916517691, 
    0.469249523993816, 0.295044859429007, 0.209906890342936, 
    0.193574644156237, 0.0715219759792846, 0.626529278499682)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -15L))

# A tibble: 15 × 5
      ID  year Type  Subtype  Value
   <dbl> <dbl> <chr>   <dbl>  <dbl>
 1     1  2001 A           2 0.481 
 2     1  2002 A           2 0.910 
 3     1  2003 A           2 0.685 
 4     1  2001 B           1 0.139 
 5     1  2002 B           1 0.900 
 6     1  2003 B           1 0.536 
 7     2  2001 A           1 0.0712
 8     2  2002 A           1 0.656 
 9     2  2003 A           1 0.695 
10     2  2001 B           1 0.469 
11     2  2002 B           1 0.295 
12     2  2003 B           1 0.210 
13     2  2001 C           2 0.194 
14     2  2002 C           2 0.0715
15     2  2003 C           2 0.627

I would like to merge this data. However:

  • Since the combination of ID year has multiple matches, this is not an option for me.
  • Since the combinations of type and subtype in my actual data are really large, also dcast is not an option.

What I would like to do, is to merge the first match by ID year (PREFERABLY THE ONE WITH THE HIGHEST VALUE), then merge the second match by ID year until there are no more matches.

I thought I could use this answer:

inputA[inputB, mult = "first", on = "ID", nomatch=0L]

remove the first matching rows from inputB and run it again until there are no more merges.

It seems a bit messy though and I was wondering if there is a better solution. What would be the best way to go about this?

Desired output:

output <- structure(list(ID = c(1, 1, 1, 2, 2, 2), some_var = c("bla", 
"bla", "bla", "more bla", "more bla", "more bla"), year = c(2001, 
2002, 2003, 2001, 2002, 2003), Type_1 = c("A", "A", "A", "A", 
"A", "A"), Subtype_1 = c(2, 2, 2, 1, 1, 1), Value_1 = c(0.480513615083894, 
0.909788893002047, 0.685141970365005, 0.0712054637209442, 0.655905506366812, 
0.694753916517691), Type_2 = c("B", "B", "B", "B", "B", "B"), 
    Subtype_2 = c(1, 1, 1, 1, 1, 1), Value_2 = c(0.138835747632889, 
    0.899508237239289, 0.535632890739584, 0.469249523993816, 
    0.295044859429007, 0.209906890342936), Type_3 = c(NA, NA, 
    NA, "C", "C", "C"), Subtype_3 = c(NA, NA, NA, 2, 2, 2), Value_3 = c(NA, 
    NA, NA, 0.193574644156237, 0.0715219759792846, 0.626529278499682
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

# A tibble: 6 × 12
     ID some_var  year Type_1 Subtype_1 Value_1 Type_2 Subtype_2 Value_2 Type_3 Subtype_3 Value_3
  <dbl> <chr>    <dbl> <chr>      <dbl>   <dbl> <chr>      <dbl>   <dbl> <chr>      <dbl>   <dbl>
1     1 bla       2001 A              2  0.481  B              1   0.139 NA            NA NA     
2     1 bla       2002 A              2  0.910  B              1   0.900 NA            NA NA     
3     1 bla       2003 A              2  0.685  B              1   0.536 NA            NA NA     
4     2 more bla  2001 A              1  0.0712 B              1   0.469 C              2  0.194 
5     2 more bla  2002 A              1  0.656  B              1   0.295 C              2  0.0715
6     2 more bla  2003 A              1  0.695  B              1   0.210 C              2  0.627 

CodePudding user response:

We can use pivot_wider with left_join

library(tidyr)
library(dplyr)
library(data.table)
input_B %>% 
  mutate(rn = rowid(ID, year)) %>%
  pivot_wider(names_from = rn, values_from = c(Type, Subtype, Value)) %>%
  left_join(input_A)

-output

# A tibble: 6 × 12
     ID  year Type_1 Type_2 Type_3 Subtype_1 Subtype_2 Subtype_3 Value_1 Value_2 Value_3 some_var
  <dbl> <dbl> <chr>  <chr>  <chr>      <dbl>     <dbl>     <dbl>   <dbl>   <dbl>   <dbl> <chr>   
1     1  2001 A      B      <NA>           2         1        NA  0.481    0.139 NA      bla     
2     1  2002 A      B      <NA>           2         1        NA  0.910    0.900 NA      bla     
3     1  2003 A      B      <NA>           2         1        NA  0.685    0.536 NA      bla     
4     2  2001 A      B      C              1         1         2  0.0712   0.469  0.194  more bla
5     2  2002 A      B      C              1         1         2  0.656    0.295  0.0715 more bla
6     2  2003 A      B      C              1         1         2  0.695    0.210  0.627  more bla
  • Related