I have data as follows:
input_A <- data.frame(ID = c(1,2), some_var = c("bla","more bla"))
input_B <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2), year = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002,
2003, 2001, 2002, 2003, 2001, 2002, 2003), Type = c("A", "A",
"A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "C", "C", "C"
), Subtype = c(2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
Value = c(0.480513615083894, 0.909788893002047, 0.685141970365005,
0.138835747632889, 0.899508237239289, 0.535632890739584,
0.0712054637209442, 0.655905506366812, 0.694753916517691,
0.469249523993816, 0.295044859429007, 0.209906890342936,
0.193574644156237, 0.0715219759792846, 0.626529278499682)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -15L))
# A tibble: 15 × 5
ID year Type Subtype Value
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 2001 A 2 0.481
2 1 2002 A 2 0.910
3 1 2003 A 2 0.685
4 1 2001 B 1 0.139
5 1 2002 B 1 0.900
6 1 2003 B 1 0.536
7 2 2001 A 1 0.0712
8 2 2002 A 1 0.656
9 2 2003 A 1 0.695
10 2 2001 B 1 0.469
11 2 2002 B 1 0.295
12 2 2003 B 1 0.210
13 2 2001 C 2 0.194
14 2 2002 C 2 0.0715
15 2 2003 C 2 0.627
I would like to merge this data. However:
- Since the combination of
ID year
has multiple matches, this is not an option for me. - Since the combinations of type and subtype in my actual data are really large, also
dcast
is not an option.
What I would like to do, is to merge the first match by ID year
(PREFERABLY THE ONE WITH THE HIGHEST VALUE), then merge the second match by ID year
until there are no more matches.
I thought I could use this answer:
inputA[inputB, mult = "first", on = "ID", nomatch=0L]
remove the first matching rows from inputB
and run it again until there are no more merges.
It seems a bit messy though and I was wondering if there is a better solution. What would be the best way to go about this?
Desired output:
output <- structure(list(ID = c(1, 1, 1, 2, 2, 2), some_var = c("bla",
"bla", "bla", "more bla", "more bla", "more bla"), year = c(2001,
2002, 2003, 2001, 2002, 2003), Type_1 = c("A", "A", "A", "A",
"A", "A"), Subtype_1 = c(2, 2, 2, 1, 1, 1), Value_1 = c(0.480513615083894,
0.909788893002047, 0.685141970365005, 0.0712054637209442, 0.655905506366812,
0.694753916517691), Type_2 = c("B", "B", "B", "B", "B", "B"),
Subtype_2 = c(1, 1, 1, 1, 1, 1), Value_2 = c(0.138835747632889,
0.899508237239289, 0.535632890739584, 0.469249523993816,
0.295044859429007, 0.209906890342936), Type_3 = c(NA, NA,
NA, "C", "C", "C"), Subtype_3 = c(NA, NA, NA, 2, 2, 2), Value_3 = c(NA,
NA, NA, 0.193574644156237, 0.0715219759792846, 0.626529278499682
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
# A tibble: 6 × 12
ID some_var year Type_1 Subtype_1 Value_1 Type_2 Subtype_2 Value_2 Type_3 Subtype_3 Value_3
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 bla 2001 A 2 0.481 B 1 0.139 NA NA NA
2 1 bla 2002 A 2 0.910 B 1 0.900 NA NA NA
3 1 bla 2003 A 2 0.685 B 1 0.536 NA NA NA
4 2 more bla 2001 A 1 0.0712 B 1 0.469 C 2 0.194
5 2 more bla 2002 A 1 0.656 B 1 0.295 C 2 0.0715
6 2 more bla 2003 A 1 0.695 B 1 0.210 C 2 0.627
CodePudding user response:
We can use pivot_wider
with left_join
library(tidyr)
library(dplyr)
library(data.table)
input_B %>%
mutate(rn = rowid(ID, year)) %>%
pivot_wider(names_from = rn, values_from = c(Type, Subtype, Value)) %>%
left_join(input_A)
-output
# A tibble: 6 × 12
ID year Type_1 Type_2 Type_3 Subtype_1 Subtype_2 Subtype_3 Value_1 Value_2 Value_3 some_var
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 2001 A B <NA> 2 1 NA 0.481 0.139 NA bla
2 1 2002 A B <NA> 2 1 NA 0.910 0.900 NA bla
3 1 2003 A B <NA> 2 1 NA 0.685 0.536 NA bla
4 2 2001 A B C 1 1 2 0.0712 0.469 0.194 more bla
5 2 2002 A B C 1 1 2 0.656 0.295 0.0715 more bla
6 2 2003 A B C 1 1 2 0.695 0.210 0.627 more bla