I have two data frames that I want to merge. Both of them contain information about people, per id
& year
.
One of them is the "main", and the other adds information. However, I can't merge them in the regular way (i.e., merge()
or dplyr::left_join()
) because the year
values in them don't necessarily match per id
. So I want to chronologically carry over from what is known from the second table into the per year
rows in the main table.
In the following example, I have two tables about army officers. The "main" one has 3 columns for id
, year
, and another col_1
:
df_main_info <-
tribble(~id, ~year, ~col_1,
1, 2008, "foo",
1, 2005, "bar",
1, 2010, "blah",
1, 2020, "bar",
2, 1999, "foo",
2, 2020, "foo",
3, 2002, "bar",
3, 2010, "bar",
4, 2003, "foo",
4, 2010, "bar"
)
I have an additional table with id
and year
columns, for when each officer got their rank, and what rank it was:
df_ranks_history <-
tribble(~id, ~year, ~army_rank,
1, 2005, "second_lieutenant",
1, 2010, "first_lieutenant",
1, 2018, "major",
1, 2021, "colonel",
2, 2002, "major",
2, 2018, "colonel",
3, 1995, "second_lieutenant",
3, 2000, "captain",
3, 2012, "colonel"
)
The years don't match strictly. But if, for example, officer id = 3
became "captain"
in 2000, then we know that in 2002 it was still the case, so we can enter "captain" into df_main_info
in row 7.
The desired output should therefore be:
desired_output <-
tribble(~id, ~year, ~col_1, ~army_rank,
1, 2008, "foo", "second_lieutenant",
1, 2005, "bar", "second_lieutenant",
1, 2010, "blah", "first_lieutenant",
1, 2020, "bar", "major",
2, 1999, "foo", NA,
2, 2020, "foo", "colonel",
3, 2002, "bar", "captain",
3, 2010, "bar", "captain",
4, 2003, "foo", NA,
4, 2010, "bar", NA
)
In case this is relevant, the ranks go in a certain order:
us_army_officer_ranks <- c("second_lieutenant",
"first_lieutenant",
"captain",
"major",
"lieutenant_colonel",
"colonel")
# colonel > lieutenant_colonel > major > captain > first_lieutenant > second_lieutenant
CodePudding user response:
library(dplyr)
library(tidyr)
df_main_info %>%
full_join(df_ranks_history, by = c("id", "year")) %>%
group_by(id) %>%
arrange(id, year) %>%
fill(army_rank, .direction = "down") %>%
filter(!is.na(col_1))
# # A tibble: 10 × 4
# # Groups: id [4]
# id year col_1 army_rank
# <dbl> <dbl> <chr> <chr>
# 1 1 2005 bar second_lieutenant
# 2 1 2008 foo second_lieutenant
# 3 1 2010 blah first_lieutenant
# 4 1 2020 bar major
# 5 2 1999 foo NA
# 6 2 2020 foo colonel
# 7 3 2002 bar captain
# 8 3 2010 bar captain
# 9 4 2003 foo NA
# 10 4 2010 bar NA
CodePudding user response:
library(data.table)
setDT(df_main_info)
setDT(df_ranks_history)
df_ranks_history[df_main_info, on = list(id, year), roll = Inf]
id year army_rank col_1
1: 1 2008 second_lieutenant foo
2: 1 2005 second_lieutenant bar
3: 1 2010 first_lieutenant blah
4: 1 2020 major bar
5: 2 1999 <NA> foo
6: 2 2020 colonel foo
7: 3 2002 captain bar
8: 3 2010 captain bar
9: 4 2003 <NA> foo
10: 4 2010 <NA> bar