Merging data frames by id while interweaving years and carry values forward between years-CodePudding

I have two data frames that I want to merge. Both of them contain information about people, per id & year.

One of them is the "main", and the other adds information. However, I can't merge them in the regular way (i.e., merge() or dplyr::left_join()) because the year values in them don't necessarily match per id. So I want to chronologically carry over from what is known from the second table into the per year rows in the main table.

In the following example, I have two tables about army officers. The "main" one has 3 columns for id, year, and another col_1:

df_main_info <-
  tribble(~id, ~year, ~col_1, 
          1,   2008,  "foo",
          1,   2005,  "bar",
          1,   2010,  "blah", 
          1,   2020,  "bar",  
          2,   1999,  "foo", 
          2,   2020,  "foo",  
          3,   2002,  "bar",
          3,   2010,  "bar",
          4,   2003,  "foo",
          4,   2010,  "bar"
  )

I have an additional table with id and year columns, for when each officer got their rank, and what rank it was:

df_ranks_history <-
  tribble(~id, ~year, ~army_rank,
          1,   2005,  "second_lieutenant",
          1,   2010,  "first_lieutenant",
          1,   2018,  "major",
          1,   2021,  "colonel",
          2,   2002,  "major",
          2,   2018,  "colonel",
          3,   1995,  "second_lieutenant",
          3,   2000,  "captain",
          3,   2012,  "colonel"
  )

The years don't match strictly. But if, for example, officer id = 3 became "captain" in 2000, then we know that in 2002 it was still the case, so we can enter "captain" into df_main_info in row 7.

The desired output should therefore be:


desired_output <-
  tribble(~id, ~year, ~col_1, ~army_rank,
          1,   2008,  "foo",   "second_lieutenant",
          1,   2005,  "bar",   "second_lieutenant",
          1,   2010,  "blah",  "first_lieutenant",
          1,   2020,  "bar",   "major",
          2,   1999,  "foo",   NA,
          2,   2020,  "foo",   "colonel",
          3,   2002,  "bar",   "captain",
          3,   2010,  "bar",   "captain",
          4,   2003,  "foo",   NA,
          4,   2010,  "bar",   NA
          )

In case this is relevant, the ranks go in a certain order:

us_army_officer_ranks <- c("second_lieutenant", 
                           "first_lieutenant", 
                           "captain", 
                           "major", 
                           "lieutenant_colonel", 
                           "colonel")
# colonel > lieutenant_colonel > major > captain > first_lieutenant > second_lieutenant

CodePudding user response：

library(dplyr)
library(tidyr)

df_main_info %>% 
  full_join(df_ranks_history, by = c("id", "year")) %>%
  group_by(id) %>%
  arrange(id, year) %>%
  fill(army_rank, .direction = "down") %>%
  filter(!is.na(col_1))
# # A tibble: 10 × 4
# # Groups:   id [4]
#       id  year col_1 army_rank        
#    <dbl> <dbl> <chr> <chr>            
#  1     1  2005 bar   second_lieutenant
#  2     1  2008 foo   second_lieutenant
#  3     1  2010 blah  first_lieutenant 
#  4     1  2020 bar   major            
#  5     2  1999 foo   NA               
#  6     2  2020 foo   colonel          
#  7     3  2002 bar   captain          
#  8     3  2010 bar   captain          
#  9     4  2003 foo   NA               
# 10     4  2010 bar   NA

CodePudding user response：

library(data.table)

setDT(df_main_info)
setDT(df_ranks_history)

df_ranks_history[df_main_info, on = list(id, year), roll =  Inf]

    id year         army_rank col_1
 1:  1 2008 second_lieutenant   foo
 2:  1 2005 second_lieutenant   bar
 3:  1 2010  first_lieutenant  blah
 4:  1 2020             major   bar
 5:  2 1999              <NA>   foo
 6:  2 2020           colonel   foo
 7:  3 2002           captain   bar
 8:  3 2010           captain   bar
 9:  4 2003              <NA>   foo
10:  4 2010              <NA>   bar