Home > Software engineering >  "Filling in" Missing Time-Invariant Data After Merging
"Filling in" Missing Time-Invariant Data After Merging

Time:01-20

After merging two time-series datasets I have some missings in a time-invariant variable(value_2) which I would like to complete. However, some units are missing this variable completely, as can be seen, in my explanatory data:

    ID Year value_1 value_2
1   1 2000       a      NA
2   1 2001       b      NA
3   1 2002       r       2
4   1 2004       s       2
5   2 2000       w      NA
6   2 2001       r      NA
7   2 2002       d      NA
8   2 2004       s      NA
9   3 2000       e       3
10  3 2001       f       3
11  3 2002       s      NA
12  3 2004      w2      NA
13  4 2000       e      NA
14  4 2001       r      NA
15  4 2002       e      NA
16  4 2004       w      NA

How can I complete the missing data the best, while keeping the value_2 values for the units without any observation at NA? The output should look like this:

ID Year value_1 value_2
1   1 2000       a       2
2   1 2001       b       2
3   1 2002       r       2
4   1 2004       s       2
5   2 2000       w      NA
6   2 2001       r      NA
7   2 2002       d      NA
8   2 2004       s      NA
9   3 2000       e       3
10  3 2001       f       3
11  3 2002       s       3
12  3 2004      w2       3
13  4 2000       e      NA
14  4 2001       r      NA
15  4 2002       e      NA
16  4 2004       w      NA

CodePudding user response:

You can use tidyr::fill() by groups.

library(dplyr)

df %>%
  group_by(ID) %>%
  tidyr::fill(value_2, .direction = "downup") %>%
  ungroup()

# A tibble: 16 × 4
      ID  Year value_1 value_2
   <dbl> <dbl> <chr>     <dbl>
 1     1  2000 a             2
 2     1  2001 b             2
 3     1  2002 r             2
 4     1  2004 s             2
 5     2  2000 w            NA
 6     2  2001 r            NA
 7     2  2002 d            NA
 8     2  2004 s            NA
 9     3  2000 e             3
10     3  2001 f             3
11     3  2002 s             3
12     3  2004 w2            3
13     4  2000 e            NA
14     4  2001 r            NA
15     4  2002 e            NA
16     4  2004 w            NA

CodePudding user response:

How about this:

library(dplyr)
d <- tibble::tribble(
  ~ID, ~Year, ~value_1, ~value_2,
1, 2000,       "a",      NA,
1, 2001,       "b",      NA,
1, 2002,       "r",       2,
1, 2004,       "s",       2,
2, 2000,       "w",      NA,
2, 2001,       "r",      NA,
2, 2002,       "d",      NA,
2, 2004,       "s",      NA,
3, 2000,       "e",       3,
3, 2001,       "f",       3,
3, 2002,       "s",      NA,
3, 2004,      "w2",      NA,
4, 2000,       "e",      NA,
4, 2001,       "r",      NA,
4, 2002,       "e",      NA,
4, 2004,       "w",      NA)


d %>% group_by(ID) %>% 
  arrange(value_2, .by_group = TRUE) %>% 
  mutate(value_2 = first(value_2)) %>% 
  arrange(ID, Year)
#> # A tibble: 16 × 4
#> # Groups:   ID [4]
#>       ID  Year value_1 value_2
#>    <dbl> <dbl> <chr>     <dbl>
#>  1     1  2000 a             2
#>  2     1  2001 b             2
#>  3     1  2002 r             2
#>  4     1  2004 s             2
#>  5     2  2000 w            NA
#>  6     2  2001 r            NA
#>  7     2  2002 d            NA
#>  8     2  2004 s            NA
#>  9     3  2000 e             3
#> 10     3  2001 f             3
#> 11     3  2002 s             3
#> 12     3  2004 w2            3
#> 13     4  2000 e            NA
#> 14     4  2001 r            NA
#> 15     4  2002 e            NA
#> 16     4  2004 w            NA

Created on 2023-01-19 by the reprex package (v2.0.1)

  • Related