How to add missing rows to a data frame-CodePudding

I have the following data frame where the distribution of two expressions (yes and no) are shown for the 2010 to 2020 cohorts.

df <- structure(list(var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L), levels = c("AAA", "BBB", "CCC", "DDD", 
"EEE", "FFF", 
"GGG", "HHH", "III", "JJJ", "KKK"
), class = "factor"), cohort = structure(c(1L, 1L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 9L, 9L, 10L, 10L, 11L, 11L), levels = c("2010", 
"2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", 
"2019", "2020"), class = "factor"), var2use = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("yes", 
"no"), label = c(rsz = "blabla"), class = c("labelled", 
"factor")), n = c(10L, 8L, 19L, 13L, 24L, 28L, 19L, 21L, 21L, 
16L, 23L, 13L, 38L, 25L, 24L, 28L), proportion = c(0.555555555555556, 
0.444444444444444, 0.59375, 0.40625, 0.461538461538462, 0.538461538461538, 
0.475, 0.525, 0.567567567567568, 0.432432432432432, 0.638888888888889, 
0.361111111111111, 0.603174603174603, 0.396825396825397, 0.461538461538462, 
0.538461538461538)), class = c("grouped_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -16L), groups = structure(list(
    var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L, 11L, 11L, 
    11L), levels = c("AAA", 
"BBB", "CCC", "DDD", 
"EEE", "FFF", 
"GGG", "HHH", "III", "JJJ", "KKK"), class = "factor"), cohort = structure(c(1L, 
    2L, 3L, 4L, 5L, 9L, 10L, 11L), levels = c("2010", "2011", 
    "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", 
    "2020"), class = "factor"), .rows = structure(list(1:2, 3:4, 
        5:6, 7:8, 9:10, 11:12, 13:14, 15:16), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))

Unfortunately, corresponding values are missing for some cohorts (here in the example 2015, 2016 and 2017). I am looking for a way to automatically add the missing rows to the dataset, where the contents of the columns n and proportion should then be NA.

Maybe the function complete from the tidyr-package could be used here?

CodePudding user response：

You can get the cohort years range and use summarize() to expand the dataset, then left join back on the orginal:

df<-ungroup(df)

yrs = range(as.numeric(levels(df$cohort)))
unique(df[,c(1,3)]) %>% 
  group_by(var2kreuz,var2use) %>% 
  summarize(cohort = factor(yrs[1]:yrs[2])) %>% 
  left_join(df)

Alternatively, you can use complete() like this:

df %>% mutate(across(c(var2kreuz, var2use),as.character)) %>% 
  complete(var2kreuz, var2use,cohort)

Output:

   var2kreuz var2use cohort  n proportion
1        KKK     yes   2010 10  0.5555556
2        KKK     yes   2011 19  0.5937500
3        KKK     yes   2012 24  0.4615385
4        KKK     yes   2013 19  0.4750000
5        KKK     yes   2014 21  0.5675676
6        KKK     yes   2015 NA         NA
7        KKK     yes   2016 NA         NA
8        KKK     yes   2017 NA         NA
9        KKK     yes   2018 23  0.6388889
10       KKK     yes   2019 38  0.6031746
11       KKK     yes   2020 24  0.4615385
12       KKK      no   2010  8  0.4444444
13       KKK      no   2011 13  0.4062500
14       KKK      no   2012 28  0.5384615
15       KKK      no   2013 21  0.5250000
16       KKK      no   2014 16  0.4324324
17       KKK      no   2015 NA         NA
18       KKK      no   2016 NA         NA
19       KKK      no   2017 NA         NA
20       KKK      no   2018 13  0.3611111
21       KKK      no   2019 25  0.3968254
22       KKK      no   2020 28  0.5384615