I have the following data frame where the distribution of two expressions (yes and no) are shown for the 2010 to 2020 cohorts.
df <- structure(list(var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L), levels = c("AAA", "BBB", "CCC", "DDD",
"EEE", "FFF",
"GGG", "HHH", "III", "JJJ", "KKK"
), class = "factor"), cohort = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 9L, 9L, 10L, 10L, 11L, 11L), levels = c("2010",
"2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018",
"2019", "2020"), class = "factor"), var2use = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("yes",
"no"), label = c(rsz = "blabla"), class = c("labelled",
"factor")), n = c(10L, 8L, 19L, 13L, 24L, 28L, 19L, 21L, 21L,
16L, 23L, 13L, 38L, 25L, 24L, 28L), proportion = c(0.555555555555556,
0.444444444444444, 0.59375, 0.40625, 0.461538461538462, 0.538461538461538,
0.475, 0.525, 0.567567567567568, 0.432432432432432, 0.638888888888889,
0.361111111111111, 0.603174603174603, 0.396825396825397, 0.461538461538462,
0.538461538461538)), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -16L), groups = structure(list(
var2kreuz = structure(c(11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L), levels = c("AAA",
"BBB", "CCC", "DDD",
"EEE", "FFF",
"GGG", "HHH", "III", "JJJ", "KKK"), class = "factor"), cohort = structure(c(1L,
2L, 3L, 4L, 5L, 9L, 10L, 11L), levels = c("2010", "2011",
"2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019",
"2020"), class = "factor"), .rows = structure(list(1:2, 3:4,
5:6, 7:8, 9:10, 11:12, 13:14, 15:16), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))
Unfortunately, corresponding values are missing for some cohort
s (here in the example 2015, 2016 and 2017). I am looking for a way to automatically add the missing rows to the dataset, where the contents of the columns n
and proportion
should then be NA
.
Maybe the function complete
from the tidyr
-package could be used here?
CodePudding user response:
You can get the cohort years range and use summarize()
to expand the dataset, then left join back on the orginal:
df<-ungroup(df)
yrs = range(as.numeric(levels(df$cohort)))
unique(df[,c(1,3)]) %>%
group_by(var2kreuz,var2use) %>%
summarize(cohort = factor(yrs[1]:yrs[2])) %>%
left_join(df)
Alternatively, you can use complete()
like this:
df %>% mutate(across(c(var2kreuz, var2use),as.character)) %>%
complete(var2kreuz, var2use,cohort)
Output:
var2kreuz var2use cohort n proportion
1 KKK yes 2010 10 0.5555556
2 KKK yes 2011 19 0.5937500
3 KKK yes 2012 24 0.4615385
4 KKK yes 2013 19 0.4750000
5 KKK yes 2014 21 0.5675676
6 KKK yes 2015 NA NA
7 KKK yes 2016 NA NA
8 KKK yes 2017 NA NA
9 KKK yes 2018 23 0.6388889
10 KKK yes 2019 38 0.6031746
11 KKK yes 2020 24 0.4615385
12 KKK no 2010 8 0.4444444
13 KKK no 2011 13 0.4062500
14 KKK no 2012 28 0.5384615
15 KKK no 2013 21 0.5250000
16 KKK no 2014 16 0.4324324
17 KKK no 2015 NA NA
18 KKK no 2016 NA NA
19 KKK no 2017 NA NA
20 KKK no 2018 13 0.3611111
21 KKK no 2019 25 0.3968254
22 KKK no 2020 28 0.5384615