I have monthly values for a factor variable Alphabet
(containing factorlevels X, Y and Z) for 2021. See tibble DF
below.
Some observations only start occuring later in the year. For example, recordings of X only start in june 2021.
My problem is that the source of the data omits values if they are zero. So for example I have values for Y in January (Sum is 2), but the data on for February is omitted by the source since there are no values recorded. Then March Y has a Sum
of 5 so it's back.
What I want to do is complete this dataset. So for Y, I want a row that says feb 2021, Y, 0. I tried using the complete
function for this.
complete(Month2, nesting (Alphabet), fill = list( Sum=0))
However, this also completes the values for X for the whole year, which is incorrect. X only exists since june 2021. See tibble DF_Overcomplete
below.
My solution is to have complete function conditional per factor level. So only after it has encountered a Sum
that is not zero, it should start to complete and populate the Month2
, Alphabat
, and Sum
variables. For Y it can complete the whole year (since January has a Sum !=0), but for X I only want to apply the complete function from June.
DF:
structure(list(Month2 = structure(c(6L, 7L, 9L, 1L, 3L, 6L, 8L,
9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("jan 2021",
"feb 2021", "mrt 2021", "apr 2021", "mei 2021", "jun 2021", "jul 2021",
"aug 2021", "sep 2021", "okt 2021"), class = "factor"), Alphabet = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L), .Label = c("X", "Y", "Z"), class = "factor"), Sum = c(1L,
1L, 2L, 2L, 5L, 10L, 4L, 2L, 2L, 2L, 9L, 24L, 46L, 71L, 91L,
97L, 54L, 57L, 81L)), row.names = c(NA, -19L), class = c("tbl_df",
"tbl", "data.frame"))
DF_Overcomplete:
structure(list(Month2 = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("jan 2021", "feb 2021",
"mrt 2021", "apr 2021", "mei 2021", "jun 2021", "jul 2021", "aug 2021",
"sep 2021", "okt 2021"), class = "factor"), Alphabet = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("X",
"Y", "Z"), class = "factor"), Sum = c(0, 0, 0, 0, 0, 1, 1, 0,
2, 0, 2, 0, 5, 0, 0, 10, 0, 4, 2, 2, 2, 9, 24, 46, 71, 91, 97,
54, 57, 81)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-30L))
So basically I want this:
Thanks in advance!
CodePudding user response:
First change the factor Month2 to a Date so it's easily ordered. Then calculate the min and max month for each Alphabet. Use the complete as before, join and filter by the min and max.
library('dplyr')
library('tidyr')
df <- df %>% mutate(yr_mo = as.Date(paste('01', Month2), format = '%d %b %Y'))
start_end <- df %>% group_by(Alphabet) %>% summarise(min(yr_mo), max(yr_mo))
df_complete <- df %>%
complete(Month2, nesting(Alphabet), fill = list(Sum = 0L)) %>%
mutate(yr_mo = as.Date(paste('01', Month2), format = '%d %b %Y')) %>%
left_join(start_end) %>%
filter(yr_mo >= `min(yr_mo)`, yr_mo <= `max(yr_mo)`) %>%
select(-c(`min(yr_mo)`, `max(yr_mo)`, yr_mo))
df_complete %>% print(n = Inf)
# A tibble: 24 x 3
Month2 Alphabet Sum
<fct> <fct> <int>
1 jan 2021 Y 2
2 jan 2021 Z 2
3 feb 2021 Y 0
4 feb 2021 Z 9
5 mrt 2021 Y 5
6 mrt 2021 Z 24
7 apr 2021 Y 0
8 apr 2021 Z 46
9 mei 2021 Y 0
10 mei 2021 Z 71
11 jun 2021 X 1
12 jun 2021 Y 10
13 jun 2021 Z 91
14 jul 2021 X 1
15 jul 2021 Y 0
16 jul 2021 Z 97
17 aug 2021 X 0
18 aug 2021 Y 4
19 aug 2021 Z 54
20 sep 2021 X 2
21 sep 2021 Y 2
22 sep 2021 Z 57
23 okt 2021 Y 2
24 okt 2021 Z 81