I have this dataframe:
df <- tibble(company = c(rep('Google', 15), rep('Facebook', 15)),
date = c(
rep('2020END', 5),
rep('2021Q1', 5),
rep('2021Q2', 5),
rep('2020END', 5),
rep('2021Q1', 5),
rep('2021Q2', 5)
),
item = c(rep(c('a','b','c','d','e'), 6)),
values = c(c(10,20,30,40,50), c(1,2, NA, NA, 4),
c(1,2,3,NA,NA),
c(1,NA,5,NA,3), c(10,20,NA,40,NA),
c(1,2,NA,NA,NA)))
And i have the following function which fills in missing values by grouping by company and fills the missing values by the values column in a downward approach.
filler <- function(.x){
.x <- .x %>%
group_by(item) %>%
fill(values, .direction = "down")
}
The reason for this is that I would like any missing data to be filled by the previous quarter up until the value of END for the previous year. This is translated as the following logic:
- If 2021Q1 value is missing, then fill the missing value from 2020END
- If 2021Q2 value is missing, fill from 2021Q1, if missing fill from 2020END
- If 2021Q3 value is missing, fill from 2021Q2, if missing fill from 2021Q1, if missing fill from 2020END
- If 2021Q4 value is missing, fill from 2021Q3, if missing fill from 2021Q2, if missing fill from 2021Q1, if missing fill from 2020END
So the logic cycles backwards to the previous years' end value which is always in the dataframe.
So far i have the following code which does this:
df %>%
group_by(company) %>%
group_split() %>%
map(.x=., .f = filler)
How can i create a column which keeps track of the date to which the missing values were filled from?
e.g. have something such as paste0 of the date
column which is used to say where that value came from?
e.g. this table:
company | date | item | value | date_filled_from |
---|---|---|---|---|
2020END | a | 10 | 2020END | |
... | ... | ... | ... | ... |
2021Q1 | c | 20 | 2020END |
As from above it kept the date to which google & 2020Q1 was taken from during filling.
CodePudding user response:
here is a data.table
approach using a rolling join on rownumbers (assuming your data is already in chronological order)
library(data.table)
# set to data.table format
setDT(df)
# assuming the data is already ordered, create rownumbers
# for rolling to
df[, rn := .I]
# set join on all rows wit NA-values,
# with rolling baack to the previous rn with a non-NA value
df[is.na(values), c("filled_value", "filled_from") :=
df[!is.na(values), ][df[is.na(values), ],
.(values, date),
on = .(company, item, rn),
roll = Inf]]
output
company date item values rn filled_value filled_from
1: Google 2020END a 10 1 NA <NA>
2: Google 2020END b 20 2 NA <NA>
3: Google 2020END c 30 3 NA <NA>
4: Google 2020END d 40 4 NA <NA>
5: Google 2020END e 50 5 NA <NA>
6: Google 2021Q1 a 1 6 NA <NA>
7: Google 2021Q1 b 2 7 NA <NA>
8: Google 2021Q1 c NA 8 30 2020END
9: Google 2021Q1 d NA 9 40 2020END
10: Google 2021Q1 e 4 10 NA <NA>
11: Google 2021Q2 a 1 11 NA <NA>
12: Google 2021Q2 b 2 12 NA <NA>
13: Google 2021Q2 c 3 13 NA <NA>
14: Google 2021Q2 d NA 14 40 2020END
15: Google 2021Q2 e NA 15 4 2021Q1
16: Facebook 2020END a 1 16 NA <NA>
17: Facebook 2020END b NA 17 NA <NA>
18: Facebook 2020END c 5 18 NA <NA>
19: Facebook 2020END d NA 19 NA <NA>
20: Facebook 2020END e 3 20 NA <NA>
21: Facebook 2021Q1 a 10 21 NA <NA>
22: Facebook 2021Q1 b 20 22 NA <NA>
23: Facebook 2021Q1 c NA 23 5 2020END
24: Facebook 2021Q1 d 40 24 NA <NA>
25: Facebook 2021Q1 e NA 25 3 2020END
26: Facebook 2021Q2 a 1 26 NA <NA>
27: Facebook 2021Q2 b 2 27 NA <NA>
28: Facebook 2021Q2 c NA 28 5 2020END
29: Facebook 2021Q2 d NA 29 40 2021Q1
30: Facebook 2021Q2 e NA 30 3 2020END
company date item values rn filled_value filled_from
as you can see, Facebook 2020END b
and Facebook 2020END d
sitll have a NA-value, since thay have no preceeding value in values
. You can easily flip the join above to roll to -Inf
, so you will roll to the nearest next value of values.
df[is.na(filled_value) & is.na(values), c("filled_value", "filled_from") :=
df[!(is.na(filled_value) & is.na(values)), ][df[is.na(filled_value) & is.na(values), ],
.(values, date),
on = .(company, item, rn),
roll = -Inf]]
resulting in
company date item values rn filled_value filled_from
1: Google 2020END a 10 1 NA <NA>
2: Google 2020END b 20 2 NA <NA>
3: Google 2020END c 30 3 NA <NA>
4: Google 2020END d 40 4 NA <NA>
5: Google 2020END e 50 5 NA <NA>
6: Google 2021Q1 a 1 6 NA <NA>
7: Google 2021Q1 b 2 7 NA <NA>
8: Google 2021Q1 c NA 8 30 2020END
9: Google 2021Q1 d NA 9 40 2020END
10: Google 2021Q1 e 4 10 NA <NA>
11: Google 2021Q2 a 1 11 NA <NA>
12: Google 2021Q2 b 2 12 NA <NA>
13: Google 2021Q2 c 3 13 NA <NA>
14: Google 2021Q2 d NA 14 40 2020END
15: Google 2021Q2 e NA 15 4 2021Q1
16: Facebook 2020END a 1 16 NA <NA>
17: Facebook 2020END b NA 17 20 2021Q1
18: Facebook 2020END c 5 18 NA <NA>
19: Facebook 2020END d NA 19 40 2021Q1
20: Facebook 2020END e 3 20 NA <NA>
21: Facebook 2021Q1 a 10 21 NA <NA>
22: Facebook 2021Q1 b 20 22 NA <NA>
23: Facebook 2021Q1 c NA 23 5 2020END
24: Facebook 2021Q1 d 40 24 NA <NA>
25: Facebook 2021Q1 e NA 25 3 2020END
26: Facebook 2021Q2 a 1 26 NA <NA>
27: Facebook 2021Q2 b 2 27 NA <NA>
28: Facebook 2021Q2 c NA 28 5 2020END
29: Facebook 2021Q2 d NA 29 40 2021Q1
30: Facebook 2021Q2 e NA 30 3 2020END
company date item values rn filled_value filled_from