Home > database >  Identifying which values were used for filling data in R?
Identifying which values were used for filling data in R?

Time:08-04

I have this dataframe:

df <- tibble(company = c(rep('Google', 15), rep('Facebook', 15)),
                  date = c(
                    rep('2020END', 5),
                    rep('2021Q1', 5),
                    rep('2021Q2', 5),
                    rep('2020END', 5),
                    rep('2021Q1', 5),
                    rep('2021Q2', 5)
                    
                  ), 
                  item = c(rep(c('a','b','c','d','e'), 6)), 
                  values = c(c(10,20,30,40,50), c(1,2, NA, NA, 4),
                             c(1,2,3,NA,NA),
                             c(1,NA,5,NA,3), c(10,20,NA,40,NA), 
                             c(1,2,NA,NA,NA)))

And i have the following function which fills in missing values by grouping by company and fills the missing values by the values column in a downward approach.

filler <- function(.x){
  .x <- .x %>% 
    group_by(item) %>% 
    fill(values, .direction = "down")
}

The reason for this is that I would like any missing data to be filled by the previous quarter up until the value of END for the previous year. This is translated as the following logic:

  1. If 2021Q1 value is missing, then fill the missing value from 2020END
  2. If 2021Q2 value is missing, fill from 2021Q1, if missing fill from 2020END
  3. If 2021Q3 value is missing, fill from 2021Q2, if missing fill from 2021Q1, if missing fill from 2020END
  4. If 2021Q4 value is missing, fill from 2021Q3, if missing fill from 2021Q2, if missing fill from 2021Q1, if missing fill from 2020END

So the logic cycles backwards to the previous years' end value which is always in the dataframe.

So far i have the following code which does this:

df %>%  
  group_by(company) %>%  
  group_split() %>%  
  map(.x=., .f = filler)

How can i create a column which keeps track of the date to which the missing values were filled from?

e.g. have something such as paste0 of the date column which is used to say where that value came from?

e.g. this table:

company date item value date_filled_from
Google 2020END a 10 2020END
... ... ... ... ...
Google 2021Q1 c 20 2020END

As from above it kept the date to which google & 2020Q1 was taken from during filling.

CodePudding user response:

here is a data.table approach using a rolling join on rownumbers (assuming your data is already in chronological order)

library(data.table)
# set to data.table format
setDT(df)
# assuming the data is already ordered, create rownumbers
# for rolling to
df[, rn := .I]
# set join on all rows wit NA-values,
#   with rolling baack to the previous rn with a non-NA value
df[is.na(values), c("filled_value", "filled_from") := 
     df[!is.na(values), ][df[is.na(values), ], 
                          .(values, date), 
                          on = .(company, item, rn), 
                          roll = Inf]]

output

     company    date item values rn filled_value filled_from
 1:   Google 2020END    a     10  1           NA        <NA>
 2:   Google 2020END    b     20  2           NA        <NA>
 3:   Google 2020END    c     30  3           NA        <NA>
 4:   Google 2020END    d     40  4           NA        <NA>
 5:   Google 2020END    e     50  5           NA        <NA>
 6:   Google  2021Q1    a      1  6           NA        <NA>
 7:   Google  2021Q1    b      2  7           NA        <NA>
 8:   Google  2021Q1    c     NA  8           30     2020END
 9:   Google  2021Q1    d     NA  9           40     2020END
10:   Google  2021Q1    e      4 10           NA        <NA>
11:   Google  2021Q2    a      1 11           NA        <NA>
12:   Google  2021Q2    b      2 12           NA        <NA>
13:   Google  2021Q2    c      3 13           NA        <NA>
14:   Google  2021Q2    d     NA 14           40     2020END
15:   Google  2021Q2    e     NA 15            4      2021Q1
16: Facebook 2020END    a      1 16           NA        <NA>
17: Facebook 2020END    b     NA 17           NA        <NA>
18: Facebook 2020END    c      5 18           NA        <NA>
19: Facebook 2020END    d     NA 19           NA        <NA>
20: Facebook 2020END    e      3 20           NA        <NA>
21: Facebook  2021Q1    a     10 21           NA        <NA>
22: Facebook  2021Q1    b     20 22           NA        <NA>
23: Facebook  2021Q1    c     NA 23            5     2020END
24: Facebook  2021Q1    d     40 24           NA        <NA>
25: Facebook  2021Q1    e     NA 25            3     2020END
26: Facebook  2021Q2    a      1 26           NA        <NA>
27: Facebook  2021Q2    b      2 27           NA        <NA>
28: Facebook  2021Q2    c     NA 28            5     2020END
29: Facebook  2021Q2    d     NA 29           40      2021Q1
30: Facebook  2021Q2    e     NA 30            3     2020END
     company    date item values rn filled_value filled_from

as you can see, Facebook 2020END b and Facebook 2020END d sitll have a NA-value, since thay have no preceeding value in values. You can easily flip the join above to roll to -Inf, so you will roll to the nearest next value of values.

df[is.na(filled_value) & is.na(values), c("filled_value", "filled_from") := 
     df[!(is.na(filled_value) & is.na(values)), ][df[is.na(filled_value) & is.na(values), ], 
                          .(values, date), 
                          on = .(company, item, rn), 
                          roll = -Inf]]

resulting in

     company    date item values rn filled_value filled_from
 1:   Google 2020END    a     10  1           NA        <NA>
 2:   Google 2020END    b     20  2           NA        <NA>
 3:   Google 2020END    c     30  3           NA        <NA>
 4:   Google 2020END    d     40  4           NA        <NA>
 5:   Google 2020END    e     50  5           NA        <NA>
 6:   Google  2021Q1    a      1  6           NA        <NA>
 7:   Google  2021Q1    b      2  7           NA        <NA>
 8:   Google  2021Q1    c     NA  8           30     2020END
 9:   Google  2021Q1    d     NA  9           40     2020END
10:   Google  2021Q1    e      4 10           NA        <NA>
11:   Google  2021Q2    a      1 11           NA        <NA>
12:   Google  2021Q2    b      2 12           NA        <NA>
13:   Google  2021Q2    c      3 13           NA        <NA>
14:   Google  2021Q2    d     NA 14           40     2020END
15:   Google  2021Q2    e     NA 15            4      2021Q1
16: Facebook 2020END    a      1 16           NA        <NA>
17: Facebook 2020END    b     NA 17           20      2021Q1
18: Facebook 2020END    c      5 18           NA        <NA>
19: Facebook 2020END    d     NA 19           40      2021Q1
20: Facebook 2020END    e      3 20           NA        <NA>
21: Facebook  2021Q1    a     10 21           NA        <NA>
22: Facebook  2021Q1    b     20 22           NA        <NA>
23: Facebook  2021Q1    c     NA 23            5     2020END
24: Facebook  2021Q1    d     40 24           NA        <NA>
25: Facebook  2021Q1    e     NA 25            3     2020END
26: Facebook  2021Q2    a      1 26           NA        <NA>
27: Facebook  2021Q2    b      2 27           NA        <NA>
28: Facebook  2021Q2    c     NA 28            5     2020END
29: Facebook  2021Q2    d     NA 29           40      2021Q1
30: Facebook  2021Q2    e     NA 30            3     2020END
     company    date item values rn filled_value filled_from
  • Related