Home > Software design >  How to fill in the missing values in a dataframe in R, where the logic for filling in missing values
How to fill in the missing values in a dataframe in R, where the logic for filling in missing values

Time:10-26

I am working with cost data for a retailer, where I am making some predictions using gam that look as follows (sample data, self-generated). The GAM fits values in the middle, but has some NAs at the extremes. The elasticity is calculated as a percentage change in cost over percentage change in items.

df <- tibble(
  factor = seq(0.7,1.3, 0.1),
  items = c(7, 8, 9, 10, 11, 12, 13),
  cost = c(NA, NA, 70, 80, 90, NA, NA),
  elasticity = c(NA, NA, 0.5, 0.6, 0.7, NA, NA)
)

An easy estimate for the elasticises is to extend the last known value up and down.

df %>%
    fill(elasticity, .direction = 'updown') ->
    df
factor items cost elasticity
0.7 7 NA 0.5
0.8 8 NA 0.5
0.9 9 70 0.5
1.0 10 80 0.6
1.1 11 90 0.7
1.2 12 NA 0.7
1.3 13 NA 0.7

I would like to calculate the cost, having estimated the elasticity of cost. For example, for a factor of 1.2, the items are 12 and elasticity is 0.7. The percentage change in items is (12-11)/11 = 9.09%, so the percentage change in cost should be 0.7 * 9.09% = 6.36%. Since cost for factor of 1.1 is 90, the cost for factor of 1.2 is 95.72. And the same propagated both down and up.

I cannot figure a way of doing this. Can someone suggest how this can be done in R and preferably in dplyr?

CodePudding user response:

One possible solution using dplyr and purrr:

Libraries

library(dplyr)
library(tidyr)
library(purrr) # used for pmap() in dbl_fill()

Data

NB you should avoid using <- when naming list elements, tibble columns and the like.

df <- tibble(
  factor = seq(0.7,1.3, 0.1),
  items = c(7, 8, 9, 10, 11, 12, 13),
  cost = c(NA, NA, 70, 80, 90, NA, NA),
  elasticity = c(NA, NA, 0.5, 0.6, 0.7, NA, NA)
)

df <- fill(df, elasticity, .direction = 'updown')

Create dbl_fill()

I realised whilst mutating df() that I needed to recursively lag() or lead() columns according to the 'depth' of missing values. I assumed your data was example data, so figured it would be best to make a generic function that lags/leads a numeric vector up to a maximum acceptable depth (2 is enough for this example).

dbl_fill <- function(x, lag_or_lead = c("lag", "lead"), max_fill = 2){

  lag_or_lead <- match.arg(lag_or_lead)
  if(lag_or_lead == "lag") fill_function <- lag
  else fill_function <- lead
  
  n_list <- as.list(1:max_fill)
  
  fill_list <- lapply(n_list, function(y) fill_function(x, y))
 
  vector_out <- pmap_dbl(fill_list, coalesce)
  
  return(vector_out)
}

Perform various mutations

We can perform prolific mutations on df, then just select() away the columns we don't want to keep.

I'm lagging/leading df$items here because I don't want to assume values in that column are incremented by one between rows. But if they are, it would be sufficient to do items - 1 as opposed to lag(items). For the example, these are equivalent.

df <- mutate(df,
             ## calculate new costs from lagged values
             pc_change_items = (items - lag(items)) / lag(items),
             lagged_cost     = dbl_fill(cost, "lag", 2),
             pc_change_cost  = pc_change_items * elasticity,
             lag_cost        = lagged_cost * (1   pc_change_cost),
             
             ## calculate new costs from lead'ed values
             pc_change_items = (lead(items) - items) / items,
             leaded_cost     = dbl_fill(cost, "lead", 2),
             pc_change_cost  = pc_change_items * elasticity,
             lead_cost       = leaded_cost * (1 - pc_change_cost)
             )

Drop intermediate columns

df <- select(df, factor, items, cost, elasticity, lag_cost, lead_cost)
df
#> # A tibble: 7 × 6
#>   factor items  cost elasticity lag_cost lead_cost
#>    <dbl> <dbl> <dbl>      <dbl>    <dbl>     <dbl>
#> 1    0.7     7    NA        0.5     NA        65  
#> 2    0.8     8    NA        0.5     NA        65.6
#> 3    0.9     9    70        0.5     NA        75.6
#> 4    1      10    80        0.6     74.7      84.6
#> 5    1.1    11    90        0.7     85.6      NA  
#> 6    1.2    12    NA        0.7     95.7      NA  
#> 7    1.3    13    NA        0.7     95.2      NA

Coalesce costs

Note that here I'm giving priority to lag_cost over lead_cost, but that's completely arbitrary and you may want to justify it. It might be more balanced to get the mean of the two where both are available, but that's out of scope for this answer.

mutate(df, cost = coalesce(cost, lag_cost, lead_cost)) |> 
  select(-lag_cost, -lead_cost)
#> # A tibble: 7 × 4
#>   factor items  cost elasticity
#>    <dbl> <dbl> <dbl>      <dbl>
#> 1    0.7     7  65          0.5
#> 2    0.8     8  65.6        0.5
#> 3    0.9     9  70          0.5
#> 4    1      10  80          0.6
#> 5    1.1    11  90          0.7
#> 6    1.2    12  95.7        0.7
#> 7    1.3    13  95.2        0.7

Created on 2022-10-25 with reprex v2.0.2

CodePudding user response:

To do the rolling fill in one pass with dplyr:

library(dplyr)

df <- tibble(
  factor = seq(0.7,1.3, 0.1),
  items = c(7, 8, 9, 10, 11, 12, 13),
  cost = c(NA, NA, 70, 80, 90, NA, NA),
  elasticity = c(NA, NA, 0.5, 0.6, 0.7, NA, NA)
)

df %>%
    tidyr::fill(elasticity, .direction = 'updown') ->
    df

df <- df %>% 
  group_by(na_grp = with(
    rle(is.na(cost) | is.na(lead(cost))), 
    rep(seq_along(lengths),lengths))) %>% 
  mutate(cost = if_else(is.na(cost), head(cost,1) * cumprod(
    if_else(is.na(cost), (items / lag(items)-1) * elasticity   1, 1)),cost)) %>% 
  group_by(na_grp = with(
    rle(is.na(cost) | is.na(lag(cost))), 
    rep(seq_along(lengths),lengths))) %>% 
  mutate(cost = if_else(is.na(cost), tail(cost,1) / rev(cumprod(rev(
     if_else(is.na(cost), (lead(items) / items - 1) * lead(elasticity)   1, 1)))),cost)) %>% 
  ungroup() %>% 
  select(-na_grp)

df
#> # A tibble: 7 × 4
#>   factor items  cost elasticity
#>    <dbl> <dbl> <dbl>      <dbl>
#> 1    0.7     7  61.5        0.5
#> 2    0.8     8  65.9        0.5
#> 3    0.9     9  70          0.5
#> 4    1      10  80          0.6
#> 5    1.1    11  90          0.7
#> 6    1.2    12  95.7        0.7
#> 7    1.3    13 101.         0.7
  • Related