Home > Software engineering >  Impute missing values within each group based on an equation (in R)
Impute missing values within each group based on an equation (in R)

Time:04-06

My data are as follows:

df <- structure(list(year = c(2019L, 2019L, 2019L, 2019L, 2020L, 2020L, 
2020L, 2020L), site = c("A", "B", "C", "D", "A", "B", "C", "D"
), value = c(1000L, 900L, 700L, NA, 800L, 1200L, 1500L, NA), dist = c(10L, 
15L, 30L, 36L, 10L, 15L, 30L, 36L)), class = "data.frame", row.names = c(NA, 
-8L))

I would like to impute the NA values for site D in each year based on the following.

  1. Calculate the difference between the value at site A and the value at site B.

1000-900=100 (for year 2019 in my example data)

  1. Divide the difference between values at site A and site B by the difference between dist A and dist B.

100/5=20

  1. Caluculate the difference between the value at site B and the value at site C.

900-700=200

  1. Divide the difference between values at site B and site C by the difference between dist B and dist C.

200/15=13.33

  1. Average the outcomes from steps 2) and 4).

16.67

  1. Multiply the average by the difference between dist C and dist D.

16.67*6=100.02

  1. If the value at site A is greater than the value at site C, the NA value for site D = the outcome of step 6) subtracted from the value at site C.

In 2019, the value for Site D = 599.98

  1. If the value at site A is less than the value at site C, the NA value for site D = the outcome of step 6) added to the value at site C.

In 2020, the value for Site D = 1800

In my example data, the outcome I am looking for is this:

df <- structure(list(year = c(2019L, 2019L, 2019L, 2019L, 2020L, 2020L, 
2020L, 2020L), site = c("A", "B", "C", "D", "A", "B", "C", "D"
), value = c(1000L, 900L, 700L, 599.98L, 800L, 1200L, 1500L, 1800L), dist = c(10L, 
15L, 30L, 36L, 10L, 15L, 30L, 36L)), class = "data.frame", row.names = c(NA, 
-8L))

I have many years in my real data, so I would like to do this for each year. Rounding not necessary.

Thank you in advance!

CodePudding user response:

There are three steps here. The first is to code up your formula for imputation, the second is to apply it to each year of your data frame, and the third is to substitute the imputed values.

To do the first, I've created a function that takes in the sites, their values, and their dists, for a single year, and applies your logic:

impute_year <- function(site, value, dist) {
  diff_ab_val <- value[site == 'A'] - value[site == 'B']
  diff_bc_val <- value[site == 'B'] - value[site == 'C']
  
  diff_ab_dist <- dist[site == 'A'] - dist[site == 'B']
  diff_bc_dist <- dist[site == 'B'] - dist[site == 'C']
  diff_cd_dist <- dist[site == 'C'] - dist[site == 'D']
  
  normalized_diff_ab <- abs(diff_ab_val / diff_ab_dist)
  normalized_diff_bc <- abs(diff_bc_val / diff_bc_dist)
  
  avg_normalized_diff <- (normalized_diff_ab   normalized_diff_bc) / 2
  
  value_shift <- abs(avg_normalized_diff * diff_cd_dist)
  
  if (value[site == 'A'] < value[site == 'C']) {
    return(value[site == 'C']   value_shift)
  }
  return(value[site == 'C'] - value_shift)
}

From the way you did calculations, it seems like you always want to consider positive values (e.g. ignore the fact that dist at A might be be bigger than dist at B while value at A might be smaller), which is why I include calls to take the absolute value via abs(). I also made sure to check if there really is a missing value for site D -- otherwise, we leave it as it is (you can take out the first if statement if this is not the desired behavior).

The next step is to compute this for all years. This is relatively simple using dplyr::group_by() and dplyr::summarize(). The combination of the two essentially split up your data frame into many different ones according to the value of year, and then for each, apply the imputation function.

library(dplyr)
all_imputed_values <-
    df %>%
    group_by(year) %>%
    summarize(imputed = impute(site, value, dist)) 

The final step is to substitute the entries of all_imputed_values for the site D entries in the original data frame:

df$value[df$site == 'D'] <- imputed_values$imputed

Putting it all together in a function:

impute_all_years <- function(df) {
  imputed_values <- 
    df %>%
    group_by(year) %>%
    summarize(imputed = impute(site, value, dist))
  
  years <- filter(df, site == 'D')$year
  df$value[df$site == 'D'] <- imputed_values$imputed
  return(df)
}

The code is not robust to all possible things that may happen in your data (e.g. it assumes years are listed in order in your data, it will break if there is a year without a site D etc.), but without knowing more about your data, my objective was simply to provide an easy-to-read overview of how you may approach this type of task so that you can modify it as needed for your purposes.

  •  Tags:  
  • r
  • Related