Home > other >  Is it possible to impute multiple values from a different dataframe dependent on a condition?
Is it possible to impute multiple values from a different dataframe dependent on a condition?

Time:12-09

Suppose I have the following dataframe...

# Starting dataframe
data <- tribble(
  ~ID, ~Excluded, ~colA, ~colB, ~colC, ~col_mean, ~varA, ~varB,
  "A", TRUE,      1,     1,     1,     1,         "X",   10,
  "B", FALSE,     NA,    2,     2,     NA,        "Y",   20,
  "C", FALSE,     3,     3,     3,     3,         "Z",   30
)

And a subsetted dataframe (i.e. fewer observations exist) where missing values have been imputed, e.g. ...

# Dataframe with imputed values
data_imputed <- tribble(
  ~ID, ~Excluded, ~colA, ~colB, ~colC, ~col_mean, ~varA, ~varB,
  "B", FALSE,     2,     2,     2,     2,         "Y",   20,
  "C", FALSE,     3,     3,     3,     3,         "Z",   30
)

How do I replace values in the original dataframe with those from the imputed dataframe when a particular column (e.g. col_mean) has a missing value?

Note: I don't want to replace the whole row with the row from the imputed dataframe, just a specified set of columns (e.g., in this case, those starting with "col").

The target dataframe would look like this...

# Target dataframe
data <- tribble(
  ~ID, ~Excluded, ~colA, ~colB, ~colC, ~col_mean, ~varA, ~varB,
  "A", TRUE,      1,     1,     1,     1,         "X",   10,
  "B", FALSE,     2,     2,     2,     2,         "Y",   20,
  "C", FALSE,     3,     3,     3,     3,         "Z",   30
)

I have tried to summarise the problem with this figure...

enter image description here

I need to do this for four or five sets of columns, so something where I can specify the condition (e.g. is.na(col_mean)) and the columns to use (using regex) would make things easier.

I tend to use tidyverse, so code that works with tidyverse syntax is preferred.

CodePudding user response:

You could use the rows_ family from dplyr. In this case, rows_patch() works well. It modifies existing rows by some key columns (i.e. ID), but only overwrites NA values.

library(dplyr)

rows_patch(data, data_imputed, by = "ID")

# # A tibble: 3 × 8
#   ID    Excluded  colA  colB  colC col_mean varA   varB
#   <chr> <lgl>    <dbl> <dbl> <dbl>    <dbl> <chr> <dbl>
# 1 A     TRUE         1     1     1        1 X        10
# 2 B     FALSE        2     2     2        2 Y        20
# 3 C     FALSE        3     3     3        3 Z        30

CodePudding user response:

left_join(data, data_imputed, by = "ID") %>%
  mutate(across(ends_with(".x"), ~ coalesce(., cur_data()[[sub("\\.x$", ".y", cur_column())]]))) %>%
  select(-ends_with(".y")) %>%
  rename_with(.fn = ~ sub("\\.x$", "", .))
# # A tibble: 3 x 8
#   ID    Excluded  colA  colB  colC col_mean varA   varB
#   <chr> <lgl>    <dbl> <dbl> <dbl>    <dbl> <chr> <dbl>
# 1 A     TRUE         1     1     1        1 X        10
# 2 B     FALSE        2     2     2        2 Y        20
# 3 C     FALSE        3     3     3        3 Z        30
  • Related