How to replace values in specific rows and columns with NA using a reference table?-CodePudding

I need to set values to NA for specific columns and rows using a separate reference table and am not sure how.

I need to set the value to NA if the column name (field name) is not associated with the value in the column called event, using a reference table of event-fields mappings.

This is a simplified example data of what I've got. For the real data, I've ~900 rows and >300 columns to replace NAs in, and the columns are of different types.

df <- tibble::tribble(
~event, ~drug, ~status,
"referral", "drugA", 0,
"therapy", "drugA", 1
)

I have a reference table like below that says what fields are associated with each event.

event_fields <- tibble::tribble(
  ~unique_event_name, ~field_name,
  "referral", "record_id",
  "referral", "casetype",
  "therapy", "drug",
  "therapy", "status"
)

The output I’m trying to get is below e.g. drug and status are not fields associated with the referral event in the event_fields table above so they should get set to NA.

desired_result <- tibble::tribble(
~event_name, ~drug, ~status,
"referral", NA, NA,
"therapy", "drugA", 1
)

One thing I've tried is below (based on Replace multiple values in a dataframe with NA based on conditions given in another dataframe in R, the closest question I could find) but it doesn't work. I'm not sure how to use the event value for each row e.g. "referral" and the name of the field column e.g. "drug" in the filter() or if there's a better way to do this.

library(tidyverse)
df %>% mutate(across(drug:status,  
                     ~ replace(., !cur_column() %in% 
event_fields %>% filter(unique_event_name == event) %>% pull(field_name), 
NA) ))

which gives the error

Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
x no applicable method for 'filter' applied to an object of class "logical"

Any help will highly appreciated!

CodePudding user response：

As we are creating the logical based on 'event' column and the corresponding 'unique_event_name' on the same row of the 'field_name' that matches the column names (cur_column()), subset the 'unique_event_name' based on the logical on 'field_name' and then do the second logical on the 'event' to replace

library(dplyr)
df %>%
    mutate(across(drug:status, ~ replace(.,
     event != event_fields$unique_event_name[
           event_fields$field_name == cur_column()], NA)))

-output

# A tibble: 2 × 3
  event    drug  status
  <chr>    <chr>  <dbl>
1 referral <NA>      NA
2 therapy  drugA      1

CodePudding user response：

You may try this solution. Although it works on the toy example this might still fail depending on your real data.

The idea is to replace the non-matching fields and replace them with NA. First find the non-matching rows, then select the corresponding columns.

desired_result <- df
 
desired_result[ df$event != unique( 
   event_fields$unique_event_name[ event_fields$field_name %in% colnames( df )] 
   ), na.omit( match( event_fields$field_name, colnames( df ) ) ) ] <- NA

desired_result
# A tibble: 2 x 3
  event    drug  status
  <chr>    <chr>  <dbl>
1 referral NA        NA
2 therapy  drugA      1