Modify multiple data sets almost the same way with purrr walk(), get() & assign()-CodePudding

I have two very similar data sets like this (simplified) one:

library(tidyverse)

dataset <- tribble(
  ~patient, ~tumor, ~trt_date, ~fup_date, ~system,
  001, "t1", "2022-01-01", "2022-05-05", NA,
  001, "t1", "2022-01-01", "2022-05-05", NA,
  001, "t1", "2022-01-01", "2022-05-05", NA,
  002, "t1", "2022-02-02", "2022-07-07", 2,
  002, "t1", "2022-02-02", "2022-07-07", 2,
  002, "t2", "2022-02-02", "2022-07-07", 2,
  002, "t2", "2022-02-02", "2022-07-07", 2,
  002, "t2", "2022-02-02", "2022-07-07", 2,
  003, "t1", "2022-01-01", "2022-05-05", 1,
  003, "t2", "2022-06-06", "2022-07-07", 1,
  003, "t3", "2022-06-06", "2022-08-08", 1,
  004, "t1", "2022-05-05", "2022-07-07", NA,
  004, "t1", "2022-05-05", "2022-07-07", NA,
  004, "t2", "2022-11-11", "2022-12-12", NA,
  004, "t2", "2022-11-11", "2022-12-12", NA,
  005, "t1", "2022-02-02", "2022-09-09", 2,
  005, "t1", "2022-02-02", "2022-09-09", 2,
  005, "t1", "2022-02-02", "2022-09-09", 2,
  005, "t2", "2022-05-05", "2022-07-07", NA,
  005, "t3", "2022-10-10", "2022-11-11", NA,
  005, "t3", "2022-10-10", "2022-11-11", NA,
  006, "t1", NA, "2022-11-11", 2,
  006, "t2", NA, "2022-11-11", 2
)

and a filtered version:

dataset_system <- dataset %>% 
  filter(!is.na(system))

I'd like to modify them in mostly the same way except a few steps like grouping them in a different way just ahead of using distinct() before continuing with steps that apply to both data sets again. I think I could do this with map() but the result would be a list containing both dataframes instead of them staying seperate entities in the environment. So I tried walk() in combination with get() and assign() but I can't get any conditional operation to work within this block to execute the steps where they should be treated differently.

Attempt A:

  .x = c("dataset", "dataset_system"),
  .f = function(df_name) {
    df <- get(df_name, envir = .GlobalEnv)
    
    df <- df %>% 
      filter(!is.na(trt_date)) %>% 
      if(df_name == "dataset") {
        group_by(patient)
      } else {
        group_by(patient, tumor)} %>% 
      distinct() %>% 
      ungroup()
    
    new_df <- paste(df_name, "system", sep = "_")

    assign(new_df, df, envir = .GlobalEnv)
  }
)

Results in : Error in if (.) df_name == "dataset" else { : the condition has length > 1

Attempt B:

  .x = c("dataset", "dataset_system"),
  .f = function(df_name) {
    df <- get(df_name, envir = .GlobalEnv)
    
    df <- df %>% 
      filter(!is.na(trt_date)) %>% 
      when(dataset == "dataset" 
         ~ group_by(patient), 
         group_by(patient, tumor)) %>% 
      
      distinct() %>% 
      ungroup()
    
    new_df <- paste(df_name, "system", sep = "_")
    
    assign(new_df, df, envir = .GlobalEnv)
    }
  )

Gives: Error in group_by(patient, tumor) : object 'patient' not found

Is it just that my syntax is messed up or is this not the right way to do that kind of thing anyway? Thank you!

CodePudding user response：

It sounds like you should be using purrr::map (or variants) instead of purrr::walk, so you should just be passing the datasets in a list to purrr::map. This will mean you don't have to use get or assign at all.

In addition, you can make use of purrr::imap to get the names of the datasets, provided you supply a named list. Something like the following should work:

my_dataset_list <- list("dataset" = dataset,
                        "dataset_system" = dataset_system)

new_datasets <- purrr::imap(
  .x = my_dataset_list,
  .f = function(df, df_name) {
    
    new_df <- df %>%
      filter(!is.na(trt_date))
    
    if (df_name == "dataset") {
      new_df <- new_df %>%
        group_by(patient)
    } else {
      new_df <- new_df %>%
        group_by(patient, tumor)
    }
    
    new_df %>%
      distinct() %>% 
      ungroup()
    }
  )

new_datasets[["dataset"]]
new_datasets[["dataset_system"]]

CodePudding user response：

I got it working with imap, thank you for your help cnbrownlie!

Now I'm wondering if I could replace the if_else block with all those new_df <- new_df statements with the help of when() or case_when().

I'm trying something like this...

  .x = datasets,
  .f = function(df, df_name) {
    new_df <- df %>%
      filter(!is.na(trt_date) %>%
      
      when(df_name == "xy" 
           ~ group_by(.$patient, .$tumor),
           ~ group_by(.$patient)) %>% 
      
      distinct() %>%  
      ungroup() %>% 
      
      when(df_name == "xy" 
           ~ group_by(.$patient, .$tumor), 
           ~ group_by(.$patient)) %>% 
      
      type_convert() %>%
      summarise(var = n())
  }
)

... but get an Error in UseMethod("group_by") : no applicable method for 'group_by' applied to an object of class "c('integer', 'numeric')"

Would be even more elegant than with the if_else. Thanks for your advice!