How do I create many new variables in R, each of them based on multiple other variables?

Time:01-16

I have a large R data frame (>1 million records with >1,000 variables) that captures information about patient visits to hospitals. Included in this data frame are 30 different procedure statuses, locations, and codes: ProcedureStatus01, ProcedureLocation01, ProcedureCode01, ProcedureStatus02, ProcedureLocation02, ProcedureCode02, etc. I need to create a new set of 30 variables representing "final" procedure codes based on the following logic:

  • Where the procedure status is "Complete" and the procedure location is "Onsite", the original procedure code is kept as the final procedure code; and
  • Where the procedure status is not "Complete" (e.g. "Cancelled") or the procedure location is not "Onsite" (e.g. "Offsite"), the final procedure code variable is left blank.

Creating these final variable values will allow me to search completed, onsite procedure codes for specific codes and create new variables based on those final codes.

The following code (using only 5 records and 3 procedure statuses/locations/codes as an example) produces the results that I am expecting, but applying this simple method to my larger data frame (30 procedures) quickly becomes unmanageable and invites the risk of error from copying and pasting 30 times. Is there a preferred approach to produce the same results in a more efficient way? I have found similar questions regarding how to create a new variable using multiple other variable values, but not for many variable combinations at once.

# Load dplyr Package

     library(dplyr)


# Create Sample Data Frame

     ProcedureStatus01 <- c("Complete", "Cancelled", "Complete", "Scheduled", "Scheduled")
     ProcedureLocation01 <- c("Offsite", "Onsite", "Onsite", "Offsite", "Onsite")
     ProcedureCode01 <- c("A123", "A123", "A500", "B296", "C901")
      
     ProcedureStatus02 <- c("Complete", "Complete", "Complete", "Scheduled", "Scheduled")
     ProcedureLocation02 <- c("Onsite", "Onsite", "Onsite", "Onsite", "Onsite")
     ProcedureCode02 <- c("Z562", "A500", "G164", "V413", "A123")
    
     ProcedureStatus03 <- c("Cancelled", "Complete", "Complete", "Cancelled", "Cancelled")
     ProcedureLocation03 <- c("Onsite", "Onsite", "Onsite", "Onsite", "Onsite")
     ProcedureCode03 <- c("P524", "W412", "A123", "C901", "Z554")

     DataFrame <- data.frame(ProcedureStatus01, ProcedureLocation01, ProcedureCode01, ProcedureStatus02, ProcedureLocation02, ProcedureCode02, ProcedureStatus03, ProcedureLocation03, ProcedureCode03)
    
    
# Add Completed Onsite Procedure Code Variables
    
     DataFrame <- DataFrame %>% mutate(ProcedureCodeFinal01 = if_else(ProcedureStatus01 == "Complete" & ProcedureLocation01 == "Onsite", ProcedureCode01, ""))
    
     DataFrame <- DataFrame %>% mutate(ProcedureCodeFinal02 = if_else(ProcedureStatus02 == "Complete" & ProcedureLocation02 == "Onsite", ProcedureCode02, ""))
    
     DataFrame <- DataFrame %>% mutate(ProcedureCodeFinal03 = if_else(ProcedureStatus03 == "Complete" & ProcedureLocation03 == "Onsite", ProcedureCode03, ""))

CodePudding user response:

Here's a solution that pivots the data into a longer format (more rows rather than more columns), computes "Final" status, and then pivots the data back into a wider format. I've run this on a simulated data frame of 90 columns and 1m rows, and it seems to complete in a few seconds (though your own machine's performance may vary).

First, reformat the data:

library(tidyverse)

df_long <- DataFrame %>% 
  mutate(row_id = row_number()) %>% 
  pivot_longer(-row_id, names_pattern = 'Procedure(Code|Status|Location)(\\d+)', names_to = c('.value', 'proc_num'))

   row_id proc_num Status    Location Code 
    <int> <chr>    <chr>     <chr>    <chr>
 1      1 01       Complete  Offsite  A123 
 2      1 02       Complete  Onsite   Z562 
 3      1 03       Cancelled Onsite   P524 
 4      2 01       Cancelled Onsite   A123 
 5      2 02       Complete  Onsite   A500 
 6      2 03       Complete  Onsite   W412 
 7      3 01       Complete  Onsite   A500 
 8      3 02       Complete  Onsite   G164 
 9      3 03       Complete  Onsite   A123 
10      4 01       Scheduled Offsite  B296 
...

Then compute "Final" status and pivot this data back into your original format:

df_complete <- df_long %>% 
  mutate(
    Final = ifelse(Status == 'Complete' & Location == 'Onsite', Code, '')
  ) %>% 
  pivot_wider(names_from = proc_num, values_from = c(Status, Location, Code, Final), names_glue = 'Procedure{.value}{proc_num}')

  row_id ProcedureSt…¹ Proce…² Proce…³ Proce…⁴ Proce…⁵ Proce…⁶ Proce…⁷ Proce…⁸ Proce…⁹ Proce…˟ Proce…˟ Proce…˟
   <int> <chr>         <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1      1 Complete      Comple… Cancel… Offsite Onsite  Onsite  A123    Z562    P524    ""      "Z562"  ""     
2      2 Cancelled     Comple… Comple… Onsite  Onsite  Onsite  A123    A500    W412    ""      "A500"  "W412" 
3      3 Complete      Comple… Comple… Onsite  Onsite  Onsite  A500    G164    A123    "A500"  "G164"  "A123" 
4      4 Scheduled     Schedu… Cancel… Offsite Onsite  Onsite  B296    V413    C901    ""      ""      ""     
5      5 Scheduled     Schedu… Cancel… Onsite  Onsite  Onsite  C901    A123    Z554    ""      ""      ""     
# … with abbreviated variable names ¹​ProcedureStatus01, ²​ProcedureStatus02, ³​ProcedureStatus03,
#   ⁴​ProcedureLocation01, ⁵​ProcedureLocation02, ⁶​ProcedureLocation03, ⁷​ProcedureCode01, ⁸​ProcedureCode02,
#   ⁹​ProcedureCode03, ˟​ProcedureFinal01, ˟​ProcedureFinal02, ˟​ProcedureFinal03
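
Since the question mentions searching the completed onsite codes for specific values, that search can also be done while the data is still in long format, without pivoting back. The following is only a minimal sketch: the `visits` data frame is a cut-down stand-in for the question's `DataFrame`, and the code "A500" and the column name `HasA500` are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data: 3 visits, 2 procedures each
visits <- data.frame(
  ProcedureStatus01   = c("Complete", "Cancelled", "Complete"),
  ProcedureLocation01 = c("Offsite", "Onsite", "Onsite"),
  ProcedureCode01     = c("A123", "A123", "A500"),
  ProcedureStatus02   = c("Complete", "Complete", "Complete"),
  ProcedureLocation02 = c("Onsite", "Onsite", "Onsite"),
  ProcedureCode02     = c("Z562", "A500", "G164")
)

# Same reshaping step as above: one row per visit and procedure number
visits_long <- visits %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(
    -row_id,
    names_pattern = "Procedure(Code|Status|Location)(\\d+)",
    names_to = c(".value", "proc_num")
  )

# One flag per visit: TRUE if any completed onsite procedure has code "A500"
has_a500 <- visits_long %>%
  group_by(row_id) %>%
  summarise(
    HasA500 = any(Status == "Complete" & Location == "Onsite" & Code == "A500")
  )

has_a500$HasA500
# FALSE for visit 1, TRUE for visits 2 and 3
```

The resulting flag has one row per `row_id`, so it could be joined back onto the wide data frame if the original layout is needed.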

CodePudding user response:

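Yes, this is possible with across(). Because all of the column names follow the same pattern, the matching Location and Code columns can be derived from the name of each Status column. The code below iterates over the Status columns, uses cur_column() to look up the matching Location and Code columns, and then renames the new columns to ProcedureCodeFinalNN.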

library(dplyr)
library(stringr)

DataFrame %>% 
  mutate(
    across(
      .cols = contains("Status"),
      .fns = ~if_else(
        condition = . == "Complete" & get(sub(x = cur_column(),"Status","Location")) == "Onsite",
        true = get(sub(x = cur_column(),"Status","Code")),
        false = ""
      ),
      .names = "{.col}_aux"
    )
  ) %>% 
  rename_with(.cols = ends_with("_aux"),.fn = ~str_remove(sub(x = .,"Status","CodeFinal"),"_aux"))
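
Note that the pipeline above prints its result rather than storing it, so it would need to be assigned to keep the new columns. Once the ProcedureCodeFinalNN columns exist, the search described in the question could be done with `if_any()` (available in dplyr 1.0.4 and later). This is a minimal sketch: `final_codes` is a hand-built stand-in for the output above, and the code "A500" and the column name `HasA500` are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the final-code columns produced above
final_codes <- data.frame(
  ProcedureCodeFinal01 = c("", "", "A500"),
  ProcedureCodeFinal02 = c("Z562", "A500", "G164")
)

# Flag visits where any final procedure code equals "A500"
flagged <- final_codes %>%
  mutate(HasA500 = if_any(starts_with("ProcedureCodeFinal"), ~ .x == "A500"))

flagged$HasA500
# FALSE for row 1, TRUE for rows 2 and 3
```

Because `starts_with("ProcedureCodeFinal")` selects by name, the same call should cover all 30 final-code columns without listing them individually.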