I have a large R data frame (>1 million records with >1,000 variables) that captures information about patient visits to hospitals. Included in this data frame are 30 different procedure statuses, locations, and codes: ProcedureStatus01, ProcedureLocation01, ProcedureCode01, ProcedureStatus02, ProcedureLocation02, ProcedureCode02, etc. I need to create a new set of 30 variables representing "final" procedure codes based on the following logic:
- Where the procedure status is "Complete" and the procedure location is "Onsite", the original procedure code is kept as the final procedure code; and
- Where the procedure status is not "Complete" (e.g. "Cancelled") or the procedure location is not "Onsite" (e.g. "Offsite"), the final procedure code variable created is blank.
Creating these final variable values will allow me to search completed, onsite procedure codes for specific codes and create new variables based on those final codes.
The following code (using only 5 records and 3 procedure statuses/locations/codes as an example) produces the results that I am expecting, but applying this simple method to my larger data frame (30 procedures) quickly becomes unmanageable and invites the risk of error from copying and pasting 30 times. Is there a preferred approach to produce the same results in a more efficient way? I have found similar questions regarding how to create a new variable using multiple other variable values, but not for many variable combinations at once.
# Load dplyr Package
library(dplyr)
# Create Sample Data Frame
ProcedureStatus01 <- c("Complete", "Cancelled", "Complete", "Scheduled", "Scheduled")
ProcedureLocation01 <- c("Offsite", "Onsite", "Onsite", "Offsite", "Onsite")
ProcedureCode01 <- c("A123", "A123", "A500", "B296", "C901")
ProcedureStatus02 <- c("Complete", "Complete", "Complete", "Scheduled", "Scheduled")
ProcedureLocation02 <- c("Onsite", "Onsite", "Onsite", "Onsite", "Onsite")
ProcedureCode02 <- c("Z562", "A500", "G164", "V413", "A123")
ProcedureStatus03 <- c("Cancelled", "Complete", "Complete", "Cancelled", "Cancelled")
ProcedureLocation03 <- c("Onsite", "Onsite", "Onsite", "Onsite", "Onsite")
ProcedureCode03 <- c("P524", "W412", "A123", "C901", "Z554")
DataFrame <- data.frame(ProcedureStatus01, ProcedureLocation01, ProcedureCode01, ProcedureStatus02, ProcedureLocation02, ProcedureCode02, ProcedureStatus03, ProcedureLocation03, ProcedureCode03)
# Add Completed Onsite Procedure Code Variables
DataFrame <- DataFrame %>% mutate(ProcedureCodeFinal01 = if_else(ProcedureStatus01 == "Complete" & ProcedureLocation01 == "Onsite", ProcedureCode01, ""))
DataFrame <- DataFrame %>% mutate(ProcedureCodeFinal02 = if_else(ProcedureStatus02 == "Complete" & ProcedureLocation02 == "Onsite", ProcedureCode02, ""))
DataFrame <- DataFrame %>% mutate(ProcedureCodeFinal03 = if_else(ProcedureStatus03 == "Complete" & ProcedureLocation03 == "Onsite", ProcedureCode03, ""))
CodePudding user response:
Here's a solution that pivots the data into a longer format (more rows rather than more columns), computes "Final" status, and then pivots the data back into a wider format. I've run this on a simulated data frame of 90 columns and 1m rows, and it seems to complete in a few seconds (though your own machine's performance may vary).
First, reformat the data:
library(tidyverse)
df_long <- DataFrame %>%
mutate(row_id = row_number()) %>%
pivot_longer(-row_id, names_pattern = 'Procedure(Code|Status|Location)(\\d+)', names_to = c('.value', 'proc_num'))
row_id proc_num Status Location Code
<int> <chr> <chr> <chr> <chr>
1 1 01 Complete Offsite A123
2 1 02 Complete Onsite Z562
3 1 03 Cancelled Onsite P524
4 2 01 Cancelled Onsite A123
5 2 02 Complete Onsite A500
6 2 03 Complete Onsite W412
7 3 01 Complete Onsite A500
8 3 02 Complete Onsite G164
9 3 03 Complete Onsite A123
10 4 01 Scheduled Offsite B296
...
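To see what the names_pattern regular expression is doing, here is a small base R sketch that applies the same two capture groups to a few column names. It assumes the digits group is meant to be `\\d+` (one or more digits); the column names and the `split_names` data frame are only illustrative.

```r
# Hypothetical column names following the question's naming scheme
cols <- c("ProcedureStatus01", "ProcedureLocation02", "ProcedureCode30")
pattern <- "Procedure(Code|Status|Location)(\\d+)"

# regexec() returns the full match followed by one entry per capture group
parts <- regmatches(cols, regexec(pattern, cols))

# Group 1 is the field name (used for '.value'), group 2 the procedure number
split_names <- data.frame(
  column   = cols,
  value    = vapply(parts, `[`, character(1), 2),
  proc_num = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(split_names)
```

The first group becomes the new column name (Status, Location, Code) and the second group becomes the proc_num value, which is why the long table has one row per visit and procedure number.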
Then compute "Final" status and pivot this data back into your original format:
df_complete <- df_long %>%
mutate(
Final = ifelse(Status == 'Complete' & Location == 'Onsite', Code, '')
) %>%
pivot_wider(names_from = proc_num, values_from = c(Status, Location, Code, Final), names_glue = 'Procedure{.value}{proc_num}')
row_id ProcedureSt…¹ Proce…² Proce…³ Proce…⁴ Proce…⁵ Proce…⁶ Proce…⁷ Proce…⁸ Proce…⁹ Proce…˟ Proce…˟ Proce…˟
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Complete Comple… Cancel… Offsite Onsite Onsite A123 Z562 P524 "" "Z562" ""
2 2 Cancelled Comple… Comple… Onsite Onsite Onsite A123 A500 W412 "" "A500" "W412"
3 3 Complete Comple… Comple… Onsite Onsite Onsite A500 G164 A123 "A500" "G164" "A123"
4 4 Scheduled Schedu… Cancel… Offsite Onsite Onsite B296 V413 C901 "" "" ""
5 5 Scheduled Schedu… Cancel… Onsite Onsite Onsite C901 A123 Z554 "" "" ""
# … with abbreviated variable names ¹ProcedureStatus01, ²ProcedureStatus02, ³ProcedureStatus03,
# ⁴ProcedureLocation01, ⁵ProcedureLocation02, ⁶ProcedureLocation03, ⁷ProcedureCode01, ⁸ProcedureCode02,
# ⁹ProcedureCode03, ˟ProcedureFinal01, ˟ProcedureFinal02, ˟ProcedureFinal03
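If the end goal is to flag visits that contain a particular completed, onsite code, it may be simpler to stay in the long format rather than pivot back. The following is a rough sketch on made-up data shaped like df_long; `df_long_demo`, `target_code`, and `HasTargetCode` are names invented for this example.

```r
library(dplyr)

# Small long-format table shaped like df_long (values are made up)
df_long_demo <- tibble(
  row_id   = c(1L, 1L, 2L, 2L, 3L, 3L),
  proc_num = c("01", "02", "01", "02", "01", "02"),
  Status   = c("Complete", "Complete", "Cancelled", "Complete", "Complete", "Scheduled"),
  Location = c("Offsite", "Onsite", "Onsite", "Onsite", "Onsite", "Onsite"),
  Code     = c("A123", "Z562", "A123", "A500", "A123", "C901")
)

target_code <- "A123"  # hypothetical code of interest

visit_flags <- df_long_demo %>%
  # Same rule as above: keep the code only for completed, onsite procedures
  mutate(Final = if_else(Status == "Complete" & Location == "Onsite", Code, "")) %>%
  group_by(row_id) %>%
  # One row per visit: TRUE if any final code equals the target
  summarise(HasTargetCode = any(Final == target_code), .groups = "drop")

print(visit_flags)
```

The resulting one-row-per-visit table can then be joined back to the original data frame by row_id if the flag is needed alongside the wide columns.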
CodePudding user response:
Yes, this is possible with across(). Since all the columns follow a naming pattern, we can build the function around that pattern. I iterate over the Status columns and look up the matching Location and Code columns from each Status column's name.
library(dplyr)
library(stringr)
DataFrame %>%
mutate(
across(
.cols = contains("Status"),
.fns = ~if_else(
condition = . == "Complete" & get(sub(x = cur_column(),"Status","Location")) == "Onsite",
true = get(sub(x = cur_column(),"Status","Code")),
false = ""
),
.names = "{.col}_aux"
)
) %>%
rename_with(.cols = ends_with("_aux"),.fn = ~str_remove(sub(x = .,"Status","CodeFinal"),"_aux"))
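For comparison, the same pairing-by-name idea can be sketched in base R with a loop over the numeric suffixes. This is not part of the answer above, just one possible alternative under the assumption that every ProcedureStatusNN column has matching ProcedureLocationNN and ProcedureCodeNN columns; `visits`, `suffixes`, and `keep` are names made up for this example.

```r
# Two-procedure sample in the question's layout (values are illustrative)
visits <- data.frame(
  ProcedureStatus01   = c("Complete", "Cancelled", "Complete"),
  ProcedureLocation01 = c("Offsite", "Onsite", "Onsite"),
  ProcedureCode01     = c("A123", "A123", "A500"),
  ProcedureStatus02   = c("Complete", "Complete", "Scheduled"),
  ProcedureLocation02 = c("Onsite", "Onsite", "Onsite"),
  ProcedureCode02     = c("Z562", "A500", "G164"),
  stringsAsFactors = FALSE
)

# Derive the suffixes ("01", "02", ...) from the status column names
status_cols <- grep("^ProcedureStatus[0-9]+$", names(visits), value = TRUE)
suffixes <- sub("^ProcedureStatus", "", status_cols)

for (s in suffixes) {
  # TRUE only where the procedure was completed onsite
  keep <- visits[[paste0("ProcedureStatus", s)]] == "Complete" &
    visits[[paste0("ProcedureLocation", s)]] == "Onsite"
  # Keep the original code where TRUE, blank otherwise
  visits[[paste0("ProcedureCodeFinal", s)]] <-
    ifelse(keep, visits[[paste0("ProcedureCode", s)]], "")
}

print(visits[, c("ProcedureCodeFinal01", "ProcedureCodeFinal02")])
```

Note that if a status or location is NA, the comparison can yield NA and ifelse() will then return NA rather than a blank, so missing values may need separate handling.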