How to filter in multiple columns in R if the category starts with specific letter in R?-CodePudding

I have a huge dataset with over 2 million obs and 100 columns. There are 9 columns that contain ICD-10 disease codes that each of them start with a different letter.

For example:

icd1 <- c("O230", "B540", "D990", "Y555", "E980", "J777", "P090", "Q090", "R433")

icd2 <- c("O230", "B540", "D990", "Y555", "E980", "J777", "P090", "Q090", "Z433")

icd3 <- c("X230", "B540", "D990", "Y555", "E980", "J777", "P090", "Q090", "Z433")

data <- as.data.frame(rbind(icd1, icd2, icd3))

What I need to do is create a new column called "ICD_Z" that checks every one of those 9 ICD-10 columns codes to see if any of them starts with the letter Z, so if that's the case, the column I created would receive 1, or else, 0. But it only can receive 1 if all those 9 columns don't have an ICD-10 code that starts with the letter O.

So my output would look like this:

How can I do that?

CodePudding user response：

You could use if_any and if_all:

data %>%
  mutate(ICD_Z = if_any(V1:V9, ~grepl('^Z', .)) * 
           if_all(V1:V9, ~!grepl('^O', .)))

       V1   V2   V3   V4   V5   V6   V7   V8   V9 ICD_Z
icd1 O230 B540 D990 Y555 E980 J777 P090 Q090 R433     0
icd2 O230 B540 D990 Y555 E980 J777 P090 Q090 Z433     0
icd3 X230 B540 D990 Y555 E980 J777 P090 Q090 Z433     1

Edit:

You can use a variable to store the positions of the columns in question:

nms <- c(1,2,3,5,7,8)
data %>%
  mutate(ICD_Z = if_any(all_of(nms), ~grepl('^Z', .)) * 
           if_all(all_of(nms), ~!grepl('^O', .)))

CodePudding user response：

With a combination of rowwise() and c_across() it's very straightforward:

library(dplyr)

icd1 <- c("O230", "B540", "D990", "Y555", "E980", "J777", "P090", "Q090", "R433")

icd2 <- c("O230", "B540", "D990", "Y555", "E980", "J777", "P090", "Q090", "Z433")

icd3 <- c("X230", "B540", "D990", "Y555", "E980", "J777", "P090", "Q090", "Z433")

data <- as.data.frame(rbind(icd1, icd2, icd3))

data %>% 
  rowwise() %>% 
  mutate(has_z = any(startsWith(c_across(V1:V9), "Z")),
         has_o = any(startsWith(c_across(V1:V9), "O")),
         icdz = as.numeric(has_z & !has_o)
  ) %>% 
  select(-has_z, -has_o)

#> # A tibble: 3 × 10
#> # Rowwise: 
#>   V1    V2    V3    V4    V5    V6    V7    V8    V9     icdz
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 O230  B540  D990  Y555  E980  J777  P090  Q090  R433      0
#> 2 O230  B540  D990  Y555  E980  J777  P090  Q090  Z433      0
#> 3 X230  B540  D990  Y555  E980  J777  P090  Q090  Z433      1

^{Created on 2022-05-25 by the reprex package (v2.0.1)}

CodePudding user response：

Update: an even faster approach:

icd_cols = paste0("V",1:9)
f <- function(v) !any(v=="O") & any(v=="Z")
setDT(data)
set(data, j="icdz", value= apply(data[,..icd_cols],1,\(i) f(substr(i,1,1))))

Original Solution

It is possible with your large dataset you will have some speed up with this approach, that uses data.table. On the small 3 row example, I found informally this to be about 5 times faster.

identify the list of 9 ICD columns; I've used a simple construction here to get V1:V9, but you could use icd_cols = colnames(data)[c(20,21,37:45)]
create a small function that evaluates a vector of first letters
set the original frame as data.table
melt and apply the function by id, and assign the result to the original frame

icd_cols = paste0("V",1:9)
f <- function(v) !any(v=="O") & any(v=="Z")
setDT(data)
data[, icdz:=(melt(data[,id:=.I],"id",icd_cols)[,f(substr(value,1,1)), by=id]$V1)][id:=NULL]

Output:

     V1   V2   V3   V4   V5   V6   V7   V8   V9  icdz
1: O230 B540 D990 Y555 E980 J777 P090 Q090 R433 FALSE
2: O230 B540 D990 Y555 E980 J777 P090 Q090 Z433 FALSE
3: X230 B540 D990 Y555 E980 J777 P090 Q090 Z433  TRUE