Home > Software design >  Use mutate case_when() in a specific range of columns in dplyr
Use mutate case_when() in a specific range of columns in dplyr

Time:03-12

I have a large data frame that looks like the df1. I want to search in the range of columns between col2:col4 (col2,col3,col4)if any character contains the string S.

library(tidyverse)

df <- tibble(position=c(100,200,300),
             correction=c("62M89S", 
                          "8M1D55M88S",
                          "1S25M1S36M89S"))

df1 <- df %>% 
  separate(correction, into = str_c("col", 1:5), 
           sep = "(?<=\\D)(?=\\d)", fill = "left", remove = FALSE)

df1
#> # A tibble: 3 × 7
#>   position correction    col1  col2  col3  col4  col5 
#>      <dbl> <chr>         <chr> <chr> <chr> <chr> <chr>
#> 1      100 62M89S        <NA>  <NA>  <NA>  62M   89S  
#> 2      200 8M1D55M88S    <NA>  8M    1D    55M   88S  
#> 3      300 1S25M1S36M89S 1S    25M   1S    36M   89S

Created on 2022-03-12 by the reprex package (v2.0.1)

I want my data to look like this

df1
#>   position correction    col1  col2  col3  col4  col5     inner_S
#>      <dbl> <chr>         <chr> <chr> <chr> <chr> <chr>      
#> 1      100 62M89S        <NA>  <NA>  <NA>  62M   89S         NO 
#> 2      200 8M1D55M88S    <NA>  8M    1D    55M   88S         NO
#> 3      300 1S25M1S36M89S 1S    25M   1S    36M   89S         YES

since there is the character 1S in col3.

I tried this, but I am sure I am violating some properties

df1 %>% 
  mutate_at(vars(col2:col4),
                             ~inner_S=case_when(grepl("S",.) ~ "Yes", 
                              TRUE ~ "No"
                            ))

CodePudding user response:

dplyr's c_across is very handy for operations like this:

df1 %>% 
  rowwise() %>% 
  mutate(inner_S = ifelse(any(grepl('S', c_across(col1:col4))), 'YES', 'NO'))

  position correction    col1  col2  col3  col4  col5  inner_S
     <dbl> <chr>         <chr> <chr> <chr> <chr> <chr> <chr>  
1      100 62M89S        NA    NA    NA    62M   89S   NO     
2      200 8M1D55M88S    NA    8M    1D    55M   88S   NO     
3      300 1S25M1S36M89S 1S    25M   1S    36M   89S   YES    

CodePudding user response:

Please find another possible solution using data.table

Reprex

  • Code
library(data.table)

setDT(df1)[, inner_S := apply(.SD, 1, function(x) fifelse(any(grepl("S", x)), "YES", "NO")), .SDcols = paste0("col", 2:4)][]
  • Output
#>    position    correction col1 col2 col3 col4 col5 inner_S
#> 1:      100        62M89S <NA> <NA> <NA>  62M  89S      NO
#> 2:      200    8M1D55M88S <NA>   8M   1D  55M  88S      NO
#> 3:      300 1S25M1S36M89S   1S  25M   1S  36M  89S     YES

Created on 2022-03-12 by the reprex package (v2.0.1)

CodePudding user response:

Answer by jdobres is perfect. Here is a more complicated one using across:

library(tidyverse)

df1 %>% 
  mutate(across(col1:col4, ~ifelse(
    str_detect(., 'S'), TRUE, FALSE), .names = 'new_{col}')) %>% 
  unite(inner_S, starts_with('new'), na.rm = TRUE, sep = ' ') %>% 
  mutate(inner_S = ifelse(str_detect(inner_S, 'TRUE'), "YES", "NO"))
  position correction    col1  col2  col3  col4  col5  inner_S
     <dbl> <chr>         <chr> <chr> <chr> <chr> <chr> <chr>  
1      100 62M89S        NA    NA    NA    62M   89S   NO     
2      200 8M1D55M88S    NA    8M    1D    55M   88S   NO     
3      300 1S25M1S36M89S 1S    25M   1S    36M   89S   YES    

CodePudding user response:

An alternative to using rowwise approach which can be quite costly when working with larger data sets is to sum the TRUE values.

df1 %>%
      mutate(inner_S = ifelse(rowSums(across(col1:col4, str_detect, "S"), na.rm = T) > 1, "YES", "NO"))
  • Related