Home > Blockchain >  String matching over multiple columns with specific string names
String matching over multiple columns with specific string names

Time:10-18

I am interested in doing string detection over a set of columns. If that string (which in this case is ZSD) is found, I want to return the column number/name. If multiple matches are found, I want to return the last column name/number with that string.

Input

My input is this:

a.zsd                b.zsd   c.zsd   d.zsd
'ZSD'                'ZAD'   NA      'ZAD'
'ZAD'                NA      NA      'ZSD'
NA                   NA      'ZAD'   NA
'Not Achieved ZSD'   NA      'ZAD'   NA
'ZSD'                'ZSD'   NA      'ZSD'
NA                   NA      NA      NA

Output

My required output is a new column zsd.level:

a.zsd                b.zsd   c.zsd   d.zsd zsd.level
'ZSD'                'ZAD'   NA      'ZAD'    a
'ZAD'                NA      NA      'ZSD'    d
NA                   NA      'ZAD'   NA       NA
'Not Achieved ZSD'   NA      'ZAD'   NA       a
'ZSD'                'ZSD'   NA      'ZSD'    d
NA                   NA      NA      NA       NA

Info:

My data frame has over a hundred columns. I am interested in ONLY some of the columns having a name that ends at a string .zsd. These columns can either have NA or one of the following string values ZAD, ZSD, Not Achieved ZSD.

I am just interested in detecting the presence of the string ZSD. If not found in any of the columns, it should return NA in the output column (zsd.level). If the string is found in multiple columns, I want to return the last column name/number that contains the string.

My question is similar to this post but not exactly the same dplyr filter with condition on multiple columns

dput

dput(df)

structure(list(a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA), 
               b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA), 
               c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA), 
               d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA)), 
               class = "data.frame", row.names = c(NA, -6L))

Partial Solution

To select those columns with names ending in .zsd, I can do

library(stringr)
library(tidyverse)

df %>%
  select(ends_with(".zsd"))

To select or filter rows with the string ZSD, I can do

str_detect(., "ZSD"))

But how can I put multiple conditions together? Any help would be greatly appreciated.

CodePudding user response:

We could do it this way:

library(dplyr)
library(tidyr)
library(stringr)

df %>%  
  mutate(across(contains("zsd"), ~case_when(str_detect(., "ZSD") ~ cur_column()), .names = 'new_{col}')) %>%
  unite(zsd_level, starts_with('new'), na.rm = TRUE, sep = ' ') %>% 
  mutate(zsd_level = str_remove_all(zsd_level, ".zsd"),
         zsd_level = str_sub(zsd_level, -1))

output:

 a.zsd b.zsd c.zsd d.zsd zsd_level
1              ZSD   ZAD  <NA>   ZAD         a
2              ZAD  <NA>  <NA>   ZSD         d
3             <NA>  <NA>   ZAD  <NA>          
4 Not Achieved ZSD  <NA>   ZAD  <NA>         a
5              ZSD   ZSD  <NA>   ZSD         d
6             <NA>  <NA>  <NA>  <NA>          

CodePudding user response:

Another option, but a little bit more complicated than dear TarJae's:

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  mutate(rn = row_number()) %>% 
  pivot_longer(-rn) %>% 
  group_by(rn) %>% 
  filter(str_detect(value, "ZSD")) %>% 
  slice(1) %>% 
  summarise(name = str_remove(name, ".zsd")) %>% 
  right_join(df %>% mutate(rn = row_number()), by = "rn") %>%
  arrange(rn) %>% 
  ungroup() %>% 
  select(ends_with("zsd"), zsd.level = name)

This returns

# A tibble: 6 x 5
  a.zsd            b.zsd c.zsd d.zsd zsd.level
  <chr>            <chr> <chr> <chr> <chr>    
1 ZSD              ZAD   NA    ZAD   a        
2 ZAD              NA    NA    ZSD   d        
3 NA               NA    ZAD   NA    NA       
4 Not Achieved ZSD NA    ZAD   NA    a        
5 ZSD              ZSD   NA    ZSD   a        
6 NA               NA    NA    NA    NA 
  • Related