I am interested in doing string detection over a set of columns. If that string (which in this case is ZSD
) is found, I want to return the column number/name. If multiple matches are found, I want to return the last column name/number with that string.
Input
My input is this:
a.zsd b.zsd c.zsd d.zsd
'ZSD' 'ZAD' NA 'ZAD'
'ZAD' NA NA 'ZSD'
NA NA 'ZAD' NA
'Not Achieved ZSD' NA 'ZAD' NA
'ZSD' 'ZSD' NA 'ZSD'
NA NA NA NA
Output
My required output is a new column zsd.level
:
a.zsd b.zsd c.zsd d.zsd zsd.level
'ZSD' 'ZAD' NA 'ZAD' a
'ZAD' NA NA 'ZSD' d
NA NA 'ZAD' NA NA
'Not Achieved ZSD' NA 'ZAD' NA a
'ZSD' 'ZSD' NA 'ZSD' d
NA NA NA NA NA
Info:
My data frame has over a hundred columns. I am interested in ONLY some of the columns having a name that ends at a string .zsd
. These columns can either have NA
or one of the following string values ZAD
, ZSD
, Not Achieved ZSD
.
I am just interested in detecting the presence of the string ZSD
. If not found in any of the columns, it should return NA in the output column (zsd.level
). If the string is found in multiple columns, I want to return the last column name/number that contains the string.
My question is similar to this post but not exactly the same dplyr filter with condition on multiple columns
dput
dput(df)
structure(list(a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA),
b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA),
c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA),
d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA)),
class = "data.frame", row.names = c(NA, -6L))
Partial Solution
To select those columns with names ending in .zsd
, I can do
library(stringr)
library(tidyverse)
df %>%
select(ends_with(".zsd"))
To select or filter rows with the string ZSD
, I can do
str_detect(., "ZSD"))
But how can I put multiple conditions together? Any help would be greatly appreciated.
CodePudding user response:
We could do it this way:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(contains("zsd"), ~case_when(str_detect(., "ZSD") ~ cur_column()), .names = 'new_{col}')) %>%
unite(zsd_level, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
mutate(zsd_level = str_remove_all(zsd_level, ".zsd"),
zsd_level = str_sub(zsd_level, -1))
output:
a.zsd b.zsd c.zsd d.zsd zsd_level
1 ZSD ZAD <NA> ZAD a
2 ZAD <NA> <NA> ZSD d
3 <NA> <NA> ZAD <NA>
4 Not Achieved ZSD <NA> ZAD <NA> a
5 ZSD ZSD <NA> ZSD d
6 <NA> <NA> <NA> <NA>
CodePudding user response:
Another option, but a little bit more complicated than dear TarJae's:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn) %>%
group_by(rn) %>%
filter(str_detect(value, "ZSD")) %>%
slice(1) %>%
summarise(name = str_remove(name, ".zsd")) %>%
right_join(df %>% mutate(rn = row_number()), by = "rn") %>%
arrange(rn) %>%
ungroup() %>%
select(ends_with("zsd"), zsd.level = name)
This returns
# A tibble: 6 x 5
a.zsd b.zsd c.zsd d.zsd zsd.level
<chr> <chr> <chr> <chr> <chr>
1 ZSD ZAD NA ZAD a
2 ZAD NA NA ZSD d
3 NA NA ZAD NA NA
4 Not Achieved ZSD NA ZAD NA a
5 ZSD ZSD NA ZSD a
6 NA NA NA NA NA