I have a column that looks something like this:
col1
"business"
"BusinesS"
"education"
"some BUSINESS ."
"business of someone, that is cool"
" not the b word"
"busi ness"
"busines."
"businesses"
"something else"
And I need an efficient way of turning this string data into new values, like this:
col1 col2
NA 1
NA 1
"education" NA
NA 1
NA 1
" not the b word" NA
NA 1
NA 1
NA 1
"something else" NA
So the common denominator is "busines", but I don't know how to efficiently handle all the spaces, punctuation, upper/lower case, other words, etc. in one mutate that creates a new column.
CodePudding user response:
library(dplyr)
library(stringr)
df %>%
  mutate(col2 = ifelse(str_detect(col1, "(?i)busi\\s?ness?"),
                       1,
                       NA))
We can use ifelse to set 1 if str_detect detects any form of business, and NA if it doesn't. Note that (?i) makes the match case-insensitive and ? in \\s? and s? makes the preceding item optional; so \\s? matches an optional space and s? matches an optional literal s.
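The desired output in the question also sets col1 to NA wherever a match is found. A minimal sketch of that, assuming the sample data from the question is rebuilt as a tibble. The helper column hit and the name out are illustrative only and not part of the original answer, and regex(ignore_case = TRUE) is swapped in for the inline (?i) flag:

```r
library(dplyr)
library(stringr)

# Sample data copied from the question
df <- tibble(col1 = c("business", "BusinesS", "education", "some BUSINESS .",
                      "business of someone, that is cool", " not the b word",
                      "busi ness", "busines.", "businesses", "something else"))

# Same pattern as in the answer, with case-insensitivity set via regex()
pat <- regex("busi\\s?ness?", ignore_case = TRUE)

out <- df %>%
  mutate(hit  = str_detect(col1, pat),             # TRUE where the pattern is found
         col2 = if_else(hit, 1, NA_real_),         # flag matches with 1
         col1 = if_else(hit, NA_character_, col1)) %>%  # blank the matched text
  select(-hit)                                     # drop the illustrative helper
```

if_else is used here instead of ifelse because it checks that both branches have the same type, which is why the typed NA_real_ and NA_character_ appear.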
CodePudding user response:
You can remove all non-word characters using gsub and then use grepl to detect busines (the leading + converts the logical result to 0/1):
+grepl("busines", gsub("\\W+", "", s), ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
Another way would be to use agrepl for approximate string matching, where 1L gives the maximum distance to the given pattern.
+agrepl("busines", s, max.distance = 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
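To see what a maximum distance of 1L means for each string, base R's adist with partial = TRUE reports the approximate substring distance that agrep uses. A small sketch, with s redefined so the block runs by itself; the names d and hit are illustrative:

```r
# Sample data copied from the question
s <- c("business", "BusinesS", "education", "some BUSINESS .",
       "business of someone, that is cool", " not the b word",
       "busi ness", "busines.", "businesses", "something else")

# Edit distance between "busines" and the best-matching substring of each string
d <- adist("busines", s, partial = TRUE, ignore.case = TRUE)

# Strings that lie within one edit of the pattern
hit <- as.vector(d <= 1)
```

Strings that contain busines exactly sit at distance 0, while "busi ness" sits at distance 1 because of the single inserted space.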
agrepl can also be a solution in case you are looking for business instead of busines:
+agrepl("business", gsub("\\W+", "", s), max.distance = 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
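Any of these match vectors can be turned into the two columns requested in the question. A base R sketch, assuming the gsub/grepl approach with runs of non-word characters stripped via \\W+; hit and res are illustrative names, and s is redefined so the block runs by itself:

```r
# Sample data copied from the question
s <- c("business", "BusinesS", "education", "some BUSINESS .",
       "business of someone, that is cool", " not the b word",
       "busi ness", "busines.", "businesses", "something else")

# Strip spaces and punctuation, then look for the stem case-insensitively
hit <- grepl("busines", gsub("\\W+", "", s), ignore.case = TRUE)

# col1 keeps only the non-matching text; col2 flags the matches with 1
res <- data.frame(col1 = ifelse(hit, NA, s),
                  col2 = ifelse(hit, 1, NA),
                  stringsAsFactors = FALSE)
```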
Data:
s <- c("business","BusinesS","education","some BUSINESS .",
"business of someone, that is cool"," not the b word",
"busi ness","busines." ,"businesses","something else")