I have an example project and need to search for strings using the stringr package. In the example, to eliminate other case spellings, I started with str_to_lower(example$remarks), which made the remarks all lower case. The remarks column describes residential properties.
I need to search for the word "shop". However, the word "shopping" is also in the remarks column and I don't want that word.
Some observations: a) have only the word "shop"; b) have only the word "shopping"; c) have neither the word "shop" nor "shopping"; d) have BOTH the words "shop" and "shopping".
When using str_detect(), I want it to give me a TRUE for detecting the word "shop", but I DO NOT want it to give me a TRUE for detecting the string "shop" within the word "shopping". Currently, if I run str_detect(example$remarks, "shop") I get a TRUE for both the words "shop" and "shopping". Effectively, I ONLY want a TRUE for the 4-character string "shop"; if the characters "shop" appear but are followed by any other characters, as in shop(ping), I want the code to exclude that match and not identify it as TRUE.
Also, if the remarks contain BOTH the words "shop" and "shopping", I would like the result to be TRUE only for detecting "shop" but not "shopping".
Ultimately, I'm hoping one line of code using str_detect() can give me the result of:
- If the remarks observation has only the word "shop" = TRUE
- If the remarks observation has only the word "shopping" = FALSE
- If the remarks observation has neither the word "shop" nor "shopping" = FALSE
- If the remarks observation has both the words "shop" AND "shopping" = TRUE, for detecting ONLY the 4-character string "shop"; it DOES NOT output a TRUE because of the word "shopping".
I need all of the observations to remain in the dataset and cannot exclude them because I need to create a new column, which I have labeled shop_YN, that gives a "Yes" for observations containing the 4-character word "shop". Once I have the correct str_detect() code, I plan to wrap the results in mutate() and if_else() as follows (except I don't know what to put inside str_detect() to get the results I need):
shop_YN <- example %>% mutate(shop_YN = if_else(str_detect(example$remarks, ), "Yes", "No"))
Here is a sample of the data using dput():
structure(list(price = c(195000, 213000, 215000, 240000, 241000,
250000, 255000, 256500, 260000, 263500, 265000, 277000, 280000,
280000, 150000), remarks = c("large home with a 1200 sf shop. great location close to shopping.",
"updated home close to shopping & schools.", "nice location. 2br home with updating.",
"huge shop on property!", "close to shopping.", "updated, clean, great location, garage.",
"close to shopping and massive shop on property.", "updated home near shopping, schools, restaurants.",
"large home with updated interior.", "close to schools, updated, stick-built shop 1500sf.",
"home and shop.", "near schools, shopping, restaurants. partially updated home.",
"located close to shopping. high quality home with shop in backyard.",
"brick 2-story. lots of shopping near by. detached garage and large shop in backyard.",
"fixer! needs work.")), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
CodePudding user response:
You are probably looking for a word boundary here (\\b). Wrap the desired pattern between two word boundaries to match just the word, but not parts of longer words.
library(dplyr)
library(stringr)
df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b'))
# A tibble: 15 × 3
price remarks shop_YN
<dbl> <chr> <lgl>
1 195000 large home with a 1200 sf shop. great location close to shopping. TRUE
2 213000 updated home close to shopping & schools. FALSE
3 215000 nice location. 2br home with updating. FALSE
4 240000 huge shop on property! TRUE
5 241000 close to shopping. FALSE
6 250000 updated, clean, great location, garage. FALSE
7 255000 close to shopping and massive shop on property. TRUE
8 256500 updated home near shopping, schools, restaurants. FALSE
9 260000 large home with updated interior. FALSE
10 263500 close to schools, updated, stick-built shop 1500sf. TRUE
11 265000 home and shop. TRUE
12 277000 near schools, shopping, restaurants. partially updated home. FALSE
13 280000 located close to shopping. high quality home with shop in backyard. TRUE
14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in back… TRUE
15 150000 fixer! needs work. FALSE
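To see what the word boundaries do in isolation, here is a minimal sketch on a few made-up strings (the vector x is hypothetical and not part of the question's data). It assumes only that stringr is installed.

```r
library(stringr)

# Hypothetical test strings, not taken from the question's data
x <- c("shop", "shopping", "workshop", "shop.", "shops")

# \\b matches the position between a word character and a non-word character
# (or the start/end of the string), so "shop" must stand alone as a word
hits <- str_detect(x, "\\bshop\\b")
hits
```

Only "shop" and "shop." should match here: punctuation after the word still counts as a boundary, while "shopping", "workshop", and "shops" have a letter directly next to "shop" and are rejected.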
If you want Yes or No instead of the logical shop_YN, just pipe the output of str_detect into ifelse:
df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b') %>% ifelse('Yes', 'No'))
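The same idea can be written in the mutate() plus if_else() shape the question describes. The sketch below is one possible variant, not the only way to do it: example_small is a hypothetical three-row tibble, and regex(..., ignore_case = TRUE) is used as an assumed alternative to lower-casing the column first with str_to_lower().

```r
library(dplyr)
library(stringr)

# Hypothetical mini data set standing in for the question's `example` tibble
example_small <- tibble(
  remarks = c(
    "Huge SHOP on property!",
    "Close to shopping.",
    "Home and shop; near shopping."
  )
)

# Refer to the column by bare name inside mutate() (no example$ prefix needed)
result <- example_small %>%
  mutate(shop_YN = if_else(
    str_detect(remarks, regex("\\bshop\\b", ignore_case = TRUE)),
    "Yes", "No"
  ))

result
```

If remarks can contain NA, str_detect() returns NA for those rows and if_else() passes it through; its missing argument can supply a fallback value if one is wanted.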
CodePudding user response:
We could also use grepl instead of str_detect:
df %>%
mutate(check = grepl("\\bshop\\b", remarks))
price remarks check
<dbl> <chr> <lgl>
1 195000 large home with a 1200 sf shop. great location close to shopping. TRUE
2 213000 updated home close to shopping & schools. FALSE
3 215000 nice location. 2br home with updating. FALSE
4 240000 huge shop on property! TRUE
5 241000 close to shopping. FALSE
6 250000 updated, clean, great location, garage. FALSE
7 255000 close to shopping and massive shop on property. TRUE
8 256500 updated home near shopping, schools, restaurants. FALSE
9 260000 large home with updated interior. FALSE
10 263500 close to schools, updated, stick-built shop 1500sf. TRUE
11 265000 home and shop. TRUE
12 277000 near schools, shopping, restaurants. partially updated home. FALSE
13 280000 located close to shopping. high quality home with shop in backyard. TRUE
14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in backyard. TRUE
15 150000 fixer! needs work. FALSE
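For completeness, the Yes/No column can also be built without any packages. This is a rough base R sketch on a hypothetical remarks vector; ignore.case = TRUE is an assumption that replaces the lower-casing step from the question.

```r
# Hypothetical remarks, not taken from the question's data
remarks <- c(
  "huge shop on property!",
  "close to shopping.",
  "Shop and shopping nearby."
)

# grepl() returns one logical per element; \\b restricts the match to the whole word
has_shop <- grepl("\\bshop\\b", remarks, ignore.case = TRUE)

# Convert the logical vector to "Yes"/"No" labels
shop_YN <- ifelse(has_shop, "Yes", "No")
shop_YN
```

Every element keeps its position in the result, so the vector can be assigned straight back to a data frame column without dropping rows.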