stringr package using str_detect - Search for one word and exclude word

Time:10-30

I have an example project and need to search for strings using the stringr package. To avoid dealing with different capitalizations, I started with str_to_lower(example$remarks), which makes the remarks all lower case. The remarks column describes residential properties.

I need to search for the word "shop". However, the word "shopping" is also in the remarks column and I don't want that word.

Some observations: a) have only the word "shop"; b) have only the word "shopping"; c) have neither "shop" nor "shopping"; d) have BOTH "shop" and "shopping".

When using str_detect(), I want a TRUE when it detects the word "shop", but I DO NOT want a TRUE when the string "shop" only appears inside the word "shopping". Currently, if I run str_detect(example$remarks, "shop") I get a TRUE for both "shop" and "shopping". Effectively, I ONLY want a TRUE for the 4-character word "shop"; if the characters "shop" are followed by any other letters, as in shop(ping), the match should be excluded and not reported as TRUE.

Also, if the remarks contain BOTH the words "shop" and "shopping", I would like the result to be TRUE only for detecting "shop" but not "shopping".

Ultimately, I'm hoping one line of code using str_detect() can give me the result of:

  1. If the remarks observation has only the word "shop" = TRUE
  2. If the remarks observation has only the word "shopping" = FALSE
  3. If the remarks observation has neither the word "shop" nor "shopping" = FALSE
  4. If the remarks observation has both the words "shop" AND "shopping" = TRUE, because of the 4-character word "shop" and not because of the word "shopping".

I need all of the observations to remain in the dataset and cannot exclude any, because I need to create a new column, which I have labeled shop_YN, that gives a "Yes" for observations containing the 4-character word "shop". Once I have the correct str_detect() code, I plan to wrap the results in mutate() and if_else() as follows (except I don't know what pattern to use inside str_detect() to get the results I need):

shop_YN <- example %>% mutate(shop_YN = if_else(str_detect(example$remarks, ), "Yes", "No"))

Here is a sample of the data using the dput():

structure(list(
  price = c(195000, 213000, 215000, 240000, 241000, 250000, 255000,
            256500, 260000, 263500, 265000, 277000, 280000, 280000, 150000),
  remarks = c(
    "large home with a 1200 sf shop. great location close to shopping.",
    "updated home close to shopping & schools.",
    "nice location. 2br home with updating.",
    "huge shop on property!",
    "close to shopping.",
    "updated, clean, great location, garage.",
    "close to shopping and massive shop on property.",
    "updated home near shopping, schools, restaurants.",
    "large home with updated interior.",
    "close to schools, updated, stick-built shop 1500sf.",
    "home and shop.",
    "near schools, shopping, restaurants. partially updated home.",
    "located close to shopping. high quality home with shop in backyard.",
    "brick 2-story. lots of shopping near by. detached garage and large shop in backyard.",
    "fixer! needs work."
  )
), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))

CodePudding user response:

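You are probably looking for a word boundary here (\\b; the backslash is doubled because it has to be escaped inside an R string). Wrap the desired pattern between two word boundaries to match just the whole word, but not occurrences inside longer words such as "shopping".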

library(dplyr)
library(stringr)

# df holds the sample data from the question (the dput() output)
df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b'))

# A tibble: 15 × 3
    price remarks                                                                          shop_YN
    <dbl> <chr>                                                                            <lgl>  
 1 195000 large home with a 1200 sf shop. great location close to shopping.                TRUE   
 2 213000 updated home close to shopping & schools.                                        FALSE  
 3 215000 nice location. 2br home with updating.                                           FALSE  
 4 240000 huge shop on property!                                                           TRUE   
 5 241000 close to shopping.                                                               FALSE  
 6 250000 updated, clean, great location, garage.                                          FALSE  
 7 255000 close to shopping and massive shop on property.                                  TRUE   
 8 256500 updated home near shopping, schools, restaurants.                                FALSE  
 9 260000 large home with updated interior.                                                FALSE  
10 263500 close to schools, updated, stick-built shop 1500sf.                              TRUE   
11 265000 home and shop.                                                                   TRUE   
12 277000 near schools, shopping, restaurants. partially updated home.                     FALSE  
13 280000 located close to shopping. high quality home with shop in backyard.              TRUE   
14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in back… TRUE   
15 150000 fixer! needs work.                                                               FALSE

If you want Yes or No instead of the logical shop_YN, just pipe the output of str_detect into ifelse:

df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b') %>% ifelse('Yes', 'No'))
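To check the pattern against the four cases listed in the question, here is a minimal sketch on a small made-up vector. The `cases` vector and the variable names are hypothetical and not part of the original data. The second call assumes you would rather not lower-case the column first and uses stringr's regex(ignore_case = TRUE) instead.

```r
library(stringr)

# Hypothetical remarks, one per case from the question
cases <- c(
  "huge shop on property!",               # only "shop"
  "close to shopping.",                   # only "shopping"
  "fixer! needs work.",                   # neither
  "close to shopping and massive shop."   # both
)

# \\b on both sides matches "shop" only as a whole word
has_shop <- str_detect(cases, "\\bshop\\b")
has_shop
# [1]  TRUE FALSE FALSE  TRUE

# Case-insensitive variant, as an alternative to calling str_to_lower() first
mixed_case <- c("Huge SHOP", "Shopping nearby")
has_shop_ci <- str_detect(mixed_case, regex("\\bshop\\b", ignore_case = TRUE))
has_shop_ci
# [1]  TRUE FALSE
```

The logical vector can then be passed to if_else() or ifelse() inside mutate(), as in the line above.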

CodePudding user response:

We could also use grepl instead of str_detect:

df %>% 
  mutate(check = grepl("\\bshop\\b", remarks))
    price remarks                                                                              check
    <dbl> <chr>                                                                                <lgl>
 1 195000 large home with a 1200 sf shop. great location close to shopping.                    TRUE 
 2 213000 updated home close to shopping & schools.                                            FALSE
 3 215000 nice location. 2br home with updating.                                               FALSE
 4 240000 huge shop on property!                                                               TRUE 
 5 241000 close to shopping.                                                                   FALSE
 6 250000 updated, clean, great location, garage.                                              FALSE
 7 255000 close to shopping and massive shop on property.                                      TRUE 
 8 256500 updated home near shopping, schools, restaurants.                                    FALSE
 9 260000 large home with updated interior.                                                    FALSE
10 263500 close to schools, updated, stick-built shop 1500sf.                                  TRUE 
11 265000 home and shop.                                                                       TRUE 
12 277000 near schools, shopping, restaurants. partially updated home.                         FALSE
13 280000 located close to shopping. high quality home with shop in backyard.                  TRUE 
14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in backyard. TRUE 
15 150000 fixer! needs work.                                                                   FALSE
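One thing worth checking with either approach: the word-boundary pattern is strict, so the plural "shops" is not matched, and neither is "shop" at the end of a compound like "workshop". The sketch below uses base R only. The vector `x` is made up for illustration, and whether plurals should count as a "Yes" is an assumption about your data that you would need to decide.

```r
# Hypothetical remarks, not taken from the sample data
x <- c("home and shop.", "two shops on site", "workshop in back", "close to shopping")

# Word boundaries on both sides: only the standalone word "shop" matches
shop_only <- grepl("\\bshop\\b", x)
shop_only
# [1]  TRUE FALSE FALSE FALSE

# An optional trailing "s" also accepts the plural "shops"
shop_or_shops <- grepl("\\bshops?\\b", x)
shop_or_shops
# [1]  TRUE  TRUE FALSE FALSE

# Turn the logical result into the Yes/No labels asked for in the question
shop_YN <- ifelse(shop_only, "Yes", "No")
shop_YN
# [1] "Yes" "No"  "No"  "No"
```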