Home > database >  Identify specific words and combinations of specific words in R
Identify specific words and combinations of specific words in R

Time:12-17

I have a requirement to identify specific words and combinations of specific words within a free text description column. My dataset contains two columns - a reference number and description. The data relates to repairs. I need to be able to determine which room the repair took place in for each reference number. This could include “kitchen”, “bathroom”, “dining room” amongst others.

The dataset looks like this

|reference|description             |
|————————-|———————————————————————-|
|123456   |repair light in kitchen |

The output I require is something like this:

|reference|Room    |
|————————-|————————|
|123456   |kitchen |

Any help very much appreciated.

CodePudding user response:

This will pull the first match from room_vector in each description.

room_vector = c("kitchen", "bathroom", "dining room")

library(stringr)
your_data$room = str_extract(your_data$description, paste(room_vector, collapse = "|"))

CodePudding user response:

This version takes into account the combination with the word repair:

library(dplyr)
library(stringr)

my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")

df %>% 
  mutate(Room = case_when(
    str_detect(description, "repair") &
      str_detect(description, pattern) ~ str_extract(description, pattern)))

If you apply the code to this dataframe:

  reference              description
1    123456 live in light in kitchen

you will get:

  reference              description Room
1    123456 live in light in kitchen <NA>

First version does not take the combination with the word repair into account: Similar to Gregor Thomas solution:

library(dplyr)
library(stringr)

my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")

df %>% 
  mutate(Room = case_when(
    str_detect(description, "repair") |
      str_detect(description, pattern) ~ str_extract(description, pattern)))
  reference             description    Room
1    123456 repair light in kitchen kitchen

CodePudding user response:

Using Base R:

rooms <- c("kitchen", "bathroom", "dining room")  
pat <- sprintf('.*repair.*(%s).*|.*', paste0(rooms, collapse = '|'))
transform(df, room = sub(pat, '\\1', reference))

                  reference        room
1           repair bathroom    bathroom
2             live bathroom            
3  repair lights in kitchen     kitchen
4           food in kitchen            
5         tv in dining room            
6 table repair dining room  dining room

Data:

df <- structure(list(reference = c("repair bathroom", "live bathroom", 
"repair lights in kitchen", "food in kitchen", "tv in dining room", 
"table repair dining room ")), class = "data.frame", row.names = c(NA, 
-6L))
  •  Tags:  
  • r
  • Related