I have a requirement to identify specific words and combinations of specific words within a free text description column. My dataset contains two columns - a reference number and description. The data relates to repairs. I need to be able to determine which room the repair took place in for each reference number. This could include “kitchen”, “bathroom”, “dining room” amongst others.
The dataset looks like this
|reference|description |
|————————-|———————————————————————-|
|123456 |repair light in kitchen |
The output I require is something like this:
|reference|Room |
|————————-|————————|
|123456 |kitchen |
Any help very much appreciated.
CodePudding user response:
This will pull the first match from room_vector
in each description.
room_vector = c("kitchen", "bathroom", "dining room")
library(stringr)
your_data$room = str_extract(your_data$description, paste(room_vector, collapse = "|"))
CodePudding user response:
This version takes into account the combination with the word repair
:
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
mutate(Room = case_when(
str_detect(description, "repair") &
str_detect(description, pattern) ~ str_extract(description, pattern)))
If you apply the code to this dataframe:
reference description
1 123456 live in light in kitchen
you will get:
reference description Room
1 123456 live in light in kitchen <NA>
First version does not take the combination with the word repair
into account:
Similar to Gregor Thomas solution:
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
mutate(Room = case_when(
str_detect(description, "repair") |
str_detect(description, pattern) ~ str_extract(description, pattern)))
reference description Room
1 123456 repair light in kitchen kitchen
CodePudding user response:
Using Base R:
rooms <- c("kitchen", "bathroom", "dining room")
pat <- sprintf('.*repair.*(%s).*|.*', paste0(rooms, collapse = '|'))
transform(df, room = sub(pat, '\\1', reference))
reference room
1 repair bathroom bathroom
2 live bathroom
3 repair lights in kitchen kitchen
4 food in kitchen
5 tv in dining room
6 table repair dining room dining room
Data:
df <- structure(list(reference = c("repair bathroom", "live bathroom",
"repair lights in kitchen", "food in kitchen", "tv in dining room",
"table repair dining room ")), class = "data.frame", row.names = c(NA,
-6L))