R: How can find matches in a character column based on each row of a second column?-CodePudding

I have the following problem. In my dataset A I have a column which contains description of infrastructural projects. These are string of text which contain various infos about the project including sometimes the name of locality or multiple localities where the project was implemented. I have a separate dataset B which is basically a single column with the name of each locality (there are around 3000 of these as these are quite small).

Dataset A (14000 observations) looks something like:

Project ID	Description
1	Road Construction in locality 1
2	Park construction in locality 2 and locality 3
3	Aqueduct construction in locality 1 and locality 4 and locality 5
4	Park construction

As you can see some localities are repeated (also some observations do not mention the locality name).

While Dataset B looks like:

Locality Name
locality 1
locality 2
locality 3
locality 4
locality 5

Basically what I'd like to achieve is to create a new column in dataset A for each locality mentioned in the description column.

Project ID	Description	NewCol1	NewCol2	NewCol3
1	Road Construction in locality 1	locality 1	NA	NA
2	Park construction in locality 2 and locality 3	locality 2	locality 3	NA
3	Aqueduct construction in locality 1 and locality 4 and locality 5	locality 1	locality 4	locality 5
4	Park construction	NA	NA	NA

However, I realise this might be quite complicated to achieve with R, so I would also be happy with having a new column listing all the localities mentioned in the description column, like:

Project ID	Description	NewCol1
1	Road Construction in locality 1	locality 1
2	Park construction in locality 2 and locality 3	locality 2 locality 3
3	Aqueduct construction in locality 1 and locality 4 and locality 5	locality 1 locality 4 locality 5
4	Park construction	NA

How would I achieve this in R? Thank you in advance for your help!

CodePudding user response：

df %>%
  cbind(
    stringr::str_extract_all(.$Description, 
                             "locality [0-9] ", 
                             simplify = TRUE))

  Project_ID                                                       Description          1          2          3
1          1                                   Road Construction in locality 1 locality 1                      
2          2                    Park construction in locality 2 and locality 3 locality 2 locality 3           
3          3 Aqueduct construction in locality 1 and locality 4 and locality 5 locality 1 locality 4 locality 5
4          4                                                 Park construction

CodePudding user response：

library(dplyr)
library(stringr)
library(tidyr)

#get these location from dataframe B - location = df.B$location.name
locations = c("locality 1","locality 2","locality 3","locality 4","locality 5")

#add word boundaries to locations regex
locations_regex <- paste0("\\b",paste0(locations,collapse = "\\b|\\b"),"\\b")

df <- data.frame(
  Project.ID = 1:4,
  Description = c("Road Construction in locality 1",
                  "Park construction in locality 2 and locality 3",
                  "Aqueduct construction in locality 1 and locality 4 and locality 5",
                  "Park construction")
)

#the regex is agnostic to "locality" and will take any string in dataframe B for location names
df %>%
  mutate(locations = str_extract_all(Description,locations_regex,simplify=T))