I have the following problem. In my dataset A I have a column which contains description of infrastructural projects. These are string of text which contain various infos about the project including sometimes the name of locality or multiple localities where the project was implemented. I have a separate dataset B which is basically a single column with the name of each locality (there are around 3000 of these as these are quite small).
Dataset A (14000 observations) looks something like:
Project ID | Description |
---|---|
1 | Road Construction in locality 1 |
2 | Park construction in locality 2 and locality 3 |
3 | Aqueduct construction in locality 1 and locality 4 and locality 5 |
4 | Park construction |
As you can see some localities are repeated (also some observations do not mention the locality name).
While Dataset B looks like:
Locality Name |
---|
locality 1 |
locality 2 |
locality 3 |
locality 4 |
locality 5 |
Basically what I'd like to achieve is to create a new column in dataset A for each locality mentioned in the description column.
Project ID | Description | NewCol1 | NewCol2 | NewCol3 |
---|---|---|---|---|
1 | Road Construction in locality 1 | locality 1 | NA | NA |
2 | Park construction in locality 2 and locality 3 | locality 2 | locality 3 | NA |
3 | Aqueduct construction in locality 1 and locality 4 and locality 5 | locality 1 | locality 4 | locality 5 |
4 | Park construction | NA | NA | NA |
However, I realise this might be quite complicated to achieve with R, so I would also be happy with having a new column listing all the localities mentioned in the description column, like:
Project ID | Description | NewCol1 |
---|---|---|
1 | Road Construction in locality 1 | locality 1 |
2 | Park construction in locality 2 and locality 3 | locality 2 locality 3 |
3 | Aqueduct construction in locality 1 and locality 4 and locality 5 | locality 1 locality 4 locality 5 |
4 | Park construction | NA |
How would I achieve this in R? Thank you in advance for your help!
CodePudding user response:
df %>%
cbind(
stringr::str_extract_all(.$Description,
"locality [0-9] ",
simplify = TRUE))
Project_ID Description 1 2 3
1 1 Road Construction in locality 1 locality 1
2 2 Park construction in locality 2 and locality 3 locality 2 locality 3
3 3 Aqueduct construction in locality 1 and locality 4 and locality 5 locality 1 locality 4 locality 5
4 4 Park construction
CodePudding user response:
library(dplyr)
library(stringr)
library(tidyr)
#get these location from dataframe B - location = df.B$location.name
locations = c("locality 1","locality 2","locality 3","locality 4","locality 5")
#add word boundaries to locations regex
locations_regex <- paste0("\\b",paste0(locations,collapse = "\\b|\\b"),"\\b")
df <- data.frame(
Project.ID = 1:4,
Description = c("Road Construction in locality 1",
"Park construction in locality 2 and locality 3",
"Aqueduct construction in locality 1 and locality 4 and locality 5",
"Park construction")
)
#the regex is agnostic to "locality" and will take any string in dataframe B for location names
df %>%
mutate(locations = str_extract_all(Description,locations_regex,simplify=T))