Extract the longest matching string in R

Time:11-04

I would like to extract, from a column, the names that match entries in a large character vector. In some cases the extracted string is incomplete because a name contains whitespace: a shorter name ("New") is matched instead of the longer one ("New York").

Here is a reproducible example:

library(stringr)
library(dplyr)
library(tidyr)
library(stringi)

data <- data.frame (address  = c("to New York street", "to New cafe", "to Paris avenue", "to London hostel"))

search_string<-c("London","Paris", "New", "New York")%>% paste(collapse = " |to ")

data %>% dplyr::mutate(temp_com = str_extract_all(paste(address), search_string)) 

This is the result:

             address  temp_com
1 to New York street   to New 
2        to New cafe   to New 
3    to Paris avenue to Paris 
4   to London hostel   London 

And this is what I would like:

             address  temp_com
1 to New York street   to New York 
2        to New cafe   to New 
3    to Paris avenue to Paris 
4   to London hostel   London 

Thank you very much for your help

CodePudding user response:

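Order your search strings from longest to shortest. In a regex alternation the alternatives are tried left to right and the first one that matches wins, so "to New" has to come after "to New York". (Also, I'm inferring that you intend to have "to " before your first search string; it is omitted in your current example.)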
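The effect of the ordering can be seen in isolation with a minimal sketch. It uses base R with perl = TRUE (PCRE) instead of stringr, on the assumption that PCRE handles alternation the same way as the ICU engine behind stringr, namely by trying the alternatives left to right. The variable names are only illustrative.

```r
# Same input, same alternatives, only the order inside the pattern differs.
x <- "to New York street"

# Shorter alternative first: the engine stops at "to New".
short_first <- regmatches(x, regexpr("to New|to New York", x, perl = TRUE))

# Longer alternative first: the engine finds "to New York".
long_first <- regmatches(x, regexpr("to New York|to New", x, perl = TRUE))

print(short_first)  # "to New"
print(long_first)   # "to New York"
```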

search_string <- c("London","Paris", "New", "New York")
search_string <- paste(paste("to", search_string[order(-nchar(search_string))]), collapse = "|")
search_string
# [1] "to New York|to London|to Paris|to New"

data %>%
  dplyr::mutate(temp_com = str_extract_all(paste(address), search_string)) 
#              address    temp_com
# 1 to New York street to New York
# 2        to New cafe      to New
# 3    to Paris avenue    to Paris
# 4   to London hostel   to London
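
If the vector of names is large, the same idea can be wrapped in a small helper. The following is only a sketch under several assumptions that are not in the original post: the function names build_pattern and extract_first are hypothetical, the names are treated as literal text (quoted with \Q...\E), a trailing \b is added so that "New" does not match inside a longer word, and the "to Newark station" row is made-up test input. It uses base R with perl = TRUE and returns a plain character vector with NA where nothing matches.

```r
# Hypothetical helper: build one alternation pattern, longest names first.
build_pattern <- function(places, prefix = "to ") {
  # Longest names first, so "New York" is tried before "New".
  ordered <- places[order(-nchar(places))]
  # \Q...\E quotes each name literally (assumes no name contains "\E").
  quoted <- paste0("\\Q", ordered, "\\E")
  # \b requires a word boundary after the name.
  paste0(prefix, "(?:", paste(quoted, collapse = "|"), ")\\b")
}

# Hypothetical helper: first match per element, NA when there is none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)  # regmatches() returns only the matches
  out
}

address <- c("to New York street", "to New cafe", "to Paris avenue",
             "to London hostel", "to Newark station")
pattern <- build_pattern(c("London", "Paris", "New", "New York"))
result <- extract_first(address, pattern)
print(result)
```

With this input the first four elements should come back as "to New York", "to New", "to Paris" and "to London", and the last one as NA, because the word boundary stops "New" from matching inside "Newark".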