string extraction with regular expressions - str_extract, stringr, regex

Time:12-16

I'm struggling with a string extraction problem (see the example below). If you could help me, I'd be most grateful!

Note: apologies for my lack of regex knowledge here

Objective: I'm trying to match the text in a target vector against a reference vector, and create a new variable in the table that holds the matching reference text.

Example of Target Data Frame, Search Text and attempted method so far:

library(dplyr)    # for %>% and mutate()
library(stringr)  # for str_extract()

a <- c(1, 2, 3, 4, 5, 6, 7)
b <- c('TC2', 'TC25', 'TC255', 'Tops', 'TC2_', 'TC2   ', 'TC2555')

df <- data.frame(a, b)

search_text <- c('TC2', 'TC255')

search_string <- paste(paste0(search_text, '[regexp]'), sep = "", collapse = "|")

df %>% 
  mutate(match = str_extract(b, search_string))

[regexp] denotes the various things I've tried in order to get this method to work. It has included all sorts of hare-brained ideas like '\\d?' and so on (more combinations of this and similar than I care to remember). As you might imagine, to no avail.

Desired Output:

Ultimately I'd like to get to this....

a <- c(1, 2, 3, 4, 5, 6, 7)
b <- c('TC2', 'TC25', 'TC255', 'Tops', 'TC2_', 'TC2   ', 'TC2555')
match <- c('TC2', NA_character_, 'TC255', NA_character_, 'TC2', 'TC2', NA_character_)

df_desired <- data.frame(a, b, match)

Your help would be greatly appreciated

Many Thanks

CodePudding user response:

search_string <- paste0("(", paste(search_text, collapse = "|"), ")(?![A-Za-z0-9])")
search_string
# [1] "(TC2|TC255)(?![A-Za-z0-9])"

df_desired %>%
  mutate(match2 = str_extract(b, search_string))
#   a      b match match2
# 1 1    TC2   TC2    TC2
# 2 2   TC25  <NA>   <NA>
# 3 3  TC255 TC255  TC255
# 4 4   Tops  <NA>   <NA>
# 5 5   TC2_   TC2    TC2
# 6 6 TC2      TC2    TC2
# 7 7 TC2555  <NA>   <NA>

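This uses a negative lookahead, (?![A-Za-z0-9]), which asserts that the next character is not a letter or digit. A lookahead only checks what follows; it consumes nothing, so the character after the match is never included in the extracted text.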
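
To illustrate the "consumes nothing" point, here is a small sketch comparing the lookahead with a consuming character class. The vector x and the variable names are made up for this illustration; only the lookahead pattern comes from the code above.

```r
library(stringr)

# Illustrative inputs (not the full data frame from the question)
x <- c("TC2", "TC2_", "TC25")

# Consuming character class: requires a following non-alphanumeric
# character and includes it in the extracted text.
consuming <- str_extract(x, "TC2[^A-Za-z0-9]")
# "TC2" -> NA (nothing follows), "TC2_" -> "TC2_", "TC25" -> NA

# Negative lookahead: only checks what follows, so the match stops at "TC2"
# and also succeeds at the end of the string.
lookahead <- str_extract(x, "TC2(?![A-Za-z0-9])")
# "TC2" -> "TC2", "TC2_" -> "TC2", "TC25" -> NA
```

The character-class version both misses the bare 'TC2' and drags the trailing _ into the result, which is why the lookahead is the better fit here.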

(I initially thought to use \\b for a word boundary, but that does not work for 'TC2_': the _ counts as a word character, so there is no boundary between the 2 and the _.)
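
For comparison, a minimal sketch of the two patterns run side by side on the vector b from the question. The names with_boundary and with_lookahead are just for this illustration.

```r
library(stringr)

b <- c("TC2", "TC25", "TC255", "Tops", "TC2_", "TC2   ", "TC2555")

# Word-boundary version: "_" is a word character, so "TC2_" has no
# boundary after the "2" and is not matched.
with_boundary <- str_extract(b, "(TC2|TC255)\\b")

# Negative-lookahead version: "_" is not in [A-Za-z0-9], so "TC2_" matches.
with_lookahead <- str_extract(b, "(TC2|TC255)(?![A-Za-z0-9])")

# The two results differ only for the fifth element, "TC2_".
data.frame(b, with_boundary, with_lookahead)
```

If underscores should also block a match, \\b would be the simpler choice; the lookahead is only needed because the desired output treats 'TC2_' as a hit.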
