I'm struggling with a string-extraction problem (see the example below). If you could help me, I'd be most grateful!
Note: apologies for my lack of regex knowledge here.
Objective: I have a reference vector of search terms and a target column in a data frame. For each row, I want to find which reference term (if any) matches the target text, and store that term in a new column.
Example of Target Data Frame, Search Text and attempted method so far:
library(dplyr)    # %>% and mutate()
library(stringr)  # str_extract()
a <- c(1, 2, 3, 4, 5, 6, 7)
b <- c('TC2', 'TC25', 'TC255', 'Tops', 'TC2_', 'TC2 ', 'TC2555')
df <- data.frame(a, b)
search_text <- c('TC2', 'TC255')
# '[regexp]' is a placeholder for the suffix patterns I tried (see below)
search_string <- paste(paste0(search_text, '[regexp]'), collapse = "|")
df %>%
  mutate(match = str_extract(b, search_string))
[regexp] denotes the various things I've tried to get this method to work. It has included all sorts of hare-brained ideas like '\\d?' and so on (more combinations of this and similar than I care to remember), all to no avail.
Desired Output:
Ultimately I'd like to get to this....
a <- c(1, 2, 3, 4, 5, 6, 7)
b <- c('TC2', 'TC25', 'TC255', 'Tops', 'TC2_', 'TC2 ', 'TC2555')
match <- c('TC2', NA_character_, 'TC255', NA_character_, 'TC2', 'TC2', NA_character_)
df_desired <- data.frame(a, b, match)
Your help would be greatly appreciated
Many Thanks
CodePudding user response:
search_string <- paste0("(", paste(search_text, collapse = "|"), ")(?![A-Za-z0-9])")
search_string
# [1] "(TC2|TC255)(?![A-Za-z0-9])"
df_desired %>%
mutate(match2 = str_extract(b, search_string))
#   a      b match match2
# 1 1    TC2   TC2    TC2
# 2 2   TC25  <NA>   <NA>
# 3 3  TC255 TC255  TC255
# 4 4   Tops  <NA>   <NA>
# 5 5   TC2_   TC2    TC2
# 6 6   TC2    TC2    TC2
# 7 7 TC2555  <NA>   <NA>
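This uses a negative lookahead, (?![A-Za-z0-9]), which asserts that the character following the match is not a letter or digit. The lookahead only checks that character and does not consume it, so it is never included in the extracted text.
(I initially thought to use \\b for a word boundary, but that fails for 'TC2_': the _ counts as a word character, so there is no boundary between the 2 and the _.)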
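To make the difference between the two approaches concrete, here is a small side-by-side sketch using the question's own data. The names alternatives, with_boundary and with_lookahead are made up for illustration, and it assumes stringr is installed.

```r
library(stringr)

b <- c('TC2', 'TC25', 'TC255', 'Tops', 'TC2_', 'TC2 ', 'TC2555')
search_text <- c('TC2', 'TC255')

# Join the search terms into one alternation: "TC2|TC255"
alternatives <- paste(search_text, collapse = "|")

# Word boundary: "_" is a word character, so no boundary follows TC2 in "TC2_"
with_boundary <- str_extract(b, paste0("(", alternatives, ")\\b"))

# Negative lookahead: only a letter or digit after the match is rejected
with_lookahead <- str_extract(b, paste0("(", alternatives, ")(?![A-Za-z0-9])"))

data.frame(b, with_boundary, with_lookahead)
```

The two result columns should agree on every row except 'TC2_', where the word-boundary version gives NA and the lookahead version gives 'TC2'. If underscores should also block a match, adding _ to the character class would be one way to do it.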