How to extract specific parts of a sentence using regex in R?

I have this data frame:

res <- data.frame(id=c(1,2,3,4,5,6), text=c("(21) Nº do Pedido: BR 10 2016 015202 0 A2","(21) Nº do Pedido: (21) Nº do Pedido Anterior:BR 20 2016 011446 8 U2 BR 10 2016 011446 2","(21) Nº do Pedido: BR 10 2016 007903 9 A2","(21) Nº do Pedido: PI 1001284-2 A2","(21) Nº do Pedido: MU 8102871-7 U2","(21) Nº do Pedido: BR 10 2022 004466 0 (21) Nº do Pedido Anterior:BR 20 2016 011446 8 U2 BR 10 2016 011446 2"))

res %>% ifelse(
  stringr::str_subset(text, "^(21) Nº do Pedido: (21) Nº do Pedido Anterior:.*"),
  stringr::str_replace_all(text,".*:(.*)\\s{1,}[A-Z]", "") %>% stringr::str_trim( ),
stringr::str_replace_all(text,"^\\(\\d{2,}\\).+:\\s(.+)\\(", "\\1") %>% stringr::str_trim( ))->res$text

Expected output:

  id                   text
1  1 BR 10 2016 015202 0 A2
2  2 BR 20 2016 011446 8 U2
3  3 BR 10 2016 007903 9 A2
4  4 PI 1001284-2 A2
5  5 MU 8102871-7 U2
6  6 BR 10 2022 004466 0

Any idea how to solve this?

CodePudding user response：

Since your text column can contain more than one "BR" code and you only want the first occurrence, I'd use ifelse() to pick between two regexes: one for strings with several "BR" codes and one for strings with a single code.

library(stringr)
library(dplyr)

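res %>%
  mutate(text = ifelse(str_count(text, "BR") > 1,
                       # several codes: keep everything from the first "BR" up to the next " BR"
                       gsub("^.*?(BR.+(?= BR)).*$", "\\1", text, perl = TRUE),
                       # single code: keep everything from "BR" to the end
                       gsub("^.*?(BR.+).*$", "\\1", text, perl = TRUE)))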

  id                   text
1  1 BR 10 2016 015202 0 A2
2  2 BR 20 2016 011446 8 U2
3  3 BR 10 2016 007903 9 A2

I've also noticed that the part you want to extract follows a specific pattern, so we may be able to catch it with a single regex.

res %>%
  mutate(text = gsub("^.+?([A-Z]{2}\\s[0-9]{2}\\s[0-9]{4}\\s[0-9]{6}\\s[0-9]{1}\\s[A-Z][0-9]).*?$", "\\1", text))

  id                   text
1  1 BR 10 2016 015202 0 A2
2  2 BR 20 2016 011446 8 U2
3  3 BR 10 2016 007903 9 A2
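
The outputs above only show the "BR" rows. To also cover the "PI" and "MU" rows and the row without a trailing kind code, a looser pattern might work. This is only a minimal sketch: the name code_pattern and the assumed shape of a code (two capital letters, a run of digits, spaces or hyphens ending in a digit, and an optional suffix like "A2") are my assumptions based on the six sample rows, not a general rule for these numbers.

```r
# Sample data copied from the question
res <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  text = c("(21) Nº do Pedido: BR 10 2016 015202 0 A2",
           "(21) Nº do Pedido: (21) Nº do Pedido Anterior:BR 20 2016 011446 8 U2 BR 10 2016 011446 2",
           "(21) Nº do Pedido: BR 10 2016 007903 9 A2",
           "(21) Nº do Pedido: PI 1001284-2 A2",
           "(21) Nº do Pedido: MU 8102871-7 U2",
           "(21) Nº do Pedido: BR 10 2022 004466 0 (21) Nº do Pedido Anterior:BR 20 2016 011446 8 U2 BR 10 2016 011446 2"),
  stringsAsFactors = FALSE
)

# Assumed shape of a code (hypothetical, derived from the samples only):
# two capital letters, digits/spaces/hyphens ending in a digit,
# then an optional kind code such as " A2" or " U2"
code_pattern <- "[A-Z]{2} [0-9][0-9 -]*[0-9](?: [A-Z][0-9])?"

# The lazy prefix ^.*? stops at the first code in each string;
# strings with no match at all are returned unchanged by sub()
res$text <- sub(paste0("^.*?(", code_pattern, ").*$"), "\\1", res$text, perl = TRUE)

print(res)
```

On the six sample rows this should give the expected output from the question. Real data with other prefixes or number formats would probably need the pattern adjusted.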

CodePudding user response：

This is simple. Just capture the part after the label with a pattern like this:

\(21\) Nº do Pedido(?: Anterior)?: ?(.*)

(\(21\) Nº do Pedido(?: Anterior)?: ?(.*))
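
To try the first pattern from R, every backslash has to be doubled inside the string literal. A minimal sketch on two of the question's strings, where the variable names x, pattern and codes are only illustrative:

```r
# Two sample strings taken from the question's data
x <- c("(21) Nº do Pedido: BR 10 2016 015202 0 A2",
       "(21) Nº do Pedido: PI 1001284-2 A2")

# Same pattern as above, with each backslash doubled for the R string literal
pattern <- "\\(21\\) Nº do Pedido(?: Anterior)?: ?(.*)"

# sub() replaces the matched text with the first capture group
codes <- sub(pattern, "\\1", x, perl = TRUE)

print(codes)
```

Note that for strings where the label appears twice, the capture group starts right after the first label, so the result would still contain the second label and would need further handling.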