Regex using back references in R

I wrote this regex at https://regex101.com/r/R8ObNk/1: (^[^\\]*)\\t([^\\]*)\\t([^\\]*)\\t([^\\]*)\\t([^\\]*)(.*), with a back reference to capture group 5, i.e. "\5".

For some reason, when I use the regex above in R with gsub, I do not get the correct result.

Here is the dput for first line of the data that I am trying to back reference:

structure(list(value = "19-22\t\t4\tP,G\tDOB_TT\t\tTime of Birth\t\t126\t \t0000-2359 Time of Birth"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

This is the gsub on the line above: gsub(pattern = "(^[^\\]*)\\t([^\\]*)\\t([^\\]*)\\t([^\\]*)\\t([^\\]*)(.*)", replacement = "\\5", x = a$value). I do know you're supposed to add another "\" when working with regex in R, but still that didn't work.

The intended result of the gsub is "DOB_TT", i.e. the 5th capture group.

CodePudding user response：

You don't actually need regexes in this case, since your data is structured:

parsed <- read.delim(text=a$value, header=FALSE)
parsed$V5
# [1] "DOB_TT"
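If building a data frame feels heavier than needed, splitting on the tab character is another regex-free option. This is only a minimal sketch: the string is copied from the dput in the question, and the names value and fields are made up for illustration.

```r
# The value from the question's dput; each \t in the literal is a single tab character.
value <- "19-22\t\t4\tP,G\tDOB_TT\t\tTime of Birth\t\t126\t \t0000-2359 Time of Birth"

# fixed = TRUE treats the separator as a literal tab, not as a regular expression.
fields <- strsplit(value, "\t", fixed = TRUE)[[1]]

fields[5]
# [1] "DOB_TT"
```

strsplit returns a list with one element per input string, hence the [[1]] before indexing the fifth field.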

CodePudding user response：

You need to be careful with escape characters. Note that R uses an extra "\" in string literals, which the website will not understand. And when you see a string like

x <- "a\tb"

in R, there is no literal backslash in the string. The \t is the escape sequence for a tab character. So nchar(x) returns 3, not 4, because the two characters \t together represent a single tab. So given your data, what you really want is

gsub(pattern = "(^[^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)(.*)",
  replacement = "\\5", x = a$value)

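You do not need an extra \ for the tabs, because tab characters aren't special in a regular expression. They are just ordinary characters that match themselves. Note also that the character class [^\\] in your original pattern excludes backslashes, not tabs, and your string contains no backslashes at all.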
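To check this yourself, here is a small sketch that inspects the example string and then runs the corrected pattern on the value from the question. The names value, pattern and result are just illustrative.

```r
x <- "a\tb"
nchar(x)                       # 3: "a", one tab, "b"
grepl("\\", x, fixed = TRUE)   # FALSE: the string contains no backslash

# The value from the question's dput.
value <- "19-22\t\t4\tP,G\tDOB_TT\t\tTime of Birth\t\t126\t \t0000-2359 Time of Birth"

# Each \t here is an actual tab character inside the pattern string.
pattern <- "(^[^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)(.*)"

# "\\5" is the R string literal for the regex back reference \5.
result <- gsub(pattern = pattern, replacement = "\\5", x = value)
result
# [1] "DOB_TT"
```

The only place that needs a doubled backslash is the replacement, because there the backslash itself has to reach the regex engine.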