grep can't find an exact match when the string contains parentheses ()

I have the following df

A                                                                   B
"Axon guidance"                                                     1
"Chemical carcinogenesis - reactive oxygen species"                 2
"Electron Transport Chain (OXPHOS system in mitochondria)"          3
"The citric acid (TCA) cycle and respiratory electron transport"    4

Using

 grep(paste0("^", df[3,1], "$"), df[,1]))

Returns integer(0), i.e. no match.

Using

 grep(paste0("^", df[2,1], "$"), df[,1]))

Finds the exact match (an integer giving the row that contains the match).

Why can't grep find an exact match when the string contains parentheses?

CodePudding user response：

Because parentheses need to be escaped with a double backslash to be matched literally in R. Otherwise they are interpreted as regex special characters.

grep("\\)$", df[,1])
#[1] 3

As stated in the comments, you could also set fixed = TRUE in the grep function to match the string as is:

grep(df[3,1], df[,1], fixed = TRUE)
#[1] 3

To find an exact match you can simply use which:

which(df[,1] == df[3,1])
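
A small sketch of how these options differ, using a hypothetical vector x rather than the df from the question. The point it tries to illustrate: fixed = TRUE switches off regex interpretation but still matches substrings, whereas == compares whole strings.

```r
# Hypothetical vector: the second element merely contains the first
x <- c("acid (TCA)", "The citric acid (TCA) cycle")
target <- x[1]

# Regex match: the parentheses form a group, so the pattern really means "acid TCA"
regex_hits <- grep(paste0("^", target, "$"), x)

# Literal substring match: finds every element that contains the target
fixed_hits <- grep(target, x, fixed = TRUE)

# Whole-string comparison: finds only the identical element
exact_hits <- which(x == target)

print(regex_hits)  # integer(0)
print(fixed_hits)  # [1] 1 2
print(exact_hits)  # [1] 1
```

So if a true exact match is what you need, which with == is probably the safest of the three.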

CodePudding user response：

As already noted, the problem here is that round brackets are metacharacters used to define capture groups in regex search patterns, so they group part of the pattern instead of matching literal brackets in the text.
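
A quick check of this (the bracket-free string is made up for illustration): the pattern built from row 3 matches the same text with the brackets removed, but not the original text.

```r
pattern <- "^Electron Transport Chain (OXPHOS system in mitochondria)$"

# The brackets are consumed as a capture group, so they match no characters
without_brackets <- grepl(pattern, "Electron Transport Chain OXPHOS system in mitochondria")
with_brackets    <- grepl(pattern, "Electron Transport Chain (OXPHOS system in mitochondria)")

print(without_brackets)  # [1] TRUE
print(with_brackets)     # [1] FALSE
```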

Two approaches you may wish to consider are:

1. Sanitise both the text being searched and the text used to create the search patterns by removing the relevant characters
2. Double escape the regex metacharacters in the search patterns

Generate Sample Data

df <- data.frame(A=c("Axon guidance", 
                     "Chemical carcinogenesis - reactive oxygen species", 
                     "Electron Transport Chain (OXPHOS system in mitochondria)",
                     "The citric acid (TCA) cycle and respiratory electron transport"),
                 B=1:4)

Demonstrate problem

grep(paste0("^", df[2,1], "$"), df[,1]) # <- the OP has an extra bracket here
grep(paste0("^", df[3,1], "$"), df[,1])

Option 1

Here we sanitise both the text being searched & the patterns used to search

Here we are only sanitising for round brackets, but regex has other special characters, and there are cases where complex Unicode characters also create problems.

df$sanitised_text <- gsub("[()]", "", df$A)

Demonstrate Solution

grep(paste0("^", df[2, "sanitised_text"], "$"), df[,"sanitised_text"]) 
grep(paste0("^", df[3,"sanitised_text"], "$"), df[,"sanitised_text"])

Option 2 - Double escape the regex control characters

sanitise_search_patterns <- function(x){
  # "\\\\(" in the replacement yields a literal backslash followed by (
  y <- gsub("\\(", "\\\\(", x)
  gsub("\\)", "\\\\)", y)
}
 
df$sanitised_search_patterns <- sanitise_search_patterns(df$A)

Demonstrate Solution

grep(paste0("^", df[2, "sanitised_search_patterns"], "$"), df[,"A"]) 
grep(paste0("^", df[3,"sanitised_search_patterns"], "$"), df[,"A"])
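
The function above escapes only round brackets. A sketch of a more general version follows. Note that escape_regex is a hypothetical helper name, and its character list covers the usual metacharacters of R's default regex syntax but is not guaranteed to be exhaustive (for example, it does not escape a literal backslash). With perl = TRUE, wrapping the text in \Q...\E is an alternative, assuming the text itself does not contain \E.

```r
# Hypothetical helper: put a backslash in front of common regex metacharacters
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

titles <- c("Axon guidance",
            "Electron Transport Chain (OXPHOS system in mitochondria)",
            "The citric acid (TCA) cycle and respiratory electron transport")

# Escaped pattern: the brackets are now matched literally
hit_escaped <- grep(paste0("^", escape_regex(titles[2]), "$"), titles)

# PCRE alternative: \Q...\E quotes everything in between
hit_quoted <- grep(paste0("^\\Q", titles[2], "\\E$"), titles, perl = TRUE)

print(hit_escaped)  # [1] 2
print(hit_quoted)   # [1] 2
```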

You could use either approach here, but there are cases where non-special characters create similar false negatives (e.g. the many Unicode characters for whitespace and hyphens, and complex characters formed from more than one code point), so sanitising the searched text may still be worth considering alongside double escaping.

CodePudding user response：

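We can use a bracket expression, inside which ) is treated literally, to find the strings that end with a closing parenthesis:

grep("[)]$", df[,1])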