I have the following df
A B
"Axon guidance" 1
"Chemical carcinogenesis - reactive oxygen species" 2
"Electron Transport Chain (OXPHOS system in mitochondria)" 3
"The citric acid (TCA) cycle and respiratory electron transport" 4
Using
grep(paste0("^", df[3,1], "$"), df[,1])
Gives integer(0) (no match)
Using
grep(paste0("^", df[2,1], "$"), df[,1])
Finds the exact match (an integer giving the row that contains the match)
Why can't grep find an exact match when the string contains parentheses?
CodePudding user response:
Because parentheses need double backslashes to be matched literally in R. Otherwise they are interpreted as special regex characters.
grep(paste0("\\)$"), df[,1])
#[1] 3
As stated in the comments, you could also use fixed = TRUE in the grep function to match the string as is
grep(df[3,1], df[,1], fixed = TRUE)
#[1] 3
To find an exact match you can simply use which:
which(df[,1] == df[3,1])
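One caveat worth noting: fixed = TRUE turns off regex interpretation but still matches substrings, so it can return extra rows when one entry is contained in another. A minimal sketch, using a hypothetical vector x that is not part of the question's data:

```r
# Hypothetical data (not from the question): the first entry is a substring of the second
x <- c("TCA (cycle)", "The TCA (cycle) pathway")

# fixed = TRUE treats the pattern literally but still matches substrings
fixed_hits <- grep(x[1], x, fixed = TRUE)

# == compares whole strings, so only the identical entry is returned
exact_hits <- which(x == x[1])

print(fixed_hits)
print(exact_hits)
```

If whole-string equality is what you need, which with == is probably the safer choice of the two.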
CodePudding user response:
As already noted, the problem here is that round brackets are metacharacters used to define capture groups in regex search patterns.
Two approaches you may wish to consider are:
- Sanitise the text being searched and the text used to create search patterns of the relevant characters
- Double escape the RegEx control characters in the search patterns
Generate Sample Data
df <- data.frame(A=c("Axon guidance",
"Chemical carcinogenesis - reactive oxygen species",
"Electron Transport Chain (OXPHOS system in mitochondria)",
"The citric acid (TCA) cycle and respiratory electron transport"),
B=1:4)
Demonstrate problem
grep(paste0("^", df[2,1], "$"), df[,1]) # <- the OP has an extra bracket here
grep(paste0("^", df[3,1], "$"), df[,1])
Option 1
Here we sanitise both the text being searched and the patterns used to search.
We are only sanitising round brackets here, but regex has other special characters (and there are cases where complex Unicode characters also create problems).
df$sanitised_text <- gsub("[()]", "", df$A)
Demonstrate Solution
grep(paste0("^", df[2, "sanitised_text"], "$"), df[,"sanitised_text"])
grep(paste0("^", df[3,"sanitised_text"], "$"), df[,"sanitised_text"])
Option 2 - Double escape the regex control characters
sanitise_search_patterns <- function(x){
y <- gsub("\\(", "\\\\(", x)
gsub("\\)", "\\\\)", y)
}
df$sanitised_search_patterns <- sanitise_search_patterns(df$A)
Demonstrate Solution
grep(paste0("^", df[2, "sanitised_search_patterns"], "$"), df[,"A"])
grep(paste0("^", df[3,"sanitised_search_patterns"], "$"), df[,"A"])
You could use either approach here. However, there are cases where non-control characters create similar false negatives, e.g. the many Unicode characters for whitespace and hyphens, or complex characters formed from more than one code point. Sanitising the search text may therefore still be worth considering alongside double escaping.
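Option 2 can be generalised beyond round brackets. The sketch below is only an illustration: escape_regex is a hypothetical helper name (not a base R function), and it assumes that prefixing each common regex metacharacter with a backslash is enough for your patterns.

```r
# Hypothetical helper (not from the original answers): prefix each common
# regex metacharacter with a backslash so the string is matched literally.
escape_regex <- function(x) {
  gsub("([.|()^{}+$*?\\[\\]\\\\])", "\\\\\\1", x, perl = TRUE)
}

df <- data.frame(A = c("Axon guidance",
                       "Chemical carcinogenesis - reactive oxygen species",
                       "Electron Transport Chain (OXPHOS system in mitochondria)",
                       "The citric acid (TCA) cycle and respiratory electron transport"),
                 B = 1:4,
                 stringsAsFactors = FALSE)

# Build one anchored pattern per row and look each one up in column A
patterns <- paste0("^", escape_regex(df$A), "$")
hits <- vapply(patterns, function(p) grep(p, df$A), integer(1), USE.NAMES = FALSE)
print(hits)
```

Each row should now find itself, including the two rows that contain parentheses.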
CodePudding user response:
We can use
grep("[)]$", df[,1])
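Inside a bracket expression the closing parenthesis is taken literally, so this is an alternative to the double backslash. A small sketch on a hypothetical vector x (not the question's data) suggests both forms select the same element:

```r
# Hypothetical data: only the first string ends with a closing parenthesis
x <- c("ends with (this)", "has (parens) inside", "no parens")

# Bracket expression: ")" is literal inside [ ]
bracket_hits <- grep("[)]$", x)

# Equivalent pattern using a double backslash escape
escaped_hits <- grep("\\)$", x)

print(bracket_hits)
print(escaped_hits)
```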