Accent insensitive regex in R

Time:11-22

I'm trying to use filter(grepl()) to match some words in my column. Let's suppose I want to extract the word "Guartelá". In my column, I have variations such as "guartela", "guartelá" and "Guartela". To match both upper- and lowercase I'm using (?i). However, I haven't found a good way to match both the accented and unaccented forms (i.e., "guartelá" and "guartela").

I know that I can simply substitute á with a, but is there a way to make the match accent-insensitive in the code itself? It can be base R, tidyverse, or anything else; I don't mind.

Here's what my current line of code looks like:

cobras <- final %>% filter(grepl("(?i)guartelá", NAME) 
                           | grepl("(?i)guartelá", locality))

Cheers

CodePudding user response:

You can use stri_trans_general from stringi to remove all accents:

unaccent_chars <- stringi::stri_trans_general(
  c("guartelá", "with_é", "with_â", "with_ô"), "Latin-ASCII")
unaccent_chars
# [1] "guartela" "with_e"   "with_a"   "with_o" 
# grepl(paste(unaccent_chars, collapse = "|"), string)
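The same transliteration can also be applied to the column being searched rather than to the pattern, so that a single unaccented pattern covers every variant. A minimal sketch, assuming a small made-up data frame named `final` with a `locality` column (the values are illustrative and not taken from the question's data):

```r
library(stringi)

# Hypothetical example data standing in for the asker's `final` data frame
final <- data.frame(
  locality = c("Guartelá", "guartela", "GUARTELA", "Curitiba"),
  stringsAsFactors = FALSE
)

# Strip accents from the data, then match case-insensitively with a plain pattern
locality_ascii <- stri_trans_general(final$locality, "Latin-ASCII")
hits <- grepl("guartela", locality_ascii, ignore.case = TRUE)

# Keep the matching rows; the original (accented) values are left untouched
cobras <- final[hits, , drop = FALSE]
```

Because only a temporary copy of the column is transliterated, the rows that come back still carry their original spelling.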

CodePudding user response:

You can list the alternatives for each position inside a bracket expression ([...]), which matches any one of the enclosed characters, to cover the different combinations:

> string <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")
> grepl("[Gg]uartel[aá]", string)
[1]  TRUE  TRUE  TRUE  TRUE FALSE

CodePudding user response:

Another option using str_detect():

library(tidyverse)
tibble(name = c("guartela","guartelá", "Guartela", "Other")) |> 
  filter(str_detect(name, "guartela|guartelá|Guartela"))
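Listing every variant by hand gets long quickly. One possible alternative, sketched here under the assumption that the stringi package is available, is to let the ICU collator do the comparison: `stri_detect_coll()` with `strength = 1` compares base letters only, so case and accent differences should both be ignored. The vector `x` is made-up example data.

```r
library(stringi)

# Hypothetical example values
x <- c("guartela", "guartelá", "Guartela", "GUARTELÁ", "Other")

# strength = 1 requests primary-level collation: only base letters are
# compared, so differences in case and accents are not significant
found <- stri_detect_coll(x, "guartela", strength = 1)

x[found]
```

Inside a pipeline this would be used as `filter(stringi::stri_detect_coll(name, "guartela", strength = 1))`. Note that the pattern is treated as fixed text here, not as a regular expression.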