Detecting accented characters in R

Is there a way, using grepl or another function, to detect all words that contain an accent? I don't want to ignore or strip the accents, which has been asked many times; I just want to detect every word that has an accent in it.

Thanks

CodePudding user response：

You can use tools::showNonASCII to detect characters that aren't in the ASCII set. That includes accented characters as well as some symbols and characters from other alphabets:

x <- c("aaaaaaaaaä", 
       "cccccccç", 
       "ccccccč", 
       "abc", 
       "€", 
       "$")

tools::showNonASCII(x)
#> 1: aaaaaaaaa<c3><a4>
#> 2: ccccccc<c3><a7>
#> 3: cccccc<c4><8d>
#> 5: <e2><82><ac>

Created on 2022-10-12 with reprex v2.0.2
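Since a non-ASCII check also flags symbols such as €, a narrower test for accents specifically could decompose each string and look for combining marks. This is only a sketch, assuming the stringi package is installed; has_accent() is a hypothetical helper name, not part of any package. Letters such as ø or ł have no canonical decomposition, so this approach would not flag them.

```r
library(stringi)

# Hypothetical helper: TRUE when a string contains a letter that decomposes
# into a base letter plus a combining mark (Unicode category Mn).
has_accent <- function(x) {
  decomposed <- stri_trans_nfd(x)   # e.g. U+00E4 becomes "a" + U+0308
  stri_detect_regex(decomposed, "\\p{Mn}")
}

# \u escapes keep the example independent of the file encoding
x <- c("aaaaaaaaa\u00e4", "ccccccc\u00e7", "cccccc\u010d", "abc", "\u20ac", "$")

has_accent(x)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

Unlike the non-ASCII check, the euro sign is not reported here because it does not decompose into a base letter and a mark.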

CodePudding user response：

In base R you could try:

Data:

txt <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "nnnnnñ")
# the fourth element has no accent

Find positions in vector:

grep("[\x7f-\xff]", txt)
# [1] 1 2 3 5

Or as a logical vector (TRUE/FALSE):

grepl("[\x7f-\xff]", txt)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE

And to subset data:

# Only with accents
txt[grepl("[\x7f-\xff]", txt)]
# [1] "aaaaaaaaaä" "cccccccç"   "ccccccč"    "nnnnnñ"  

# Only without accents
txt[!grepl("[\x7f-\xff]", txt)]
# [1] "abc"

# grep() also works for subsetting: use value = TRUE to return the matching
# strings, and add invert = TRUE to return the non-matching ones
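How a regex range over raw bytes is interpreted can depend on the locale and encoding, so a check on Unicode code points is one way to sidestep the regex engine entirely. A minimal base R sketch, assuming the input is (or can be converted to) UTF-8; has_non_ascii() is a hypothetical helper name:

```r
# Hypothetical helper: TRUE when a string contains any code point above 127.
has_non_ascii <- function(x) {
  vapply(
    enc2utf8(x),
    function(s) any(utf8ToInt(s) > 127L),  # utf8ToInt() takes one string at a time
    logical(1),
    USE.NAMES = FALSE
  )
}

# \u escapes keep the example independent of the file encoding
txt <- c("aaaaaaaaa\u00e4", "ccccccc\u00e7", "cccccc\u010d", "abc", "nnnnn\u00f1")

has_non_ascii(txt)
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE

which(has_non_ascii(txt))
#> [1] 1 2 3 5
```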

CodePudding user response：

Another solution: detect non-ASCII characters with stringr (txt is defined under Data at the end):

library(stringr)
str_detect(txt, "[^ -~]")
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

where [^ -~] is a negated character class: [ -~] matches the printable ASCII characters (space through tilde), so [^ -~] matches anything else, including accented letters.
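Control characters such as tab also fall outside the space-to-tilde range, so [^ -~] flags them too. A small base R sketch of the difference, using made-up sample strings; perl = TRUE is an assumption here, chosen so that the ranges are compared by code point:

```r
# Sample strings for illustration: plain ASCII, an embedded tab, an accented letter
s <- c("abc", "a\tb", "na\u00efve")

# Anything outside printable ASCII (space through tilde)
grepl("[^ -~]", s, perl = TRUE)
#> [1] FALSE  TRUE  TRUE

# Anything outside the ASCII range, control characters included
grepl("[^\\x01-\\x7F]", s, perl = TRUE)
#> [1] FALSE FALSE  TRUE
```

If the text may contain tabs or newlines, the second pattern is the safer choice for finding accented words.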

Or, using dplyr syntax:

library(dplyr)
library(stringr)
data.frame(txt) %>%
  filter(str_detect(txt, "[^ -~]"))
#>          txt
#> 1 aaaaaaaaaä
#> 2   cccccccç
#> 3    ccccccč
#> 4     nnnnnñ
#> 5       ynàn

Data:

txt <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "nnnnnñ", "xXXXz", "ynàn")