Is there a way, using grepl or another function, to detect all words that contain an accent? I don't want to ignore accents, which has been asked many times; I just want to detect every word that has any accented character in it.
Thanks
CodePudding user response:
You can use tools::showNonASCII to detect characters that aren't in the ASCII set. That includes accented characters as well as some symbols and characters from other alphabets:
x <- c("aaaaaaaaaä",
"cccccccç",
"ccccccč",
"abc",
"€",
"$")
tools::showNonASCII(x)
#> 1: aaaaaaaaa<c3><a4>
#> 2: ccccccc<c3><a7>
#> 3: cccccc<c4><8d>
#> 5: <e2><82><ac>
Created on 2022-10-12 with reprex v2.0.2
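Since showNonASCII also flags symbols such as €, here is a minimal sketch of one way to flag accents only. It assumes the stringi package is installed (it is not used elsewhere in this answer) and relies on Unicode decomposition: after NFD normalization an accented letter becomes a base letter plus a combining mark, which the Unicode category Mn matches. The name has_accent is just an illustrative variable.

```r
library(stringi)

x <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "€", "$")

# Decompose each string (NFD) so that "ä" becomes "a" + a combining diaeresis,
# then look for any combining mark (Unicode category Mn).
has_accent <- stri_detect_regex(stri_trans_nfd(x), "\\p{Mn}")
has_accent
#> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

Letters that have no canonical decomposition (for example "ø") would not be flagged by this approach, and text in scripts that use combining marks for other purposes would be, so treat it as a starting point rather than a complete accent detector.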
CodePudding user response:
In base R you could try:
Data:
txt <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "nnnnnñ")
# fourth position doesn't have any accent
Find positions in vector:
grep("[\x7f-\xff]", txt)
# [1] 1 2 3 5
Or as a logical vector (TRUE/FALSE):
grepl("[\x7f-\xff]", txt)
# [1] TRUE TRUE TRUE FALSE TRUE
And to subset data:
# Only with accents
txt[grepl("[\x7f-\xff]", txt)]
# [1] "aaaaaaaaaä" "cccccccç" "ccccccč" "nnnnnñ"
# Only without accents
txt[!grepl("[\x7f-\xff]", txt)]
#[1] "abc"
# grep(..., value = TRUE) returns the matching elements directly;
# add invert = TRUE to get the non-matching ones instead
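The question asks about words rather than whole strings, so the same idea can be applied after splitting text into words. This is only a sketch: the sentence is a made-up example, the split is on whitespace only (punctuation stays attached to words), and it uses a negated printable-ASCII class with perl = TRUE instead of the byte range above.

```r
sentence <- "El niño comió jamón con pan"   # hypothetical example text

# Split on whitespace, then keep the words containing any character
# outside the printable ASCII range (space through tilde).
words <- strsplit(sentence, "\\s+")[[1]]
accented <- words[grepl("[^ -~]", words, perl = TRUE)]
accented
#> [1] "niño"  "comió" "jamón"
```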
CodePudding user response:
Another solution - detect non-ASCII characters:
library(stringr)
str_detect(txt, "[^ -~]")
[1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE
where [^ -~] is a negated character class: [ -~] matches every printable ASCII character (the range from space to tilde), so the negated form matches anything outside that range.
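To see which characters trigger a match, the same pattern can be passed to str_extract_all. A small sketch, using a shortened example vector and an illustrative variable name (extra):

```r
library(stringr)

txt <- c("abc", "ynàn", "ccccccč")

# For each element, collect the characters outside the space-to-tilde range.
extra <- str_extract_all(txt, "[^ -~]")
lengths(extra)
#> [1] 0 1 1
extra[[2]]
#> [1] "à"

# Control characters such as tab are also outside that range, so they match too.
str_detect("a\tb", "[^ -~]")
#> [1] TRUE
```

If the input may contain tabs or newlines, it may be worth stripping or excluding them first so they are not mistaken for accented text.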
Or, using dplyr syntax:
library(dplyr)
library(stringr)
data.frame(txt) %>%
  filter(str_detect(txt, "[^ -~]"))
txt
1 aaaaaaaaaä
2 cccccccç
3 ccccccč
4 nnnnnñ
5 ynàn
Data:
txt <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "nnnnnñ", "xXXXz", "ynàn")