I have a large dataframe that consists of company identifiers and phrases extracted from newspapers. It is very messy, and I want to clean it by removing rows conditionally.
Specifically, I want to remove rows in which more than 50% of the letters are upper-case.
I found this code in another post, which removes rows that consist entirely of upper-case letters:
data <- data[!grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$text), ]
How can I express it as a share of the total word or letter count?
CodePudding user response:
You could do this with regular expressions, but the stringi function stri_count_charclass provides a highly optimized way to count characters by category. The package manual documents the list of Unicode General Categories; here we use L for all letters and Lu for uppercase letters.
Something like this should accomplish what you need:
library(stringi)
data <- data.frame(text = c("Foo",
"BAr",
"BAZ"))
data[which(stri_count_charclass(data[["text"]],"[\\p{Lu}]") / stri_count_charclass(data[["text"]],"[\\p{L}]") < 0.5),]
# [1] "Foo"
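The one-liner above can be wrapped in a small helper so it is easier to reuse on a data frame with several columns. This is only a sketch: the name upper_share, the id column, and the choice to treat strings without any letters as having a share of 0 (instead of the NaN that 0/0 produces) are assumptions for illustration, not part of stringi. It also uses <= 0.5 so that rows with exactly 50% uppercase letters are kept, and drop = FALSE so the result stays a data frame.

```r
library(stringi)

# Hypothetical helper: share of uppercase letters among all letters per string
upper_share <- function(x) {
  n_upper  <- stri_count_charclass(x, "[\\p{Lu}]")
  n_letter <- stri_count_charclass(x, "[\\p{L}]")
  # Assumption: a string without letters gets a share of 0 rather than NaN (0/0)
  ifelse(n_letter == 0, 0, n_upper / n_letter)
}

data <- data.frame(id = 1:4,
                   text = c("Foo", "BAr", "BAZ", "123"),
                   stringsAsFactors = FALSE)

# Keep rows where at most half of the letters are uppercase;
# drop = FALSE keeps the result a data frame
cleaned <- data[upper_share(data$text) <= 0.5, , drop = FALSE]
```

With this sample data, cleaned should contain the rows for "Foo" and "123". If rows without letters should be removed instead, return NA in the helper and filter with which(), as in the original line.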
One note: I updated my answer because my original response failed to point out a powerful feature of stringi. My instinctive reaction was to use [a-z] and [A-Z] to signify lower- and upper-case characters, respectively. However, using Unicode general categories allows the solution to work for non-ASCII characters as well.
x = c("Foo",
"BAr",
"BAZ",
"Ḟoo",
"ḂÁr",
"ḂÁẒ")
stri_count_charclass(x,"[A-Z]")/stri_count_charclass(x,"[[a-z][A-Z]]")
# [1] 0.3333333 0.6666667 1.0000000 0.0000000 0.0000000 NaN
stri_count_charclass(x,"[\\p{Lu}]")/stri_count_charclass(x,"[\\p{L}]")
# [1] 0.3333333 0.6666667 1.0000000 0.3333333 0.6666667 1.0000000
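The question also asks about a share of the total word count. One possible way to approach that with the same Unicode categories is sketched below. The helper name upper_word_share is hypothetical, and it rests on two assumptions: a "word" is any run of letters, and a word counts as uppercase only if every letter in it is uppercase (so a single capital letter such as "A" counts too).

```r
library(stringi)

# Hypothetical helper: share of words written entirely in uppercase letters
upper_word_share <- function(x) {
  # Assumption: a "word" is a maximal run of Unicode letters
  words <- stri_extract_all_regex(x, "\\p{L}+", omit_no_match = TRUE)
  vapply(words, function(w) {
    if (length(w) == 0) return(0)            # no words at all: share of 0
    mean(stri_detect_regex(w, "^\\p{Lu}+$")) # fraction of all-uppercase words
  }, numeric(1))
}

x <- c("BREAKING NEWS: Shares fall",
       "All quiet",
       "\u00c9COLE fran\u00e7aise",   # non-ASCII letters, written as escapes
       "1234")

shares <- upper_word_share(x)
```

For these four strings the shares should come out as 0.5, 0, 0.5 and 0. Filtering then works the same way as for letters, for example data[upper_word_share(data$text) <= 0.5, , drop = FALSE].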