Removing rows that contain more than a certain share of upper-case letters in R

I have a large data frame that consists of company identifiers and phrases extracted from newspapers. It is very messy, and I want to clean it by removing rows conditionally.

For this I want to remove rows in which more than 50% of the letters are upper-case.

I found this code in another post; it removes rows that consist entirely of upper-case letters:

data <- data[!grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$text), ]

How can I express it as a share of the total word or letter count?

CodePudding user response：

You could do this with regular expressions, but the stringi function stri_count_charclass provides a highly optimized way to count characters by category. The package manual documents the list of Unicode general categories; here we use L for all letters and Lu for upper-case letters.

Something like this should accomplish what you need:

library(stringi)

data <- data.frame(text = c("Foo",
                            "BAr",
                            "BAZ"))

# Keep rows where at most half of the letters are upper-case
data[which(stri_count_charclass(data[["text"]], "[\\p{Lu}]") / stri_count_charclass(data[["text"]], "[\\p{L}]") <= 0.5), ]
# [1] "Foo"
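
Two details of the one-liner above are easy to miss: with a single-column data frame the result collapses to a plain character vector, and rows without any letters produce 0/0 = NaN, which which() silently drops. A minimal sketch that makes both choices explicit might look like the following. The helper name upper_share, the extra id column, and the decision to treat letter-free strings as having a share of 0 (so they are kept) are assumptions for illustration, not part of the original answer.

```r
library(stringi)

# Hypothetical helper: share of upper-case letters among all letters.
upper_share <- function(x) {
  n_upper  <- stri_count_charclass(x, "[\\p{Lu}]")
  n_letter <- stri_count_charclass(x, "[\\p{L}]")
  # Assumption: strings without letters (0/0 = NaN) count as share 0.
  ifelse(n_letter == 0, 0, n_upper / n_letter)
}

# Made-up example data; "1234" contains no letters at all.
data <- data.frame(id   = 1:4,
                   text = c("Foo", "BAr", "BAZ", "1234"),
                   stringsAsFactors = FALSE)

# Keep rows where at most half of the letters are upper-case.
# drop = FALSE keeps the result a data frame in every case.
cleaned <- data[upper_share(data$text) <= 0.5, , drop = FALSE]
cleaned
```

Here the rows with "Foo" and "1234" survive. If letter-free rows should be removed instead, returning NA from the helper and wrapping the condition in which() would restore the behaviour of the one-liner.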

One note: I updated my answer because my original response failed to point out a powerful feature of stringi. My first instinct was to use [a-z] and [A-Z] for lower- and upper-case characters, respectively. However, using Unicode general categories lets the solution work for non-ASCII characters as well.

x = c("Foo",
      "BAr",
      "BAZ",
      "Ḟoo",
      "ḂÁr",
      "ḂÁẒ")
stri_count_charclass(x, "[A-Z]") / stri_count_charclass(x, "[[a-z][A-Z]]")
# [1] 0.3333333 0.6666667 1.0000000 0.0000000 0.0000000       NaN

stri_count_charclass(x, "[\\p{Lu}]") / stri_count_charclass(x, "[\\p{L}]")
# [1] 0.3333333 0.6666667 1.0000000 0.3333333 0.6666667 1.0000000
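
For completeness, the regular-expression route mentioned at the start of this answer can be sketched in base R without stringi. This is only an illustration: count_matches is a hypothetical helper, and it relies on gregexpr with perl = TRUE understanding the Unicode property classes \p{Lu} and \p{L}. How it behaves on non-ASCII text depends on the encoding of the input, so it is worth checking against your own data.

```r
# Hypothetical helper: number of regex matches in each element of x (base R only).
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

x <- c("Foo", "BAr", "BAZ")

# Upper-case letters divided by all letters, per string.
share <- count_matches(x, "\\p{Lu}") / count_matches(x, "\\p{L}")
round(share, 2)
# [1] 0.33 0.67 1.00
```

On large data the stringi version is likely the faster of the two, which is the reason it is the one recommended here.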