How to "unwrap" regexp helpers like [:digit:] in str

Assuming the following data, I want to count the unique characters per row.

test <- data.frame(oe = c("A-1", "111", "-", "Sie befassen sich intensiv damit"))

So I thought I could use the [:graph:] helper to capture letters, numbers and punctuation. However, it gives the wrong results, see below:

library(tidyverse)
test %>%
  mutate(unique_chars_correct = sapply(tolower(oe), function(x) sum(str_count(x, c(letters, 0:9, "-")) > 0)),
         unique_chars_wrong   = sapply(tolower(oe), function(x) sum(str_count(x, "[:graph:]") > 0)))

which gives:

                                oe unique_chars_correct unique_chars_wrong
1                              A-1                    3                  1
2                              111                    1                  1
3                                -                    1                  1
4 Sie befassen sich intensiv damit                   13                  1

I assume that using [:graph:] only checks whether any of the characters belongs to [:graph:], but what I want is to check against every individual element that is part of [:graph:].

CodePudding user response：

[:graph:] gives the total number of matching characters per string; it does not distinguish between unique characters:

> str_count(test$oe, "[:graph:]")
[1]  3  3  1 28

Because this is a single number per string, converting it to a logical (> 0) and taking the sum returns just 1. It also doesn't differentiate between numbers, letters and punctuation.
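A minimal sketch of the difference, assuming stringr is installed; the names `x`, `per_char` and `per_class` are only for illustration. With a vector of patterns, str_count() returns one count per pattern, whereas a single class pattern yields one count for the whole string:

```r
library(stringr)

x <- tolower("A-1")

# One pattern per candidate character: str_count() returns one count per pattern.
per_char <- str_count(x, c(letters, 0:9, "-"))
length(per_char)    # one entry for each of the 37 candidate characters
sum(per_char > 0)   # number of distinct candidates that occur in x

# A single class pattern: str_count() returns one count for the whole string.
per_class <- str_count(x, "[:graph:]")
length(per_class)   # a single entry
sum(per_class > 0)  # therefore never more than 1
```

This is why the second column in the question is 1 for every row: there is only one pattern, so there is only one logical value to sum.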

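If we need the number of character classes present in each string, we can sum the per-class indicators. Note that this counts classes (letters, digits, punctuation), not distinct characters, so it matches the expected result only when each class contributes a single distinct character:

Reduce(`+`, lapply(c("[:alpha:]", "[:digit:]", "[:punct:]"), 
        function(x) str_count(tolower(test$oe), x) > 0))
[1] 3 1 1 1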

Or we may split each string into single characters and then apply [:graph:] to the unique values:

sapply(strsplit(tolower(test$oe), ""), function(x)
      sum(str_count(unique(x), "[:graph:]") > 0))
[1]  3  1  1 13
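A related way to "unwrap" the class is to extract every character that matches it and count the distinct ones. The following is only a sketch under the assumption that stringr is available; `count_unique_graph` is a hypothetical helper name, not part of any package:

```r
library(stringr)

test <- data.frame(oe = c("A-1", "111", "-", "Sie befassen sich intensiv damit"))

# Hypothetical helper: pull out every character that matches [:graph:],
# then count the distinct characters in each string.
count_unique_graph <- function(x) {
  matches <- str_extract_all(tolower(x), "[:graph:]")  # list, one vector per string
  lengths(lapply(matches, unique))
}

count_unique_graph(test$oe)
```

Because whitespace is not part of [:graph:], spaces are never extracted and so do not add to the count.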

CodePudding user response：

You can use a backreference and a negative lookahead for this:

Data:

test <- data.frame(oe = c("A-1", "111", "-", "Abaa", "B cbb b"))

EDITED Solution (whitespace is not counted, and the distinction between upper and lower case is disregarded):

library(stringr)
str_count(test$oe, "(?i)([^\\s])(?!.*\\1)")
[1] 3 1 1 2 2

How this works:

(?i): case-insensitive match
([^\\s]): a capture group matching any character that is not a whitespace char
(?!: the start of a negative lookahead; the captured character is matched (and counted by str_count) only if it is not followed by:
.*: any character occurring zero or more times, and then
\\1: a backreference recalling the exact match of the capture group ([^\\s]); inside the negative lookahead this prevents any character that is repeated later in the string from being matched, so only the last occurrence of each character is counted
): end of negative lookahead
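To see which characters the pattern keeps, the matches can be extracted instead of counted. This is a small sketch assuming stringr is installed; it writes the capture group as `\\S`, which is equivalent to `[^\\s]`, and the names `x` and `last_occurrences` are illustrative only:

```r
library(stringr)

x <- c("Abaa", "B cbb b")

# Each match is a non-whitespace character that is not repeated
# (ignoring case) anywhere later in the same string.
last_occurrences <- str_extract_all(x, "(?i)(\\S)(?!.*\\1)")
last_occurrences

# The number of matches per string equals the result of str_count().
lengths(last_occurrences)
```

In "Abaa" the leading "A" is skipped because an "a" follows later, so only "b" and the final "a" are matched.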

EDIT:

alternatively you can use dplyr:

library(dplyr)
library(stringr)  # needed for str_split()
test %>%
  mutate(
    # set to lower-case and remove whitespace:
    oe = tolower(gsub("\\s", "", oe)),
    # split the strings into separate chars:
    oe_splt = str_split(oe, ""),
    # count unique chars (lapply always returns a list, so lengths() is safe):
    count_unq = lengths(lapply(oe_splt, unique)))