Assuming the following data, I want to count the unique characters per row.
test <- data.frame(oe = c("A-1", "111", "-", "Sie befassen sich intensiv damit"))
So I thought I could use the [:graph:] character class to capture letters, numbers and punctuation. However, it gives the wrong results, see below:
library(tidyverse)
test %>%
  mutate(unique_chars_correct = sapply(tolower(oe), function(x) sum(str_count(x, c(letters, 0:9, "-")) > 0)),
         unique_chars_wrong = sapply(tolower(oe), function(x) sum(str_count(x, "[:graph:]") > 0)))
which gives:
                                oe unique_chars_correct unique_chars_wrong
1                              A-1                    3                  1
2                              111                    1                  1
3                                -                    1                  1
4 Sie befassen sich intensiv damit                   13                  1
I assume that using [:graph:] only checks whether any character in the string belongs to [:graph:], whereas what I want is to check each member of [:graph:] separately.
CodePudding user response:
[:graph:] gives the total count of matching characters; it does not differentiate between the unique characters:
> str_count(test$oe, "[:graph:]")
[1]  3  3  1 28
Thus, when we convert the count to a logical (> 0) and take the sum, it returns just 1 per string; it doesn't differentiate between numbers, letters and punctuation.
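A minimal sketch of the difference, assuming stringr is installed; the variable names `total`, `any_graph` and `per_pattern` are only illustrative. With a single pattern, str_count returns one number per string, so the `> 0` comparison can yield at most one TRUE. With a vector of patterns, it returns one count per pattern.

```r
library(stringr)

x <- "a-1"

# One pattern: a single total count of matching characters
total <- str_count(x, "[:graph:]")              # 3

# Comparing that single count with 0 gives one TRUE, so the sum is 1
any_graph <- sum(str_count(x, "[:graph:]") > 0) # 1

# A vector of patterns: one count per pattern, so the sum counts
# how many of the patterns occur at least once
per_pattern <- sum(str_count(x, c("a", "b", "1", "-")) > 0) # 3
```

This is why the version in the question that passes c(letters, 0:9, "-") works: each candidate character is tested as its own pattern.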
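If we need the expected output for the first three strings, we can count how many of the classes [:alpha:], [:digit:] and [:punct:] occur in each one. Note that this counts classes, not distinct characters, so the fourth string yields 1 rather than 13:
Reduce(`+`, lapply(c("[:alpha:]", "[:digit:]", "[:punct:]"),
    function(x) str_count(tolower(test$oe), x) > 0))
[1] 3 1 1 1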
Or we can split each string into single characters and then apply [:graph:] to the unique values:
sapply(strsplit(tolower(test$oe), ""), function(x)
    sum(str_count(unique(x), "[:graph:]") > 0))
[1]  3  1  1 13
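The split-and-unique idea can also be wrapped in a small base R function. This is only a sketch: `count_unique_chars` is a hypothetical helper name, and it assumes that "character" means any non-whitespace, printable character, compared case-insensitively.

```r
# Hypothetical helper: number of distinct [:graph:] characters per string,
# ignoring case
count_unique_chars <- function(x) {
  chars <- strsplit(tolower(x), "")
  vapply(
    chars,
    # keep only printable, non-space characters, then count the distinct ones
    function(ch) length(unique(ch[grepl("[[:graph:]]", ch)])),
    integer(1)
  )
}

res <- count_unique_chars(c("A-1", "111", "-", "Sie befassen sich intensiv damit"))
res
# [1]  3  1  1 13
```

vapply is used instead of sapply so that the result is always an integer vector, whatever the input looks like.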
CodePudding user response:
You can use a backreference and a negative lookahead for this:
Data:
test <- data.frame(oe = c("A-1", "111", "-", "Abaa", "B cbb b"))
EDITED solution (whitespace is not counted, and the distinction between upper and lower case is disregarded):
library(stringr)
str_count(test$oe, "(?i)([^\\s])(?!.*\\1)")
[1] 3 1 1 2 2
How this works:
(?i): case-insensitive matching
([^\\s]): a capture group matching any single character that is not whitespace
(?!: the start of a negative lookahead, which prevents a match (and hence inclusion in the str_count result) if what follows is present:
.*: any character, zero or more times
\\1: a backreference to whatever the capture group ([^\\s]) matched; inside the negative lookahead it rejects any character that occurs again later in the string, so only the last occurrence of each character is counted
): the end of the negative lookahead
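To see which characters the pattern actually keeps, the same regex can be passed to str_extract_all instead of str_count. This is a small illustrative sketch, assuming stringr is installed; the variable name `kept` is arbitrary.

```r
library(stringr)

x <- c("Abaa", "B cbb b")

# Extract the matches rather than counting them: each character is kept
# only at its last occurrence, because earlier occurrences fail the lookahead
kept <- str_extract_all(x, "(?i)([^\\s])(?!.*\\1)")
kept
# [[1]]
# [1] "b" "a"
#
# [[2]]
# [1] "c" "b"
```

The number of extracted characters per string is what str_count reports for the same pattern.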
EDIT: Alternatively, you can use dplyr:
library(dplyr)
library(stringr)  # for str_split()
test %>%
  mutate(
    # set to lower-case and remove whitespace:
    oe = tolower(gsub("\\s", "", oe)),
    # split the strings into separate chars:
    oe_splt = str_split(oe, ""),
    # count unique chars (lapply always returns a list, unlike sapply):
    count_unq = lengths(lapply(oe_splt, unique)))