My input:
library(tidyverse)
library(stringi)
tdf<-data.frame("foo"=c('|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|TI'),
"bar"=c('|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI'),
"xyz" = c('|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ReviewNG-ICV|TI|BB.2',
'|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ICV'),
"gaz" = c('|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI',
'|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI',
'|NG-BB.2|ICV|AI|TI','|BB.2'))
I am trying to count the number of occurrences of each label in my tdf. Every label comes in four "forms": the total count of occurrences, ReviewNG-label, NG-label, and the "pure" form, |label or |label|. For example, the label AI has a total over all matches, plus ReviewNG-AI, NG-AI, and the pure form |AI or |AI|. Here is my code:
pt_t <- c("AI" )
sum(stringi::stri_count_fixed(tdf, regex(pt_t)))
pt_rng <- c("ReviewNG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_rng)))
pt_ng<-c("NG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_ng)))
pt<-c("|AI","|AI|")
sum(stringi::stri_count_fixed(tdf, regex(pt)))
And my output:
Warning in stringi::stri_count_fixed(tdf, regex(pt_t)) :
argument is not an atomic vector; coercing
[1] 30
Warning in stringi::stri_count_fixed(tdf, regex(pt_rng)) :
argument is not an atomic vector; coercing
[1] 7
Warning in stringi::stri_count_fixed(tdf, regex(pt_ng)) :
argument is not an atomic vector; coercing
[1] 14
Warning in stringi::stri_count_fixed(tdf, regex(pt)) :
argument is not an atomic vector; coercing
[1] 15
First of all, I don't really understand the warning message.
Now let's look at the counts. The total is OK, and ReviewNG-AI is still good. The next two are problematic: for NG-AI, I understand that it counts NG and ReviewNG together, so ReviewNG-AI is counted twice. For the last one, the "pure" count of |AI or |AI|, I don't understand at all how it comes out as 15 when I count 16 manually.
I also tried stringr from the tidyverse, but there the output is clearly wrong:
sum(str_count(tdf,pt))
res<-tdf %>%
summarise(across(everything(),
~sum(str_count(.x, paste(pt)))))
rowSums(res)
CodePudding user response:
Maybe something like this. Martin has already explained why the original approach fails, so here is a different strategy.
If all labels are separated by |, we could use pivot_longer and count them. Depending on your desired output:
library(dplyr)
library(tidyr)
tdf %>%
pivot_longer(
everything()
) %>%
mutate(value = sub('\\|', '', value)) %>%
separate_rows(value, sep = "\\|") %>%
group_by(name, value) %>%
summarise(Labels = n())
name value Labels
<chr> <chr> <int>
1 bar AI 12
2 bar BB.2 7
3 bar ReviewNG-BB.2 4
4 foo NG-BB.3 6
5 foo ReviewNG-BB.2 11
6 foo ReviewNG-BB.3 5
7 foo TI 1
8 gaz AI 4
9 gaz BB.2 1
10 gaz BB.3 7
11 gaz ICV 4
12 gaz NG-BB.2 4
13 gaz NG-TI 7
14 gaz ReviewNG-AI 7
15 gaz TI 4
16 xyz BB.2 4
17 xyz ICV 8
18 xyz NG-AI 7
19 xyz ReviewNG-ICV 4
20 xyz TI 4
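If totals over the whole data frame are wanted rather than per-column counts, the same pipeline can be adapted by not grouping on the column name. The following is only a sketch under assumptions: tdf_small is a hypothetical two-row stand-in for tdf, and totals is an illustrative name that does not appear in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's tdf (two rows, two columns)
tdf_small <- data.frame(
  bar = c("|AI|BB.2", "|AI"),
  gaz = c("|BB.3|ReviewNG-AI", "|ICV|AI|TI"),
  stringsAsFactors = FALSE
)

totals <- tdf_small %>%
  pivot_longer(everything()) %>%
  mutate(value = sub("\\|", "", value)) %>%  # drop the leading "|"
  separate_rows(value, sep = "\\|") %>%      # one label per row
  count(value, name = "Labels")              # total per label over all columns

print(totals)
```

Because each label becomes its own row, AI, NG-AI and ReviewNG-AI are counted as separate values, so the "pure" count does not pick up the prefixed forms.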
CodePudding user response:
Your problem here is the use of a special character in the regex: | is reserved for "or" in regular expressions. If we want to search for a literal |, we need to escape it as \\|. So, for example:
library(dplyr)
library(stringr)
pt <- c("\\|AI", "\\|AI\\|")
Now, we want to count every occurrence of |AI and |AI|, so the search pattern looks like this:
paste(pt, collapse = "|")
#> [1] "\\|AI|\\|AI\\|"
So, putting it all together:
tdf %>%
summarise(across(everything(),
~sum(str_count(.x, paste(pt, collapse = "|")))))
returns
foo bar xyz gaz
1 0 12 0 4
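For the stringi attempt in the question, the warning most likely appears because tdf is a data frame (a list) rather than an atomic character vector. The count of 15 appears to come from the two patterns being recycled against the inputs one by one instead of being combined with "or". Here is a minimal sketch of both points, assuming a hypothetical stand-in tdf_small and the illustrative names cells, paired and pure_total:

```r
library(stringi)

# Hypothetical two-column stand-in for the question's tdf
tdf_small <- data.frame(
  bar = c("|AI|BB.2", "|AI"),
  gaz = c("|BB.3|ReviewNG-AI", "|ICV|AI|TI"),
  stringsAsFactors = FALSE
)

# Flatten to an atomic character vector, so stringi has nothing to coerce
cells <- unlist(tdf_small, use.names = FALSE)

# Patterns are recycled element by element, not combined with "or":
# cells[1] and cells[3] are matched against "|AI",
# cells[2] and cells[4] against "|AI|"
paired <- stri_count_fixed(cells, c("|AI", "|AI|"))

# A single literal pattern is applied to every cell
pure_total <- sum(stri_count_fixed(cells, "|AI"))
```

Since every |AI| also contains |AI, counting the fixed string |AI alone already covers both pure forms, and fixed matching needs no escaping of |.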