Understanding how to count the number of occurrences of a pattern in a string

My input:

library(tidyverse)
library(stringi)
tdf<-data.frame("foo"=c('|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|TI'), 
"bar"=c('|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI'), 
                 "xyz" = c('|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ReviewNG-ICV|TI|BB.2',
'|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ICV'),
                 "gaz" = c('|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI',
'|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI',
'|NG-BB.2|ICV|AI|TI','|BB.2'))

I am trying to count the number of occurrences of each label in my tdf. Every label has 4 "forms": the total count of occurrences, ReviewNG-label, NG-label, and finally the "pure" form, |label or |label|. For example, the label AI has a total over all matches, ReviewNG-AI, NG-AI, and the pure form |AI or |AI|. Here is my code:

pt_t <- c("AI" )
sum(stringi::stri_count_fixed(tdf, regex(pt_t)))
pt_rng <- c("ReviewNG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_rng)))
pt_ng<-c("NG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_ng)))
pt<-c("|AI","|AI|")
sum(stringi::stri_count_fixed(tdf, regex(pt)))

And my output:

Warning in stringi::stri_count_fixed(tdf, regex(pt_t)) :
  argument is not an atomic vector; coercing
[1] 30
Warning in stringi::stri_count_fixed(tdf, regex(pt_rng)) :
  argument is not an atomic vector; coercing
[1] 7
Warning in stringi::stri_count_fixed(tdf, regex(pt_ng)) :
  argument is not an atomic vector; coercing
[1] 14
Warning in stringi::stri_count_fixed(tdf, regex(pt)) :
  argument is not an atomic vector; coercing
[1] 15

First of all, I don't exactly understand the warning message. Now let's look at the counts. The total is OK, and ReviewNG-AI is still good. But the next ones are problematic: for NG-AI I understand that it double counts NG plus ReviewNG, and for the last one, the "pure" count of |AI or |AI|, I don't understand at all how it equals 15, when I count 16 manually.

I also tried stringr from the tidyverse, but there the output is really erroneous:

sum(str_count(tdf,pt))

res<-tdf %>% 
  summarise(across(everything(),
                   ~sum(str_count(.x, paste(pt)))))

rowSums(res)

CodePudding user response：

Maybe this kind of solution helps. Martin has already explained why the original attempt fails, so here is a different strategy. If all labels are separated by |, we could use pivot_longer and count them. Depending on your desired output:

library(dplyr)
library(tidyr)

tdf %>% 
  pivot_longer(
    everything()
  ) %>% 
  mutate(value = sub('\\|', '', value)) %>% 
  separate_rows(value, sep = "\\|") %>% 
  group_by(name, value) %>% 
  summarise(Labels = n())

   name  value         Labels
   <chr> <chr>          <int>
 1 bar   AI                12
 2 bar   BB.2               7
 3 bar   ReviewNG-BB.2      4
 4 foo   NG-BB.3            6
 5 foo   ReviewNG-BB.2     11
 6 foo   ReviewNG-BB.3      5
 7 foo   TI                 1
 8 gaz   AI                 4
 9 gaz   BB.2               1
10 gaz   BB.3               7
11 gaz   ICV                4
12 gaz   NG-BB.2            4
13 gaz   NG-TI              7
14 gaz   ReviewNG-AI        7
15 gaz   TI                 4
16 xyz   BB.2               4
17 xyz   ICV                8
18 xyz   NG-AI              7
19 xyz   ReviewNG-ICV       4
20 xyz   TI                 4
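
If the goal is the four counts per label rather than per column, the long format above can be taken one step further by splitting each value into its prefix and its bare label. This is only a sketch: it assumes the only prefixes are ReviewNG- and NG-, the column names form and label are my own, and tdf is rebuilt compactly with rep() so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, written compactly with rep()
tdf <- data.frame(
  foo = c(rep("|ReviewNG-BB.2|ReviewNG-BB.3", 5), rep("|ReviewNG-BB.2|NG-BB.3", 6), "|TI"),
  bar = c(rep("|AI|BB.2", 7), rep("|AI|ReviewNG-BB.2", 4), "|AI"),
  xyz = c(rep("|ICV|NG-AI", 7), rep("|ReviewNG-ICV|TI|BB.2", 4), "|ICV"),
  gaz = c(rep("|BB.3|ReviewNG-AI|NG-TI", 7), rep("|NG-BB.2|ICV|AI|TI", 4), "|BB.2")
)

forms <- tdf %>%
  pivot_longer(everything()) %>%
  mutate(value = sub("^\\|", "", value)) %>%    # drop the leading |
  separate_rows(value, sep = "\\|") %>%          # one row per label
  mutate(
    # assumed prefixes: "ReviewNG-" and "NG-"; everything else is "pure"
    form = case_when(
      startsWith(value, "ReviewNG-") ~ "ReviewNG",
      startsWith(value, "NG-")       ~ "NG",
      TRUE                           ~ "pure"
    ),
    label = sub("^(ReviewNG-|NG-)", "", value)   # bare label without prefix
  ) %>%
  count(label, form)

# AI: NG 7, ReviewNG 7, pure 16 (7 + 7 + 16 = 30, the total)
filter(forms, label == "AI")
```

Because every value is classified exactly once, NG-AI is no longer counted a second time inside ReviewNG-AI, and the total for a label is just the sum of its three forms.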

CodePudding user response：

The problem here is the use of a special character in regex: | is reserved for "or". To search for a literal |, we need to escape it as \\|. So, for example:

library(dplyr)
library(stringr)

pt <- c("\\|AI", "\\|AI\\|")

Now, we want to count every occurrence of |AI and |AI|, so the search pattern looks like this:

paste(pt, collapse = "|")
#> [1] "\\|AI|\\|AI\\|"

So, putting it all together:

tdf %>% 
  summarise(across(everything(),
                   ~sum(str_count(.x, paste(pt, collapse = "|")))))

returns

  foo bar xyz gaz
1   0  12   0   4
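
As for the stringi attempt: stri_count_fixed() matches literally, so | needs no escaping there, but it expects a character vector, not a data frame. My reading of the warning (an assumption, I have not checked the stringi source) is that tdf gets coerced to one long string per column, and the two patterns in pt are then recycled across those four strings, which would explain 15 instead of 16. A minimal sketch that sidesteps both issues by flattening the data first and passing one pattern at a time; the names x, patterns and counts are my own, and tdf is rebuilt with rep() so the snippet is self-contained:

```r
library(stringi)

# Same data as in the question, written compactly with rep()
tdf <- data.frame(
  foo = c(rep("|ReviewNG-BB.2|ReviewNG-BB.3", 5), rep("|ReviewNG-BB.2|NG-BB.3", 6), "|TI"),
  bar = c(rep("|AI|BB.2", 7), rep("|AI|ReviewNG-BB.2", 4), "|AI"),
  xyz = c(rep("|ICV|NG-AI", 7), rep("|ReviewNG-ICV|TI|BB.2", 4), "|ICV"),
  gaz = c(rep("|BB.3|ReviewNG-AI|NG-TI", 7), rep("|NG-BB.2|ICV|AI|TI", 4), "|BB.2")
)

# Coercing the data frame directly gives one deparsed string per column
length(as.character(tdf))                 # 4

# Flatten to a plain character vector with one element per cell
x <- unlist(tdf, use.names = FALSE)

# Including the leading | keeps "NG-AI" from also matching inside "ReviewNG-AI";
# "|AI" already covers "|AI|", so a single pattern is enough for the pure form
patterns <- c(total = "AI", review_ng = "|ReviewNG-AI", ng = "|NG-AI", pure = "|AI")

counts <- sapply(patterns, function(p) sum(stri_count_fixed(x, p)))
counts
# total 30, review_ng 7, ng 7, pure 16
```

One caveat: a literal "|AI" would also match a longer label that merely starts with AI (a hypothetical "|AIX"). The sample data has no such label, but if yours does, split on | and compare whole labels instead.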