Find, count, and compare pattern matches in each column of a data frame

My input:

df <- data.frame("Foo"=c("a","c","NG-c","d","e","f"), "Bar"=c("b","b","c","d","e","f"), "Baz" = c("a","a","c","NG-c","NG-c","d"),
                 "Gaz" = c("NG-c","NG-c","NG-c", "NG-a","NG-a","NG-a"))
patern <- c("a","c")

The problem looks a little complicated. I am trying to find, count, and compare pattern matches in each column of a data frame. For example, I want to find all occurrences of NG-c and report which column has the highest percentage of NG-c relative to the total in that column. This is my code:

library(stringr)

bg <- c()
for (i in ncol(df)) {
  for (pt in length(patern)) {
    tot <- sum(str_count(df[i], patern[pt]))
    ng <- sum(str_count(df[i], paste0("NG-", patern[pt])))
    res <- round((ng / tot * 100), 1)
    bg <- c(bg, res)
  }
  if (bg[pt] >= res) {
    print(colnames(df[i]))
  }
}

I expect to see the column names Baz and Gaz, but I run into some problems.
First, I get an error:

Error in if (bg[pt] >= res) { : missing value where TRUE/FALSE needed

And second, these warnings:

Warning messages:
1: In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing
2: In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing

Perhaps there is a better or cleverer way?

CodePudding user response：

I am not sure whether this is what you are looking for, or how the question text relates to the patern vector. If you want to count the occurrences of NG-c in each column and express them as a proportion of the rows in that column (multiply by 100 for a percentage), you could use

library(dplyr)
library(stringr)

df %>% 
  summarise(across(everything(),
                   ~sum(str_count(.x, "NG-c"))/n()))

This returns

        Foo Bar       Baz Gaz
1 0.1666667   0 0.3333333 0.5
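
To cover the "compare" part of the question as well, the same idea can be applied to every entry of the patern vector, followed by a lookup of the column with the largest share. The following is only a minimal base R sketch under two assumptions: an entry counts as a match when it is exactly equal to "NG-" plus the pattern (which is true for the sample data, but differs from the substring counting done by str_count), and ng_share is a hypothetical helper name introduced here for illustration.

```r
# Example data, copied from the question
df <- data.frame(
  Foo = c("a", "c", "NG-c", "d", "e", "f"),
  Bar = c("b", "b", "c", "d", "e", "f"),
  Baz = c("a", "a", "c", "NG-c", "NG-c", "d"),
  Gaz = c("NG-c", "NG-c", "NG-c", "NG-a", "NG-a", "NG-a")
)
patern <- c("a", "c")

# ng_share (hypothetical helper): fraction of entries in each column
# that are exactly "NG-<p>". Assumes whole-cell matches, not substrings.
ng_share <- function(data, p) {
  vapply(data, function(col) mean(col == paste0("NG-", p)), numeric(1))
}

# One row per pattern, one column per data frame column
shares <- t(sapply(patern, function(p) ng_share(df, p)))
shares

# Name of the column with the largest share, for each pattern
best <- colnames(shares)[apply(shares, 1, which.max)]
names(best) <- patern
best
```

Note that which.max returns only the first maximum, so ties between columns would need separate handling.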

Data

df <- structure(list(Foo = c("a", "c", "NG-c", "d", "e", "f"), Bar = c("b", 
"b", "c", "d", "e", "f"), Baz = c("a", "a", "c", "NG-c", "NG-c", 
"d"), Gaz = c("NG-c", "NG-c", "NG-c", "NG-a", "NG-a", "NG-a")), class = "data.frame", row.names = c(NA, 
-6L))
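
As for the error and the warnings in the question, they can be traced to three things. for (i in ncol(df)) visits only the single value 4 instead of 1 to 4 (the same applies to length(patern)). df[i] is a one-column data frame rather than a vector, which is what the "not an atomic vector" warning is about. Finally, bg[pt] is NA because bg holds one value when pt is 2. Below is a minimal base R sketch of the loop with those points addressed. It assumes the intended metric is the number of "NG-" matches divided by the number of all matches of the pattern, as in the original code; fixed_count is a hypothetical helper written for this sketch, not a stringr function.

```r
# Example data, copied from the question
df <- data.frame(
  Foo = c("a", "c", "NG-c", "d", "e", "f"),
  Bar = c("b", "b", "c", "d", "e", "f"),
  Baz = c("a", "a", "c", "NG-c", "NG-c", "d"),
  Gaz = c("NG-c", "NG-c", "NG-c", "NG-a", "NG-a", "NG-a")
)
patern <- c("a", "c")

# fixed_count (hypothetical helper): total number of literal,
# non-overlapping occurrences of `needle` in a character vector
fixed_count <- function(x, needle) {
  hits <- gregexpr(needle, x, fixed = TRUE)
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

# Result matrix: one row per pattern, one column per data frame column
res <- matrix(NA_real_, nrow = length(patern), ncol = ncol(df),
              dimnames = list(patern, colnames(df)))

for (i in seq_len(ncol(df))) {      # 1, 2, ..., ncol(df)
  col <- df[[i]]                    # [[ ]] extracts the column as a vector
  for (pt in seq_along(patern)) {   # 1, 2, ..., length(patern)
    tot <- fixed_count(col, patern[pt])
    ng  <- fixed_count(col, paste0("NG-", patern[pt]))
    # Leave NA when the pattern never occurs, to avoid 0/0
    if (tot > 0) res[pt, i] <- round(ng / tot * 100, 1)
  }
}
res
```

With the sample data, the row for pattern c comes out as 50 for Foo, 0 for Bar, 66.7 for Baz, and 100 for Gaz, so Baz and Gaz are the two columns with the highest NG-c percentages under this metric.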