Find, compare, and count pattern matches in each column of a data frame

Time:10-28

My input:

df <- data.frame("Foo"=c("a","c","NG-c","d","e","f"), "Bar"=c("b","b","c","d","e","f"), "Baz" = c("a","a","c","NG-c","NG-c","d"),
                 "Gaz" = c("NG-c","NG-c","NG-c", "NG-a","NG-a","NG-a"))
patern <- c("a","c")   

The problem looks a little complicated. I am trying to find, count, and compare pattern matches in each column of a data frame. For example, I want to find all matches of NG-c and output which column has the highest percentage of NG-c out of the total in that column. This is my code:

bg <- c()
for (i in ncol(df)) {
  for (pt in length(patern)) {
    tot <- sum(str_count(df[i],patern[pt]))
    ng <- sum(str_count(df[i],paste0("NG-",patern[pt] )))
    res <- round((ng/tot*100),1)
    bg <- c(bg,res)
  }
  if (bg[pt] >= res) {
    print(colnames(df[i]))
  }
}

So I expect to see the Baz and Gaz column names, but I ran into some trouble.
First, I get an error:

Error in if (bg[pt] >= res) { : missing value where TRUE/FALSE needed

And second, these warnings:

Warning messages:
1: In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing
2: In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing

Perhaps there is a better or cleverer way?

CodePudding user response:

Not sure if this is what you are looking for, or how your question text relates to the patern vector. If you want to count the occurrences of NG-c per column and calculate the fraction of NG-c entries per column, you could use

library(dplyr)
library(stringr)

df %>% 
  summarise(across(everything(),
                   ~sum(str_count(.x, "NG-c"))/n()))

This returns

        Foo Bar       Baz Gaz
1 0.1666667   0 0.3333333 0.5
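
To also report which column has the largest share, a base R sketch along the same lines might look like the following. It swaps str_count for grepl with fixed = TRUE, which tests whether a cell contains the string rather than counting matches inside it; that is an assumption that holds for this data, where each cell holds at most one match. The names share and top_col are made up for illustration.

```r
df <- data.frame(
  Foo = c("a", "c", "NG-c", "d", "e", "f"),
  Bar = c("b", "b", "c", "d", "e", "f"),
  Baz = c("a", "a", "c", "NG-c", "NG-c", "d"),
  Gaz = c("NG-c", "NG-c", "NG-c", "NG-a", "NG-a", "NG-a")
)

# Fraction of cells in each column that contain the literal string "NG-c"
share <- sapply(df, function(col) mean(grepl("NG-c", col, fixed = TRUE)))

# Name of the column with the largest fraction (the first one wins on ties)
top_col <- names(which.max(share))

print(share)
print(top_col)
```

On the example data this picks Gaz, matching the 0.5 in the output above.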

Data

df <- structure(list(Foo = c("a", "c", "NG-c", "d", "e", "f"), Bar = c("b", 
"b", "c", "d", "e", "f"), Baz = c("a", "a", "c", "NG-c", "NG-c", 
"d"), Gaz = c("NG-c", "NG-c", "NG-c", "NG-a", "NG-a", "NG-a")), class = "data.frame", row.names = c(NA, 
-6L))
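
On the error and warnings in the question: for (i in ncol(df)) loops over the single value 4, not over 1 to 4, and for (pt in length(patern)) loops over the single value 2. So bg ends up with one element, bg[2] is NA, and the if fails with "missing value where TRUE/FALSE needed". The warnings come from df[i], which is a one-column data frame rather than a character vector; df[[i]] extracts the vector. Below is a minimal sketch of the loop with those points changed, assuming the goal is the percentage of matches of each pattern that carry the NG- prefix. The matrix pct is a made-up name, and grepl with fixed = TRUE stands in for str_count so that the block needs no packages.

```r
df <- data.frame(
  Foo = c("a", "c", "NG-c", "d", "e", "f"),
  Bar = c("b", "b", "c", "d", "e", "f"),
  Baz = c("a", "a", "c", "NG-c", "NG-c", "d"),
  Gaz = c("NG-c", "NG-c", "NG-c", "NG-a", "NG-a", "NG-a")
)
patern <- c("a", "c")

# One row per pattern, one column per data frame column
pct <- matrix(NA_real_, nrow = length(patern), ncol = ncol(df),
              dimnames = list(patern, names(df)))

for (i in seq_len(ncol(df))) {        # 1, 2, ..., ncol(df) instead of just ncol(df)
  col <- df[[i]]                      # [[ ]] gives an atomic vector; df[i] is a data frame
  for (pt in seq_along(patern)) {     # 1, 2, ..., length(patern)
    tot <- sum(grepl(patern[pt], col, fixed = TRUE))                # cells containing the pattern at all
    ng  <- sum(grepl(paste0("NG-", patern[pt]), col, fixed = TRUE)) # cells containing the NG- form
    pct[pt, i] <- round(ng / tot * 100, 1)                          # NaN when tot is 0 (0/0)
  }
}

print(pct)
```

For the pattern "c" this gives 50, 0, 66.7, and 100 for Foo, Bar, Baz, and Gaz. The entry for "a" in Bar is NaN because that column contains no "a" at all, so any later comparison on pct needs to handle NaN (for example with is.nan or na.rm = TRUE).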