Home > Software engineering >  Find most frequently occuring string per row and its percentage
Find most frequently occuring string per row and its percentage

Time:01-05

I have a df in which every row is a question to which there are up to 7 answers (columns), including few NA's. I want to find the most frequently given answer per question and its percentage (and to return NA when there are ties). Ultimately, I wish to find questions where there is a consensus of >70% in the answers and the respective answer.

My data looks like this:

dat <- data.frame(rbind(A=c("Dog", "Dog", "Cat", "Dog","Dog", "Dog", "Cat"),
                        B=c("Dog", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat"),
                        C=c("Cat", "Fox", "Fox", "Fish", "Dog", "Mouse", "Rat"),
                        D=c("Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse"),
                        E=c("Pigeon", "Pigeon", "Seagull", "Pigeon", "Seagull", "Seagull", "Pigeon"),
                        G=c("Fox", "Fox", "Fox", NA, "Dog", "Dog", "Dog")))

I know that I can find the most common string using which.max():

dat$answer <- apply(dat,1,function(x) names(which.max(table(x))))

However, I aim to get something like this (where I can keep only final answers with a consensus of >.7)

output <- data.frame(cbind(dat[, 1:7], 
            rbind(A=c("Dog", .71),
                  B=c("Cat", .86),
                  c=c("Fox", .28),
                  D=c("Mouse", 1),
                  E=c("Pigeon", .57),
                  G=c(NA, NA))))  
colnames(output) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "Answer", "Perc")

All help is greatly appreciated! Cheers.

CodePudding user response:

You can create a Perc variable in the apply function:

dat[c("answer", "Perc")] <- 
  t(apply(dat, 1, function(x){
  c(names(sort(-table(x))[1]), max(table(x))[1] / ncol(dat))}))

      X1     X2      X3     X4      X5      X6     X7 answer Perc
A    Dog    Dog     Cat    Dog     Dog     Dog    Cat    Dog 0.71
B    Dog    Cat     Cat    Cat     Cat     Cat    Cat    Cat 0.86
C    Cat    Fox     Fox   Fish     Dog   Mouse    Rat    Fox 0.29
D  Mouse  Mouse   Mouse  Mouse   Mouse   Mouse  Mouse  Mouse    1
E Pigeon Pigeon Seagull Pigeon Seagull Seagull Pigeon Pigeon 0.57
G    Fox    Fox     Fox   <NA>     Dog     Dog    Dog    Dog 0.43

And assign NA if Perc is below a threshold:

dat[dat$Perc < .7, ]$Perc <- NA

      X1     X2      X3     X4      X5      X6     X7 answer Perc
A    Dog    Dog     Cat    Dog     Dog     Dog    Cat    Dog 0.71
B    Dog    Cat     Cat    Cat     Cat     Cat    Cat    Cat 0.86
C    Cat    Fox     Fox   Fish     Dog   Mouse    Rat    Fox <NA>
D  Mouse  Mouse   Mouse  Mouse   Mouse   Mouse  Mouse  Mouse    1
E Pigeon Pigeon Seagull Pigeon Seagull Seagull Pigeon Pigeon <NA>
G    Fox    Fox     Fox   <NA>     Dog     Dog    Dog    Dog <NA>

CodePudding user response:

I've tried a tidyverse solution. Unfortunately, there is no super elegant way. Here is a solution that only returns the answers with consensus > 0.7:

dat <- data.frame(rbind(A=c("Dog", "Dog", "Cat", "Dog","Dog", "Dog", "Cat"),
                        B=c("Dog", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat"),
                        C=c("Cat", "Fox", "Fox", "Fish", "Dog", "Mouse", "Rat"),
                        D=c("Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse"),
                        E=c("Pigeon", "Pigeon", "Seagull", "Pigeon", "Seagull", "Seagull", "Pigeon"),
                        G=c("Fox", "Fox", "Fox", NA, "Dog", "Dog", "Dog")))

library(dplyr)
library(tidyr)

dat %>% 
  tibble::rownames_to_column(var = "question") %>% 
  pivot_longer(
    cols = -question,
    names_to = "col_name",
    values_to = "answer"
  ) %>% 
  group_by(question, answer) %>% 
  summarise(n = n()) %>% 
  mutate(Perc = n / sum(n)) %>% 
  filter(Perc > 0.7)
#> `summarise()` has grouped output by 'question'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 × 4
#> # Groups:   question [3]
#>   question answer     n  Perc
#>   <chr>    <chr>  <int> <dbl>
#> 1 A        Dog        5 0.714
#> 2 B        Cat        6 0.857
#> 3 D        Mouse      7 1

Created on 2023-01-04 by the reprex package (v1.0.0)

CodePudding user response:

Using base R

tbl1 <- table(data.frame(v1 = row.names(dat)[row(dat)], v2 = unlist(dat)))
ans <- colnames(tbl1)[max.col(tbl1)]
p1 <- proportions(tbl1, 1)
Perc <- do.call(pmax, as.data.frame.matrix(p1))
Perc[Perc < .7] <- NA
dat[c("answer", "Perc")] <- list(ans, Perc)

-output

> dat
      X1     X2      X3     X4      X5      X6     X7 answer      Perc
A    Dog    Dog     Cat    Dog     Dog     Dog    Cat    Dog 0.7142857
B    Dog    Cat     Cat    Cat     Cat     Cat    Cat    Cat 0.8571429
C    Cat    Fox     Fox   Fish     Dog   Mouse    Rat    Fox        NA
D  Mouse  Mouse   Mouse  Mouse   Mouse   Mouse  Mouse  Mouse 1.0000000
E Pigeon Pigeon Seagull Pigeon Seagull Seagull Pigeon Pigeon        NA
G    Fox    Fox     Fox   <NA>     Dog     Dog    Dog    Fox        NA
  • Related