I have a df in which every row is a question to which there are up to 7 answers (columns), including few NA's. I want to find the most frequently given answer per question and its percentage (and to return NA when there are ties). Ultimately, I wish to find questions where there is a consensus of >70% in the answers and the respective answer.
My data looks like this:
dat <- data.frame(rbind(A=c("Dog", "Dog", "Cat", "Dog","Dog", "Dog", "Cat"),
B=c("Dog", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat"),
C=c("Cat", "Fox", "Fox", "Fish", "Dog", "Mouse", "Rat"),
D=c("Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse"),
E=c("Pigeon", "Pigeon", "Seagull", "Pigeon", "Seagull", "Seagull", "Pigeon"),
G=c("Fox", "Fox", "Fox", NA, "Dog", "Dog", "Dog")))
I know that I can find the most common string using which.max():
dat$answer <- apply(dat,1,function(x) names(which.max(table(x))))
However, I aim to get something like this (where I can keep only final answers with a consensus of >.7)
output <- data.frame(cbind(dat[, 1:7],
rbind(A=c("Dog", .71),
B=c("Cat", .86),
c=c("Fox", .28),
D=c("Mouse", 1),
E=c("Pigeon", .57),
G=c(NA, NA))))
colnames(output) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "Answer", "Perc")
All help is greatly appreciated! Cheers.
CodePudding user response:
You can create a Perc variable in the apply
function:
dat[c("answer", "Perc")] <-
t(apply(dat, 1, function(x){
c(names(sort(-table(x))[1]), max(table(x))[1] / ncol(dat))}))
X1 X2 X3 X4 X5 X6 X7 answer Perc
A Dog Dog Cat Dog Dog Dog Cat Dog 0.71
B Dog Cat Cat Cat Cat Cat Cat Cat 0.86
C Cat Fox Fox Fish Dog Mouse Rat Fox 0.29
D Mouse Mouse Mouse Mouse Mouse Mouse Mouse Mouse 1
E Pigeon Pigeon Seagull Pigeon Seagull Seagull Pigeon Pigeon 0.57
G Fox Fox Fox <NA> Dog Dog Dog Dog 0.43
And assign NA if Perc
is below a threshold:
dat[dat$Perc < .7, ]$Perc <- NA
X1 X2 X3 X4 X5 X6 X7 answer Perc
A Dog Dog Cat Dog Dog Dog Cat Dog 0.71
B Dog Cat Cat Cat Cat Cat Cat Cat 0.86
C Cat Fox Fox Fish Dog Mouse Rat Fox <NA>
D Mouse Mouse Mouse Mouse Mouse Mouse Mouse Mouse 1
E Pigeon Pigeon Seagull Pigeon Seagull Seagull Pigeon Pigeon <NA>
G Fox Fox Fox <NA> Dog Dog Dog Dog <NA>
CodePudding user response:
I've tried a tidyverse solution. Unfortunately, there is no super elegant way. Here is a solution that only returns the answers with consensus > 0.7:
dat <- data.frame(rbind(A=c("Dog", "Dog", "Cat", "Dog","Dog", "Dog", "Cat"),
B=c("Dog", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat"),
C=c("Cat", "Fox", "Fox", "Fish", "Dog", "Mouse", "Rat"),
D=c("Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse", "Mouse"),
E=c("Pigeon", "Pigeon", "Seagull", "Pigeon", "Seagull", "Seagull", "Pigeon"),
G=c("Fox", "Fox", "Fox", NA, "Dog", "Dog", "Dog")))
library(dplyr)
library(tidyr)
dat %>%
tibble::rownames_to_column(var = "question") %>%
pivot_longer(
cols = -question,
names_to = "col_name",
values_to = "answer"
) %>%
group_by(question, answer) %>%
summarise(n = n()) %>%
mutate(Perc = n / sum(n)) %>%
filter(Perc > 0.7)
#> `summarise()` has grouped output by 'question'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 × 4
#> # Groups: question [3]
#> question answer n Perc
#> <chr> <chr> <int> <dbl>
#> 1 A Dog 5 0.714
#> 2 B Cat 6 0.857
#> 3 D Mouse 7 1
Created on 2023-01-04 by the reprex package (v1.0.0)
CodePudding user response:
Using base R
tbl1 <- table(data.frame(v1 = row.names(dat)[row(dat)], v2 = unlist(dat)))
ans <- colnames(tbl1)[max.col(tbl1)]
p1 <- proportions(tbl1, 1)
Perc <- do.call(pmax, as.data.frame.matrix(p1))
Perc[Perc < .7] <- NA
dat[c("answer", "Perc")] <- list(ans, Perc)
-output
> dat
X1 X2 X3 X4 X5 X6 X7 answer Perc
A Dog Dog Cat Dog Dog Dog Cat Dog 0.7142857
B Dog Cat Cat Cat Cat Cat Cat Cat 0.8571429
C Cat Fox Fox Fish Dog Mouse Rat Fox NA
D Mouse Mouse Mouse Mouse Mouse Mouse Mouse Mouse 1.0000000
E Pigeon Pigeon Seagull Pigeon Seagull Seagull Pigeon Pigeon NA
G Fox Fox Fox <NA> Dog Dog Dog Fox NA