Dplyr: How can i calculate proportional means?-CodePudding

Say I have a dataset that looks like this:

> dput(df)
structure(list(Fruit = c("Apple", "Apple", "Apple", "Apple", 
"Apple", "Apple", "Apple", "Apple", "Apple", "Apple", "Orange", 
"Banana", "Banana"), Status = c("Eaten", "Eaten", "Eaten", "Trashed", 
"Eaten", "Trashed", "Eaten", "Eaten", "Trashed", "Eaten", "Eaten", 
"Trashed", "Trashed")), class = "data.frame", row.names = c(NA, 
-13L))

By just looking at it, we see that Apples have the highest count of Trashed, but this is because it also appears most frequently. On the other hand, we see that Banana has 100% occurrences of it being Trashed.

Which function would allow me to see which fruit type is proportionally trashed the most? For example, the output should be: Banana-100% trashed Orange-0% trashed Apple-30% trashed

I would also like to create a visualization for this as well.

Thanks!

CodePudding user response：

The bare-bones base-r solution is tapply for the calculation, and barplot for the plot:

out <- tapply(df$Status == "Trashed", df$Fruit, mean) * 100
barplot(out)

CodePudding user response：

We could convert the column to logical and do a group by summarise

library(dplyr)
df %>% 
   group_by(Fruit) %>%
   summarise(TrashPerc = 100 * mean(Status == "Trashed"))

-output

# A tibble: 3 × 2
  Fruit  TrashPerc
  <chr>      <dbl>
1 Apple         30
2 Banana       100
3 Orange         0

If we need to get bar plot

library(ggplot2)
df %>% 
    group_by(Fruit) %>%
    summarise(TrashPerc = 100 * mean(Status == "Trashed")) %>% 
    ggplot(aes(x = Fruit, y = TrashPerc))   
      geom_col()

CodePudding user response：

One way to solve your problem using the built-in functions table and proportions to get the relative frequency, and barplot for the plot.

barplot(proportions(table(df), "Fruit")[, "Trashed"] * 100)