R: Find the most frequent factor level within each group separately for each column

Time:12-20

This is a basic question but I can't seem to figure it out. I have a data frame in R with multiple categorical variables.

df = data.frame(
  population = c(rep("A", 3), rep("B", 5), rep("C", 2)),
  var1 = c(rep("apple", 5), rep("banana", 5)),
  var2 = c(rep("blue", 2), rep("red", 4), rep("green", 4)),
  var3 = c(rep("pizza", 7), rep("soup", 3))
)

I want to group the data by population, then within each population find the most common value (the most frequent factor level) of var1, var2, and var3. Each variable should be handled independently: I am not looking for the most common combination of values across var1, var2, and var3.

So far I am using the following approach:

df %>%
  group_by(population) %>%
  count(population, var1, var2, var3) %>%
  slice_max(order_by = n, n = 1) %>%
  select(-n)

But it returns the following:

population var1   var2  var3 
A          apple  blue  pizza
B          apple  red   pizza
C          banana green soup 

These results give the most common combination of values across var1, var2, and var3. What I want instead is the most common value of var1 (regardless of var2 and var3), the most common value of var2 (regardless of var1 and var3), and the most common value of var3 (regardless of var1 and var2). The result I want is:

population var1   var2  var3 
A          apple  blue  pizza
B          banana red   pizza
C          banana green soup 
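
To make the difference concrete, here is a minimal base R sketch for population B only. It is an illustration, not part of the original question: the names `b`, `joint_top`, and `marginal_top` are made up here, and `names(which.max(table(x)))` is just one of several ways to pick the most frequent value.

```r
# Same data as in the question.
df <- data.frame(
  population = c(rep("A", 3), rep("B", 5), rep("C", 2)),
  var1 = c(rep("apple", 5), rep("banana", 5)),
  var2 = c(rep("blue", 2), rep("red", 4), rep("green", 4)),
  var3 = c(rep("pizza", 7), rep("soup", 3))
)

# Rows belonging to population B (hypothetical helper object).
b <- df[df$population == "B", ]

# Joint view: treat each (var1, var2, var3) combination as one category
# and find the most frequent combination.
combo <- paste(b$var1, b$var2, b$var3)
joint_top <- names(which.max(table(combo)))

# Marginal view: find the most frequent value of each column on its own.
marginal_top <- sapply(
  b[c("var1", "var2", "var3")],
  function(x) names(which.max(table(x)))
)

print(joint_top)     # "apple red pizza"
print(marginal_top)  # var1 = "banana", var2 = "red", var3 = "pizza"
```

In population B, "apple red pizza" is the only combination that occurs twice, which is why the count-based approach returns apple, even though banana is the more frequent value of var1 on its own.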

CodePudding user response:

You could do:

library(dplyr)

df %>%
  group_by(population) %>%
  # tabulate each column within the group, sort by count, and take the name of the largest
  summarize(across(everything(), ~names(rev(sort(table(.x))))[1]))
#> # A tibble: 3 x 4
#>   population var1   var2  var3 
#>   <chr>      <chr>  <chr> <chr>
#> 1 A          apple  blue  pizza
#> 2 B          banana red   pizza
#> 3 C          banana green soup 
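
If you would rather not depend on dplyr, the same per-column idea can probably be expressed with base R's aggregate(). This is only a sketch: the helper name `most_common` is made up for this example, and it resolves ties by taking the alphabetically first value, which may differ from the tie-breaking of the pipeline above.

```r
# Same data as in the question.
df <- data.frame(
  population = c(rep("A", 3), rep("B", 5), rep("C", 2)),
  var1 = c(rep("apple", 5), rep("banana", 5)),
  var2 = c(rep("blue", 2), rep("red", 4), rep("green", 4)),
  var3 = c(rep("pizza", 7), rep("soup", 3))
)

# Hypothetical helper: name of the most frequent value in a vector.
most_common <- function(x) names(which.max(table(x)))

# aggregate() applies the helper to each column separately within each group.
res <- aggregate(
  df[c("var1", "var2", "var3")],
  by = list(population = df$population),
  FUN = most_common
)

print(res)
```

Because aggregate() calls the function once per column per group, each variable is summarised independently, which is the behaviour the question asks for.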

CodePudding user response:

You can create your own mode function, and then use across:

library(dplyr)
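# This function was taken from https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode
# Note: defining it as `mode` masks base R's mode() in the current session.
mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # most frequent value; ties go to the first one seen
}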

df %>% 
  group_by(population) %>% 
  summarise(across(var1:var3, mode))

Output:

# A tibble: 3 × 4
  population var1   var2  var3 
  <chr>      <chr>  <chr> <chr>
1 A          apple  blue  pizza
2 B          banana red   pizza
3 C          banana green soup 
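
One thing to keep in mind is how ties are resolved, since the example data happens to have none. The sketch below compares two ways of picking a mode on a small tied vector. The names `first_mode` and `table_mode` are made up for this illustration; `first_mode` has the same body as the function above, renamed so it does not mask base R's mode().

```r
# Same logic as the unique/tabulate function above: ties go to the value
# that appears first in the data.
first_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# table() orders its categories alphabetically, so with which.max()
# ties go to the alphabetically first value.
table_mode <- function(x) names(which.max(table(x)))

# "red" and "green" both occur twice.
x <- c("red", "green", "green", "red")

print(first_mode(x))  # "red"   (first value seen)
print(table_mode(x))  # "green" (alphabetically first)
```

Neither rule is more correct than the other. If ties matter for your data, decide explicitly which value should win, or return all tied values instead of a single one.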

Alternatively, you can use a ready-made mode function from a package, for example fmode from collapse:

library(collapse)
df %>% 
  group_by(population) %>% 
  summarise(across(var1:var3, fmode))