Assign most common value of factor variable with summarize in R

Time:11-25

R noob here, working in tidyverse / RStudio.

I have a categorical / factor variable that I'd like to retain in a group_by/summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.

Is there a summary function I can use for this?

mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.

Edit: example using subset of mtcars dataset:

mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear carb 
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
21       6  160    110  3.9   2.62  16.5     0     1     4 4    
21       6  160    110  3.9   2.88  17.0     0     1     4 4    
22.8     4  108     93  3.85  2.32  18.6     1     1     4 1    
21.4     6  258    110  3.08  3.22  19.4     1     0     3 1    
18.7     8  360    175  3.15  3.44  17.0     0     0     3 2    
18.1     6  225    105  2.76  3.46  20.2     1     0     3 1    
14.3     8  360    245  3.21  3.57  15.8     0     0     3 4    
24.4     4  147.    62  3.69  3.19  20       1     0     4 2    
22.8     4  141.    95  3.92  3.15  22.9     1     0     4 2    
19.2     6  168.   123  3.92  3.44  18.3     1     0     4 4

Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 2 with carb=1; similarly, among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.

So if I do:

data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))

where FUNC is the function I'm looking for, I should get:

cyl   modalcarb
<dbl> <fct>
4     2
6     4
8     2  # there are multiple potential ways of handling multi-modal situations, but that's secondary here

Hope that makes sense!

CodePudding user response:

You could use fmode() from the collapse package to calculate the mode. Here is a reproducible example using the mtcars dataset, where cyl is converted to a factor and used as the grouping variable, and the mode of am is computed within each group:

library(dplyr)
library(collapse)

mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#>   cyl    mode
#>   <fct> <dbl>
#> 1 4         1
#> 2 6         0
#> 3 8         0

Created on 2022-11-24 with reprex v2.0.2
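
The example above takes the mode of the numeric column am. For the factor column carb from the question, one possible sketch is a small helper built on base R's table() and which.max(). The name modal_level is hypothetical (it is not part of any package). It assumes every group has at least one non-missing value. Ties go to whichever level comes first in levels(x), which is only one of several reasonable choices.

```r
library(dplyr)

# Hypothetical helper (not from any package): returns the most frequent
# level of a factor, as a factor with the same levels.
# Ties go to the level that comes first in levels(x); NAs are ignored
# because table() drops them by default.
modal_level <- function(x) {
  tab <- table(x)
  factor(names(tab)[which.max(tab)], levels = levels(x))
}

# First ten rows of mtcars, with carb converted to a factor as in the question
data <- head(mtcars, 10) %>%
  mutate(carb = factor(carb))

result <- data %>%
  group_by(cyl) %>%
  summarise(modalcarb = modal_level(carb))

result
```

On this subset the helper should pick carb 2 for 4-cylinder cars and carb 4 for 6-cylinder cars. The 8-cylinder group is a tie and gets the first tied level.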

CodePudding user response:

We could use which.max() after count(). Note that this returns the most common level across the whole dataset, not within each group:

library(dplyr)

# example dataset: mtcars with cyl as a factor
x <- mtcars %>% 
  mutate(cyl = factor(cyl)) %>% 
  select(cyl) 

x %>% 
  count(cyl) %>% 
  slice(which.max(n))
#>   cyl       n
#>   <fct> <int>
#> 1 8        14
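
To get the most common value within each group, as the question asks, the same counting idea can be applied per group. The following is a minimal sketch, assuming dplyr 1.0.0 or later for slice_max(). The column name freq and the output name modalcarb are choices made for this illustration. with_ties = FALSE keeps a single row per group even when counts tie, which is just one way of handling multi-modal groups.

```r
library(dplyr)

# First ten rows of mtcars, with carb converted to a factor as in the question
data <- head(mtcars, 10) %>%
  mutate(carb = factor(carb))

result <- data %>%
  count(cyl, carb, name = "freq") %>%              # one row per cyl/carb combination
  group_by(cyl) %>%
  slice_max(freq, n = 1, with_ties = FALSE) %>%    # keep the most frequent carb per cyl
  ungroup() %>%
  select(cyl, modalcarb = carb)

result
```

Setting with_ties = TRUE instead would return every tied level for a group, so multi-modal groups would produce more than one row.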