R noob here, working in tidyverse / RStudio.
I have a categorical / factor variable that I'd like to retain in a group_by / summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.
Is there a summary function I can use for this? mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.
Edit: example using a subset of the mtcars dataset:
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
   21     6   160   110   3.9  2.62  16.5     0     1     4 4
   21     6   160   110   3.9  2.88  17.0     0     1     4 4
 22.8     4   108    93  3.85  2.32  18.6     1     1     4 1
 21.4     6   258   110  3.08  3.22  19.4     1     0     3 1
 18.7     8   360   175  3.15  3.44  17.0     0     0     3 2
 18.1     6   225   105  2.76  3.46  20.2     1     0     3 1
 14.3     8   360   245  3.21  3.57  15.8     0     0     3 4
 24.4     4  147.    62  3.69  3.19    20     1     0     4 2
 22.8     4  141.    95  3.92  3.15  22.9     1     0     4 2
 19.2     6  168.   123  3.92  3.44  18.3     1     0     4 4
Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 2 with carb=1; similarly, among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.
So if I do:
data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))
where FUNC is the function I'm looking for, I should get:
  cyl modalcarb
<dbl> <fct>
    4 2
    6 4
    8 2    # there are multiple potential ways of handling multi-modal situations, but that's secondary here
Hope that makes sense!
CodePudding user response:
You could use the fmode function from the collapse package to calculate the mode. Here is a reproducible example using the mtcars dataset, where the cyl column is converted to a factor and used as the grouping variable:
library(dplyr)
library(collapse)

mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  # fmode() returns the most frequent value of am within each cyl group
  summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#> cyl mode
#> <fct> <dbl>
#> 1 4 1
#> 2 6 0
#> 3 8 0
Created on 2022-11-24 with reprex v2.0.2
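If adding another package is not an option, a small base R helper can play the role of FUNC from the question. This is only a sketch of one possible approach: modal_level is a hypothetical name (it does not come from any package), it assumes its input is a factor, and on ties it returns whichever tied level comes first in levels(x). The data are the first ten rows of mtcars, which match the subset shown in the question.

```r
library(dplyr)

# Hypothetical helper (not part of any package): most frequent level of a factor.
# Assumes x is a factor; ties go to the level that appears first in levels(x).
modal_level <- function(x) {
  counts <- table(x)  # counts per level; NA values are dropped by default
  factor(names(counts)[which.max(counts)], levels = levels(x))
}

# First ten rows of mtcars, with carb converted to a factor as in the question
dat <- head(mtcars, 10) %>%
  mutate(carb = factor(carb))

res <- dat %>%
  group_by(cyl) %>%
  summarise(modalcarb = modal_level(carb))

res
# modalcarb is "2" for cyl 4, "4" for cyl 6 and "2" for cyl 8
```

Because the helper returns a factor with the original levels, the summarised column stays a factor. The 8-cylinder group is a tie between carb=2 and carb=4 in this subset, so the result there depends entirely on the tie rule chosen.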
CodePudding user response:
We could use which.max after count:
library(dplyr)

# fake dataset
x <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  select(cyl)

x %>%
  count(cyl) %>%
  slice(which.max(n))
cyl n
<fct> <int>
1 8 14
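The snippet above returns the single most common value over the whole data set. To get one modal value per group, as the question asks, the same count / which.max idea can be applied within groups. The following is a minimal sketch under that assumption, using the first ten rows of mtcars (the subset from the question); the object names dat and modal_carb are made up for illustration.

```r
library(dplyr)

# First ten rows of mtcars, with carb converted to a factor as in the question
dat <- head(mtcars, 10) %>%
  mutate(carb = factor(carb))

modal_carb <- dat %>%
  count(cyl, carb) %>%       # one row per observed cyl / carb combination
  group_by(cyl) %>%
  slice(which.max(n)) %>%    # keep the most frequent carb; first row wins on ties
  ungroup()

modal_carb
# carb is "2" for cyl 4, "4" for cyl 6 and "2" for cyl 8
```

If all tied modes should be kept instead of just the first, slice_max(n, n = 1, with_ties = TRUE) could be used in place of slice(which.max(n)).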