This is a basic question but I can't seem to figure it out. I have a data frame in R with multiple categorical variables.
df <- data.frame(
  population = c(rep("A", 3), rep("B", 5), rep("C", 2)),
  var1 = c(rep("apple", 5), rep("banana", 5)),
  var2 = c(rep("blue", 2), rep("red", 4), rep("green", 4)),
  var3 = c(rep("pizza", 7), rep("soup", 3))
)
I want to group the data by population, then within each population find the most common value (the most common factor level) for var1, var2, and var3. I want to do this independently for var1, var2, and var3, not for the set of values across var1, var2, and var3.
So far I am using the following approach:
df %>%
  group_by(population) %>%
  count(population, var1, var2, var3) %>%
  slice_max(order_by = n, n = 1) %>%
  select(-n)
But it returns the following:
population var1   var2  var3
A          apple  blue  pizza
B          apple  red   pizza
C          banana green soup
These results are for the most common set of values across var1, var2, var3. But what I want is the most common value within var1 (independently of var2 and var3), the most common value within var2 (independently of var1 and var3), and the most common value within var3 (independently of var1 and var2). The result I want should be:
population var1   var2  var3
A          apple  blue  pizza
B          banana red   pizza
C          banana green soup
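To make the difference concrete, here is a small base R check on population B only. It is just an illustration of joint versus per-variable counts, not a proposed solution; the names `b`, `joint`, and `marginal` are made up for this sketch.

```r
# Same example data as in the question.
df <- data.frame(
  population = c(rep("A", 3), rep("B", 5), rep("C", 2)),
  var1 = c(rep("apple", 5), rep("banana", 5)),
  var2 = c(rep("blue", 2), rep("red", 4), rep("green", 4)),
  var3 = c(rep("pizza", 7), rep("soup", 3))
)

# Rows belonging to population B.
b <- df[df$population == "B", ]

# Joint counts: every distinct (var1, var2, var3) combination is one category.
joint <- table(paste(b$var1, b$var2, b$var3))
names(joint)[which.max(joint)]
#> [1] "apple red pizza"

# Per-variable counts: var1 on its own, ignoring var2 and var3.
marginal <- table(b$var1)
names(marginal)[which.max(marginal)]
#> [1] "banana"
```

The combination apple/red/pizza occurs twice in B, more than any other combination, yet banana is the most frequent value of var1 taken alone (3 of 5 rows).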
CodePudding user response:
You could do:
library(dplyr)

df %>%
  group_by(population) %>%
  # For each non-grouping column: tabulate, sort by count, take the top label
  summarize(across(everything(), ~names(rev(sort(table(.x))))[1]))
#> # A tibble: 3 x 4
#>   population var1   var2  var3
#>   <chr>      <chr>  <chr> <chr>
#> 1 A          apple  blue  pizza
#> 2 B          banana red   pizza
#> 3 C          banana green soup
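If the one-liner inside across() is hard to read, it may help to unpack it step by step on a single group. This sketch uses the var1 values of population B from the question; the intermediate variable names are only for illustration.

```r
# var1 values for population B in the example data
x <- c("apple", "apple", "banana", "banana", "banana")

counts <- table(x)            # frequency of each value: apple 2, banana 3
ascending <- sort(counts)     # sorted by count, smallest first
descending <- rev(ascending)  # reversed, so the largest count comes first
top <- names(descending)[1]   # label of the most frequent value
top
#> [1] "banana"
```

Note that the result is always a character string, because it is taken from the names of the table.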
CodePudding user response:
You can create your own mode function, and then use across:
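library(dplyr)

# This function was taken from https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode
# Note: defining it as `mode` masks base R's mode() in the global environment.
mode <- function(x) {
  ux <- unique(x)                         # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]   # the value with the highest count
}

df %>%
  group_by(population) %>%
  summarise(across(var1:var3, mode))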
Output:
# A tibble: 3 × 4
population var1 var2 var3
<chr> <chr> <chr> <chr>
1 A apple blue pizza
2 B banana red pizza
3 C banana green soup
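One thing to be aware of: the example data has no ties, but the unique()/tabulate() helper and the table()-based one-liner from the other answer can disagree when two values are equally frequent, and they also differ in the type they return. A small sketch, with the helpers renamed to the hypothetical names `mode_value` and `table_mode` so that base R's mode() is not masked:

```r
# Same logic as the helper above, under a different (hypothetical) name.
mode_value <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Same logic as the table()-based one-liner, wrapped in a function.
table_mode <- function(x) names(rev(sort(table(x))))[1]

tied <- c("a", "b", "b", "a")   # "a" and "b" both occur twice

mode_value(tied)   # which.max() picks the first maximum: the value seen first
#> [1] "a"
table_mode(tied)   # table() orders labels alphabetically; rev() puts the last one first
#> [1] "b"

# Return type: mode_value() keeps the input type, table_mode() returns a string.
mode_value(c(3, 3, 1))
#> [1] 3
table_mode(c(3, 3, 1))
#> [1] "3"
```

If ties matter for your data, it is worth deciding explicitly which value should win rather than relying on either default.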
You can also take one from a package, here collapse:
library(collapse)

df %>%
  group_by(population) %>%
  summarise(across(var1:var3, fmode))