How to return the predominant category using dplyr::mutate?

Time:08-08

I have this data:

set.seed(51)
df_1 <- data.frame(
  nomes = LETTERS[1:100], 
  filtro1 = sample(x = c("sim", "não"), size = 100, replace = TRUE),
  filtro2 = sample(x = c("sim", "não"), size = 100, replace = TRUE), 
  genero = sample(x = c("masculino", "feminino"), size = 100, replace = TRUE), 
  groups = sample(x = 1:3, size = 100, replace = TRUE)
)

And this code:

library(dplyr)

df_1 %>% 
  group_by(groups, genero) %>% 
  summarise(count = n()) %>% 
  mutate(percent = count/sum(count)) %>% 
  filter(count == max(count))

The result is:

# Groups:   groups [3]
  groups genero    count percent
   <int> <chr>     <int>   <dbl>
1      1 feminino     16   0.533
2      2 masculino    19   0.633
3      3 masculino    21   0.525

I would like these values to be recycled with mutate, that is, for the maximum value of each group to be repeated on every row of that group. I tried:

df_1 %>% 
  group_by(groups, genero) %>% 
  mutate(count = n()) %>% # replace summarise by mutate
  mutate(percent = count/sum(count)) %>% 
  filter(count == max(count))

This doesn't work.

I would like the values to repeat down the new column created by mutate, like this:

0.533

0.633

0.525

0.533

0.525

0.633

0.525

...

CodePudding user response:

  1. summarise() automatically drops the last grouping level. mutate() doesn’t do this, so you have to do so manually with a second group_by().
  2. Because you still have multiple rows per group after mutate(), sum(count) won’t give you what you want (the overall n per group). Instead, use another call to n().
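
Both points can be checked on a small scale. The sketch below uses a hypothetical toy data frame (`toy` is not part of the question's data) and `group_vars()` to show which grouping survives each verb, and what `sum(count)` actually computes after `mutate()`:

```r
library(dplyr)

# Hypothetical toy data: two groups, two categories
toy <- data.frame(
  groups = c(1, 1, 1, 2, 2),
  genero = c("feminino", "feminino", "masculino", "masculino", "masculino")
)

# summarise() peels off the last grouping level,
# so the result is grouped by `groups` only
after_summarise <- toy %>%
  group_by(groups, genero) %>%
  summarise(count = n(), .groups = "drop_last")
group_vars(after_summarise)

# mutate() keeps both grouping levels, so sum(count) adds up `count`
# within each groups/genero cell and percent collapses to 1 / count
after_mutate <- toy %>%
  group_by(groups, genero) %>%
  mutate(count = n(), percent = count / sum(count))
group_vars(after_mutate)
after_mutate$percent
```

Applied to df_1, the corrected pipeline is:
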
library(dplyr)

df_1 %>% 
  group_by(groups, genero) %>% 
  mutate(count = n()) %>%
  group_by(groups) %>%
  mutate(percent = count/n()) %>% 
  filter(count == max(count))

Output:

# A tibble: 58 × 7
# Groups:   groups [3]
   nomes filtro1 filtro2 genero   groups count percent
   <chr> <chr>   <chr>   <chr>     <int> <int>   <dbl>
 1 A     sim     sim     feminino      1    16   0.593
 2 C     sim     sim     feminino      1    16   0.593
 3 E     não     não     feminino      1    16   0.593
 4 H     sim     não     feminino      2    20   0.541
 5 J     não     não     feminino      3    22   0.611
 6 K     não     não     feminino      2    20   0.541
 7 M     não     não     feminino      1    16   0.593
 8 N     não     sim     feminino      1    16   0.593
 9 P     não     não     feminino      2    20   0.541
10 R     não     não     feminino      1    16   0.593
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
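
Note that the final filter() keeps only the rows belonging to the predominant category. If the goal is instead to keep every row and repeat the predominant share down the column, as the question's expected output suggests, one possible variant is to drop the filter() and take max(count) within each group. This is a minimal sketch on a hypothetical toy data frame; the `predominant` column and the `toy` and `result` names are assumptions added for illustration:

```r
library(dplyr)

# Hypothetical toy data standing in for df_1
toy <- data.frame(
  groups = c(1, 1, 1, 2, 2),
  genero = c("feminino", "feminino", "masculino", "masculino", "masculino")
)

result <- toy %>%
  group_by(groups, genero) %>%
  mutate(count = n()) %>%                 # rows per groups/genero cell
  group_by(groups) %>%                    # regroup by `groups` only
  mutate(
    predominant = genero[which.max(count)],  # label of the largest category
    percent = max(count) / n()               # its share within the group
  ) %>%
  ungroup()

result
```

All five rows are kept, and each row carries the label and share of its group's largest category. In the case of a tie, which.max() picks the first category it encounters.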