Home > Software engineering >  How to select top N values and group the rest of the remaining ones
How to select top N values and group the rest of the remaining ones

Time:12-11

How can I rank the first 4 group of my dataframe associated to the highest value in the count column and create a 5th group summing up the remaining groups and their associated values ?

What I did so far:

dummy_dataframe <- data.frame(group = c("A", "B", "A", "A", "C", "C", "D", "E", "F", "D","G")) 

df_aggregate <- aggregate(cbind(count = group) ~ group, 
                         data = dumy_dataframe, 
                         FUN = function(x){NROW(x)})

df_sliced <- df_aggregate %>%
       arrange(desc(count)) %>% 
      slice(1:4) 

With the code above I get a dataframe with the 4 groups associated to the highest value but how I could have a fith group summing up the value of the missing group (E, F and G) ? For instance something like this:

   group     count
1     A        3
2     B        1
3     C        2
4     D        2
5   others     3

CodePudding user response:

You can run some tidyverse operations directly on your original dataframe:

library(tidyverse)
dummy_dataframe %>%
  count(group) %>%
  mutate(id = if_else(row_number() < 5, 1L, 2L)) %>%
  group_by(id) %>%
  arrange(id, -n) %>%
  mutate(group = if_else(id == 2L, "others", group),
         n = if_else(group == "others", sum(n), n)) %>%
  ungroup() %>%
  distinct() %>%
  select(-id)

which gives:

# A tibble: 5 x 2
  group      n
  <chr>  <int>
1 A          3
2 C          2
3 D          2
4 B          1
5 others     3

CodePudding user response:

Short & sweet:

result<-rbind(df_aggregate[order(df_aggregate$count,decreasing = T),][c(1:4),],c("rest",sum(df_aggregate[order(df_aggregate$count,decreasing = T),][c(5:nrow(df_aggregate)),2])))

CodePudding user response:

I would go completely with the dplyr package and its possibility:

library(dplyr)

dummy_dataframe <- data.frame(group = c("A", "B", "A", "A", "C", "C", "D", "E", "F", "D","G")) 

df_aggregate <- dummy_dataframe %>%
  group_by(group
           ) %>%
  summarise(count = n()
           ) %>%
  arrange(desc(count)
           ) 

df_top_4_groups <- df_aggregate %>%
  slice(1:4)

df_others <- df_aggregate %>%
  anti_join(df_top_4_groups, by = "group"
         ) %>%
  mutate(group = "others"
         ) %>%
  group_by(group
         ) %>%
  summarise(count = n()
         )

df_finale <- df_top_4_groups %>%
  bind_rows(df_others)

df_finale
A tibble: 5 x 2
  group  count
  <chr>  <int>
1 A          3
2 C          2
3 D          2
4 B          1
5 others     3

Your use of aggregate isn't wrong - quite cool ;) - but i think using the pipe from top to bottom makes it more readable.

  • Related