Home > other >  Partially aggregate (sum) values that are below the first n observations in dplyr
Partially aggregate (sum) values that are below the first n observations in dplyr

Time:12-09

I have a dataframe like df:

id <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii", "jj")
value <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
df <- data.frame(id, value)

I want to aggregate the values that are rank below n-th position (in this case, n=5), such that, the new data looks like df_new:

id_new <- c("aa", "bb", "cc", "dd", "ee", "others") 
value_new <- c(10, 9, 8, 7, 6, 15)
df_new <- data.frame(id_new, value_new)

I have tried to combine top_n with summarise in dplyr but I do not find the way to aggregate a dataset partially.

Any idea?

CodePudding user response:

One way could be to utilize the fct_lump_n in forcats (part of tidyverse):

The idea is to treat id as a factor with x levels, fct_lump_n lumps all but n into the same factor. We could then use count to summarize all the others-factor-level into one. I.e.

library(forcats)
library(dplyr)

df |>
  mutate(id = fct_lump_n(as.factor(id), n = 5, w = value, other_level = "others")) |>
  count(id, wt = value, name = "value")

Or use a similar logic with dplyr alone: As we have no value-weighting in the first place, we'll need to arrange before recoding.

library(dplyr)

df |>
  arrange(desc(value)) |> 
  mutate(id = if_else(row_number() > 5, "others", id)) |>
  count(id, wt = value, name = "value")

Output:

      id value
1     aa    10
2     bb     9
3     cc     8
4     dd     7
5     ee     6
6 others    15

CodePudding user response:

I am sure that there are that there are more elegant ways, but you could just split the operation into two separat ones and then combine the resulting data frames. Something like this should work:

tmp1<-df %>% 
  slice_max(value, n=5)

tmp2<-df %>% 
  slice_min(value,n=nrow(.)-5) %>%
  dplyr::summarise(others=sum(value)) %>%
  t() %>% 
  enframe(name = "id", value = "value")

out_df<-rbind(tmp1,tmp2)

  • Related