I have a dataframe like df:
id <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii", "jj")
value <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
df <- data.frame(id, value)
I want to aggregate the values that are rank below n-th position (in this case, n=5), such that, the new data looks like df_new:
id_new <- c("aa", "bb", "cc", "dd", "ee", "others")
value_new <- c(10, 9, 8, 7, 6, 15)
df_new <- data.frame(id_new, value_new)
I have tried to combine top_n
with summarise
in dplyr but I do not find the way to aggregate a dataset partially.
Any idea?
CodePudding user response:
One way could be to utilize the fct_lump_n
in forcats
(part of tidyverse
):
The idea is to treat id as a factor with x levels, fct_lump_n
lumps all but n
into the same factor. We could then use count
to summarize all the others-factor-level into one. I.e.
library(forcats)
library(dplyr)
df |>
mutate(id = fct_lump_n(as.factor(id), n = 5, w = value, other_level = "others")) |>
count(id, wt = value, name = "value")
Or use a similar logic with dplyr
alone: As we have no value-weighting in the first place, we'll need to arrange before recoding.
library(dplyr)
df |>
arrange(desc(value)) |>
mutate(id = if_else(row_number() > 5, "others", id)) |>
count(id, wt = value, name = "value")
Output:
id value
1 aa 10
2 bb 9
3 cc 8
4 dd 7
5 ee 6
6 others 15
CodePudding user response:
I am sure that there are that there are more elegant ways, but you could just split the operation into two separat ones and then combine the resulting data frames. Something like this should work:
tmp1<-df %>%
slice_max(value, n=5)
tmp2<-df %>%
slice_min(value,n=nrow(.)-5) %>%
dplyr::summarise(others=sum(value)) %>%
t() %>%
enframe(name = "id", value = "value")
out_df<-rbind(tmp1,tmp2)