I am working with the R Programming language.
Suppose I have the following data:
myFun <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
name = myFun(400)
variable = as.integer(abs(rnorm(400, 500,100)))
my_data = data.frame(name,variable)
I want to keep the top 5 rows (based on the value of "variable") and group (sum) everything else as "other". I thought of the following way to do this:
my_data <- my_data [order(-variable),]
my_data_top_5 = my_data[1:5,]
my_data_remainder = my_data[6:nrow(my_data),]
other_count = sum(my_data_remainder$variable)
other = data.frame( name = "other", variable = other_count)
final_result = rbind(my_data_top_5, other)
I think this worked - but is there a more efficient way to do this?
Thanks!
CodePudding user response:
In tidyverse
, arrange
the data based on the desc
ending order of 'variable', replace
the 'name' from 6th onwards to 'other' and do a group by sum
library(dplyr)
my_data %>%
arrange(desc(variable)) %>%
group_by(name = replace(name, 6:n(), "other")) %>%
summarise(variable = sum(variable, na.rm = TRUE), .groups = 'drop')
CodePudding user response:
Here is another approach:
library(dplyr)
my_data %>%
arrange(-variable) %>%
slice(6:n()) %>%
summarise(variable = sum(variable)) %>%
mutate(name = "other", .before=1) %>%
bind_rows(my_data) %>%
arrange(-variable) %>%
slice(1:6) %>%
arrange(variable)
name variable
1 JRZQF6858X 724
2 DYYVV2422L 734
3 QKQRX2862B 741
4 XBINQ6194M 776
5 DZMGX4300N 796
6 other 195240