In the following dataframe I would like to identify for every year the two observations by "id" with highest variable "op". Then aggregate those two highest observations of the variables op, tr, cp. I would appreciate an answer with dplyr. My original dataframe has thousands of hundreds of observations so I need something that I can adjust if I want to select the 1000 highest "op" observations by year.
Data:
year id op tr cp
1 1984 1 10 10 10
2 1985 1 20 20 20
3 1986 1 30 30 30
4 1987 1 40 40 40
5 1988 1 50 50 50
6 1985 2 15 15 15
7 1986 2 17 17 17
8 1987 2 18 18 18
9 1988 2 19 19 19
10 1985 3 20 20 20
11 1986 3 22 22 22
12 1986 4 10 10 10
13 1987 4 20 20 20
14 1988 4 40 40 40
Expected output:
year2 op2 tr2 cp2
1 1984 10 10 10
2 1985 40 40 40
3 1986 52 52 52
4 1987 60 60 60
5 1988 90 90 90
So in 1984 highestop aggregated by id=1, in the second id=1 and 3, in 1986 id= 1 and 3, in 1987 id= 1 and 4, 1988 id = 1 and 4.
I would like to avoid using a function but not so sure if that's possible. A well-functioning function would be could.
CodePudding user response:
You could group_by
and reframe
/summarise
across
the columns and sort
the values in descending order and select the two highest values like this (Please note: na.rm = TRUE
is used because your first group has only 1 value so with selecting two values one is NA):
library(dplyr)
df %>%
select(-id) %>%
group_by(year) %>%
reframe(across(op:cp, ~sum(sort(.x, decreasing = TRUE)[1:2], na.rm = TRUE)))
#> # A tibble: 5 × 4
#> year op tr cp
#> <int> <int> <int> <int>
#> 1 1984 10 10 10
#> 2 1985 40 40 40
#> 3 1986 52 52 52
#> 4 1987 60 60 60
#> 5 1988 90 90 90
Created on 2023-01-14 with reprex v2.0.2
One difference between reframe
and summarise
is that reframe returns an ungrouped dataframe.
You could also use summarise
like this:
library(dplyr)
df %>%
group_by(year) %>%
summarise(across(op:cp, ~sum(sort(.x, decreasing = TRUE)[1:2], na.rm = TRUE)))
CodePudding user response:
Data
data <-
structure(list(year = c(1984L, 1985L, 1986L, 1987L, 1988L, 1985L,
1986L, 1987L, 1988L, 1985L, 1986L, 1986L, 1987L, 1988L),
id = c(1L,1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
op = c(10L,20L, 30L, 40L, 50L, 15L, 17L, 18L, 19L, 20L, 22L, 10L, 20L, 40L),
tr = c(10L, 20L, 30L, 40L, 50L, 15L, 17L, 18L, 19L, 20L, 22L,10L, 20L, 40L),
cp = c(10L, 20L, 30L, 40L, 50L, 15L, 17L, 18L,19L, 20L, 22L, 10L, 20L, 40L)),
class = "data.frame",row.names = c(NA,-14L))
Code
library(dplyr)
data %>%
select(-id) %>%
group_by(year) %>%
slice_max(n = 2,order_by = op) %>%
summarise(across(.fns = ~sum(.,na.rm = TRUE)))
Output
# A tibble: 5 x 4
year op tr cp
<int> <int> <int> <int>
1 1984 10 10 10
2 1985 40 40 40
3 1986 52 52 52
4 1987 60 60 60
5 1988 90 90 90
CodePudding user response:
Using data.table
library(data.table)
setDT(df1)[, lapply(.SD, \(x) sum(head(x[order(-x)],2), na.rm = TRUE)),
year, .SDcols = op:cp]
-output
year op tr cp
1: 1984 10 10 10
2: 1985 40 40 40
3: 1986 52 52 52
4: 1987 60 60 60
5: 1988 90 90 90
CodePudding user response:
Please try the below code
library(dplyr)
data2 <- data_a %>% dplyr::arrange(year,desc(op),id) %>% group_by(year) %>%
slice_head(n=2) %>% mutate(across(c('op','tr','cp'), ~ sum(.x), .names = '{col}2')) %>% slice_head(n=1) %>%
select(-id,-op,-tr,-cp)