Home > Blockchain >  Identify top n observations and aggregate by year all variables in dataframe R
Identify top n observations and aggregate by year all variables in dataframe R

Time:01-15

In the following dataframe I would like to identify for every year the two observations by "id" with highest variable "op". Then aggregate those two highest observations of the variables op, tr, cp. I would appreciate an answer with dplyr. My original dataframe has thousands of hundreds of observations so I need something that I can adjust if I want to select the 1000 highest "op" observations by year.

Data:

year id op tr cp
1  1984  1 10 10 10
2  1985  1 20 20 20
3  1986  1 30 30 30
4  1987  1 40 40 40
5  1988  1 50 50 50
6  1985  2 15 15 15
7  1986  2 17 17 17
8  1987  2 18 18 18
9  1988  2 19 19 19
10 1985  3 20 20 20
11 1986  3 22 22 22
12 1986  4 10 10 10
13 1987  4 20 20 20
14 1988  4 40 40 40

Expected output:

year2 op2 tr2 cp2
1  1984  10  10  10
2  1985  40  40  40
3  1986  52  52  52
4  1987  60  60  60
5  1988  90  90  90

So in 1984 highestop aggregated by id=1, in the second id=1 and 3, in 1986 id= 1 and 3, in 1987 id= 1 and 4, 1988 id = 1 and 4.

I would like to avoid using a function but not so sure if that's possible. A well-functioning function would be could.

CodePudding user response:

You could group_by and reframe/summarise across the columns and sort the values in descending order and select the two highest values like this (Please note: na.rm = TRUE is used because your first group has only 1 value so with selecting two values one is NA):

library(dplyr)
df %>%
  select(-id) %>%
  group_by(year) %>%
  reframe(across(op:cp, ~sum(sort(.x, decreasing = TRUE)[1:2], na.rm = TRUE)))
#> # A tibble: 5 × 4
#>    year    op    tr    cp
#>   <int> <int> <int> <int>
#> 1  1984    10    10    10
#> 2  1985    40    40    40
#> 3  1986    52    52    52
#> 4  1987    60    60    60
#> 5  1988    90    90    90

Created on 2023-01-14 with reprex v2.0.2

One difference between reframe and summarise is that reframe returns an ungrouped dataframe.


You could also use summarise like this:

library(dplyr)
df %>%
  group_by(year) %>%
  summarise(across(op:cp, ~sum(sort(.x, decreasing = TRUE)[1:2], na.rm = TRUE)))

CodePudding user response:

Data

data <-
      structure(list(year = c(1984L, 1985L, 1986L, 1987L, 1988L, 1985L, 
                              1986L, 1987L, 1988L, 1985L, 1986L, 1986L, 1987L, 1988L),
                     id = c(1L,1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
                     op = c(10L,20L, 30L, 40L, 50L, 15L, 17L, 18L, 19L, 20L, 22L, 10L, 20L, 40L),
                     tr = c(10L, 20L, 30L, 40L, 50L, 15L, 17L, 18L, 19L, 20L, 22L,10L, 20L, 40L),
                     cp = c(10L, 20L, 30L, 40L, 50L, 15L, 17L, 18L,19L, 20L, 22L, 10L, 20L, 40L)),
                class = "data.frame",row.names = c(NA,-14L))

Code

library(dplyr)

data %>% 
  select(-id) %>%
  group_by(year) %>% 
  slice_max(n = 2,order_by = op) %>% 
  summarise(across(.fns = ~sum(.,na.rm = TRUE)))

Output

# A tibble: 5 x 4
   year    op    tr    cp
  <int> <int> <int> <int>
1  1984    10    10    10
2  1985    40    40    40
3  1986    52    52    52
4  1987    60    60    60
5  1988    90    90    90

CodePudding user response:

Using data.table

library(data.table)
setDT(df1)[, lapply(.SD, \(x) sum(head(x[order(-x)],2), na.rm = TRUE)),
    year, .SDcols = op:cp]

-output

   year op tr cp
1: 1984 10 10 10
2: 1985 40 40 40
3: 1986 52 52 52
4: 1987 60 60 60
5: 1988 90 90 90

CodePudding user response:

Please try the below code

library(dplyr)

data2 <- data_a %>% dplyr::arrange(year,desc(op),id) %>% group_by(year) %>% 
slice_head(n=2) %>% mutate(across(c('op','tr','cp'), ~ sum(.x), .names = '{col}2')) %>% slice_head(n=1) %>% 
select(-id,-op,-tr,-cp)
  • Related