Home > Blockchain >  Aggregating / rolling up specific records "into" subsequent records
Aggregating / rolling up specific records "into" subsequent records

Time:09-22

I am trying to aggregate records with a specific type into subsequent records.

I have a dataset similar to the following:

df_initial <- data.frame("Id" = c(1, 2, 3, 4, 5), 
                         "Qty" = c(105, 110, 100, 115, 120), 
                         "Type" = c("A", "B", "B", "A", "A"), 
                         "Difference" = c(30, 34, 32, 30, 34))

After sorting on the Id field, I'd like to aggregate records of Type = "B" into the next record of type = "A".

In other words, I'm looking to create df_new, which adds the Qty and Difference values for Ids 2 and 3 into the Qty and Difference values for Id 4, and flags Id 4 as being adjusted (in the field AdjustedFlag).

df_new <- data.frame("Id" = c(1, 4, 5), 
                     "Qty" = c(105, 325, 120), 
                     "Type" = c("A", "A", "A"), 
                     "Difference" = c(30, 96, 34), 
                     "AdjustedFlag" = c(0, 1, 0))

I'd greatly appreciate any advice or ideas about how to do this in R, preferably using data.table.

CodePudding user response:

A data.table solution:

df_initial[, .(
    Id = Id[.N], Qty = sum(Qty),
    Difference = sum(Difference),
    AdjustedFlag =  (.N > 1)
  ), by = .(grp = rev(cumsum(rev(Type == "A"))))
  ][, grp := NULL][]
#       Id   Qty Difference AdjustedFlag
#    <num> <num>      <num>        <int>
# 1:     1   105         30            0
# 2:     4   325         96            1
# 3:     5   120         34            0

CodePudding user response:

This can be solved by creating a new grouping variable, that groups the rows into the groups you describe, with the idea being to utilize that grouping variable for the desired aggregation.

Instead of having

A B B A A

that new grouping variable should look something like this:

1 2 2 2 3

This is not a data.table solution, but the same logic could be applied there:

library(tidyverse)

df_initial |>
  mutate(
    type2 = ifelse(Type == "A", as.numeric(factor(Type)), 0),
    type2 = cumsum(type2),
    type2 = ifelse(Type == "B", NA, type2)
  ) |>
  fill(type2, .direction = "up") |>
  group_by(type2) |>
  summarise(
    id = max(Id),
    Qty = sum(Qty),
    Difference = sum(Difference),
    AdjustedFlag = as.numeric(n() > 1)
  )
#> # A tibble: 3 × 5
#>   type2    id   Qty Difference AdjustedFlag
#>   <dbl> <dbl> <dbl>      <dbl>        <dbl>
#> 1     1     1   105         30            0
#> 2     2     4   325         96            1
#> 3     3     5   120         34            0

CodePudding user response:

Using tidyverse

df_initial %>% 
  mutate(gn = if_else(lag(Type, default = 'A') == 'B' | Type == 'B', 'B', Type), 
                      gr = cumsum(lag(gn, default = 'A') != gn),
                      adjusted = if_else(lag(Type, default = 'A') == 'B' | Type == 'B', 1, 0)) %>% 
                      group_by(gr) %>% 
  summarise(Id = last(Id),
            Qty = sum(Qty),
            Type = 'A', 
            Difference = sum(Difference), 
            Adjusted_flg = max(adjusted)) %>% ungroup()

Here we create an interim dataset that looks like:

  Id Qty Type Difference gn gr  Adjusted
1  1 105    A         30  A  0         0
2  2 110    B         34  B  1         0
3  3 100    B         32  B  1         0
4  4 115    A         30  B  1         1
5  5 120    A         34  A  2         0

And use this to create our final table within the summarise. The gr is a column for indicating a group of values, which is why we group_by it.

  • Related