Home > Mobile >  Window function on grouped dataframe get difference between two rows after a filter
Window function on grouped dataframe get difference between two rows after a filter

Time:09-25

A grouped data frame:

grp_diamonds <- diamonds %>% 
  group_by(cut, color) %>% 
  mutate(rn = row_number()) %>% 
  arrange(cut, color, rn) %>% 
  mutate(cumprice = cumsum(price))

Looks like:

grp_diamonds
# A tibble: 53,940 × 12
# Groups:   cut, color [35]
   carat cut   color clarity depth table price     x     y     z    rn cumprice
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>    <int>
 1  0.75 Fair  D     SI2      64.6    57  2848  5.74  5.72  3.7      1     2848
 2  0.71 Fair  D     VS2      56.9    65  2858  5.89  5.84  3.34     2     5706
 3  0.9  Fair  D     SI2      66.9    57  2885  6.02  5.9   3.99     3     8591
 4  1    Fair  D     SI2      69.3    58  2974  5.96  5.87  4.1      4    11565
 5  1.01 Fair  D     SI2      64.6    56  3003  6.31  6.24  4.05     5    14568
 6  0.73 Fair  D     VS1      66      54  3047  5.56  5.66  3.7      6    17615
 7  0.71 Fair  D     VS2      64.7    58  3077  5.61  5.58  3.62     7    20692
 8  0.91 Fair  D     SI2      62.5    66  3079  6.08  6.01  3.78     8    23771
 9  0.9  Fair  D     SI2      65.9    59  3205  6     5.95  3.94     9    26976
10  0.9  Fair  D     SI2      66      58  3205  6     5.97  3.95    10    30181

Within each group, I would like to add a new field 'GROWTH_6_7' which is the delta between cumprice at rn = 7 - rn = 6.

Read documentation and tried and failed using cur_data() with mutate. Maybe that's the right path or maybe there's a 'better' way?

How can I mutate a new field within each group 'GROWTH_6_7' that is the difference between rn == 7 and rn ==6 cumprice?

CodePudding user response:

GrowDelta <- function(data, start_row = 6, end_row = 7){
  data$cumprice[end_row] - data$cumprice[start_row]
}

grp_diamonds %>% 
  summarize(GROWTH_6_7 = GrowDelta(cur_data()))

mutate instead of summarize should work, too. It will just repeat it for every row in the group instead of just once for each group, i.e. will result in tibble with the same number of rows as the data set. Using summarize will give you a 35 x 3 tibble.

CodePudding user response:

You may try group_modify:

Code

grow <- grp_diamonds %>% 
  group_by(cut, color) %>% 
  group_modify(~{
    .x %>% 
      mutate(GROWTH_6_7 = .x$cumprice[.x$rn == 7] - .x$cumprice[.x$rn == 6])
  })

Output

> head(grow)
# A tibble: 6 x 13
# Groups:   cut, color [1]
  cut   color carat clarity depth table price     x     y     z    rn cumprice GROWTH_6_7
  <ord> <ord> <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>    <int>      <int>
1 Fair  D      0.75 SI2      64.6    57  2848  5.74  5.72  3.7      1     2848       3077
2 Fair  D      0.71 VS2      56.9    65  2858  5.89  5.84  3.34     2     5706       3077
3 Fair  D      0.9  SI2      66.9    57  2885  6.02  5.9   3.99     3     8591       3077
4 Fair  D      1    SI2      69.3    58  2974  5.96  5.87  4.1      4    11565       3077
5 Fair  D      1.01 SI2      64.6    56  3003  6.31  6.24  4.05     5    14568       3077
6 Fair  D      0.73 VS1      66      54  3047  5.56  5.66  3.7      6    17615       3077

CodePudding user response:

We could do this within mutate itself`

library(dplyr)
grp_diamonds %>%
    group_by(cut, color) %>% 
    mutate(GROWTH_6_7 = cumprice[rn == 7] - cumprice[rn == 6])

-output

# A tibble: 53,940 x 13
# Groups:   cut, color [35]
   carat cut   color clarity depth table price     x     y     z    rn cumprice GROWTH_6_7
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>    <int>      <int>
 1  0.75 Fair  D     SI2      64.6    57  2848  5.74  5.72  3.7      1     2848       3077
 2  0.71 Fair  D     VS2      56.9    65  2858  5.89  5.84  3.34     2     5706       3077
 3  0.9  Fair  D     SI2      66.9    57  2885  6.02  5.9   3.99     3     8591       3077
 4  1    Fair  D     SI2      69.3    58  2974  5.96  5.87  4.1      4    11565       3077
 5  1.01 Fair  D     SI2      64.6    56  3003  6.31  6.24  4.05     5    14568       3077
 6  0.73 Fair  D     VS1      66      54  3047  5.56  5.66  3.7      6    17615       3077
 7  0.71 Fair  D     VS2      64.7    58  3077  5.61  5.58  3.62     7    20692       3077
 8  0.91 Fair  D     SI2      62.5    66  3079  6.08  6.01  3.78     8    23771       3077
 9  0.9  Fair  D     SI2      65.9    59  3205  6     5.95  3.94     9    26976       3077
10  0.9  Fair  D     SI2      66      58  3205  6     5.97  3.95    10    30181       3077
# … with 53,930 more rows

If there are cases where there are some missing values, then another option is pivot_wider

library(tidyr)
grp_diamonds %>%
    ungroup %>%
    select(cut, color, rn, cumprice) %>%
    filter(rn %in% 6:7) %>% 
    pivot_wider(names_from = rn, values_from = cumprice) %>% 
    transmute(cut, color, GROWTH_6_7 = `7` - `6`) %>% 
    left_join(grp_diamonds, .)
  • Related