Calculating the difference between first and last row in each group-CodePudding

I have a dataframe containing date, price, and a binary group (1/0 for discount or not discount) as show in the picture:

sample

structure(list(date = structure(c(16443, 16444, 16447, 16448,16449,16457, 16458, 16461, 16462, 16463, 16464, 16465, 16468, 16469, 16470, 16471, 16472, 16475, 16476, 16484, 16437, 16440, 16441,16442, 16450, 16451, 16454, 16455, 16456, 16477, 16478, 16479,16482, 16483, 16492, 16493, 16496, 16497, 16498, 16520), class ="Date"), price = c(8.066, 7.918, 7.856, 7.82, 7.828, 7.37, 7.283,7.299, 7.25, 7.228, 7.219, 7.168, 7.118, 7.031, 7.098,7.11,7.156, 7.189, 7.195, 7.195, 7.859, 7.964, 7.942, 8.117, 7.791, 7.785, 7.789, 7.446, 7.401, 7.412, 7.529, 7.495, 7.456, 7.393, 7.104, 7.062, 7.081, 7.115, 7.289, 7.378), discount = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"),row.names = c(NA, -40L), groups = structure(list(   discount = c(0, 1), .rows = structure(list(1:20, 21:40), ptype =integer(0), class = c("vctrs_list_of",  "vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))

A discount group represents individual period which has various lengths of days (it can be thought as discount week).

I would like to calculate the difference between first and last price in each group of discount.

Any advice would be appreciated.

CodePudding user response：

One way would be the following:

library(dplyr)

df %>%
  arrange(discount, date) %>% 
  group_by(discount) %>% 
  summarise(diff = first(price) - last(price))

# # A tibble: 2 x 2
# discount  diff
# <dbl> <dbl>
# 1        0 0.871
# 2        1 0.481

You can replace summarise(diff = first(price) - last(price)) with summarise(diff = head(price, 1) - tail(price, 1)), which will generate the same output.

CodePudding user response：

(Assuming dplyr.) Not assuming that date is guaranteed to be in order; if it is, then one could also use first(.)/last(.) for the same results. I tend to prefer not trusting order ...)

If your discount is always 0/1 and you are looking to group by contiguous same-values, then

dat %>%
  group_by(discountgrp = cumsum(discount != lag(discount, default = discount[1]))) %>%
  summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
#   discountgrp change
#         <int>  <dbl>
# 1           0 -0.871
# 2           1 -0.481

If your discount is instead a categorical value and can exceed 1, then

dat %>%
  group_by(discount) %>%
  summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
#   discount change
#      <dbl>  <dbl>
# 1        0 -0.871
# 2        1 -0.481

They happen to be the same here, but if the row order were changed such that some of the 1s occurred in the middle of 0s (for instance), then the groups would be different.

CodePudding user response：

Using data.table:

library(data.table)

setDT(x)[, grp := rleid(discount) 
         ][order(grp, date), .(priceDiff = head(price, 1) - tail(price, 1)), by = grp ]

#    grp priceDiff
# 1:   1     0.871
# 2:   2     0.481

CodePudding user response：

Base R solution

sort the dataframe, just to be sure

df=df[order(df$date),]

create a new variable for grouping

df["grp"]=cumsum(df["discount"]!=rbind(-1,head(df["discount"],-1)))

aggregate and difference

aggregate(
  price~grp,
  df,
  function(x){head(x,1)-tail(x,1)}
)

  grp  price
1   1 -0.258
2   2  0.238
3   3  0.390
4   4  0.175
5   5  0.019
6   6  0.000
7   7 -0.274

CodePudding user response：

Using tapply.

res <- with(DF, tapply(price, list(week=I(cumsum(c(0, diff(discount)) != 0)   1)), 
                       \(x) x[length(x)] - x[1]))
res
# week
#      1      2      3      4      5      6      7      8      9     10 
# -0.246 -0.529 -0.082 -0.008  0.039  0.258 -0.345  0.094 -0.394  0.297

Or ave.

DF <- transform(DF, d=ave(price, cumsum(c(0, diff(discount)) != 0), 
                          FUN=\(x) x[length(x)] - x[1]))
DF
#          date price discount      d
# 1  2015-01-08 8.066        0 -0.246
# 2  2015-01-09 7.918        0 -0.246
# 3  2015-01-12 7.856        0 -0.246
# 4  2015-01-13 7.820        0 -0.246
# 5  2015-01-14 7.828        1 -0.529
# 6  2015-01-22 7.370        1 -0.529
# 7  2015-01-23 7.283        1 -0.529
# 8  2015-01-26 7.299        1 -0.529
# 9  2015-01-27 7.250        0 -0.082

Data

DF <- structure(list(date = structure(c(16443, 16444, 16447, 16448, 
16449, 16457, 16458, 16461, 16462, 16463, 16464, 16465, 16468, 
16469, 16470, 16471, 16472, 16475, 16476, 16484, 16437, 16440, 
16441, 16442, 16450, 16451, 16454, 16455, 16456, 16477, 16478, 
16479, 16482, 16483, 16492, 16493, 16496, 16497, 16498, 16520
), class = "Date"), price = c(8.066, 7.918, 7.856, 7.82, 7.828, 
7.37, 7.283, 7.299, 7.25, 7.228, 7.219, 7.168, 7.118, 7.031, 
7.098, 7.11, 7.156, 7.189, 7.195, 7.195, 7.859, 7.964, 7.942, 
8.117, 7.791, 7.785, 7.789, 7.446, 7.401, 7.412, 7.529, 7.495, 
7.456, 7.393, 7.104, 7.062, 7.081, 7.115, 7.289, 7.378), discount = c(0L, 
0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 
0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 
0L, 0L, 0L, 1L, 1L, 1L, 1L)), row.names = c(NA, -40L), class = "data.frame")

CodePudding user response：

We may also use which.max and which.min

library(dplyr)
df1 %>% 
   group_by(discount) %>% 
   summarise(diff = price[which.max(date)] - price[which.min(date)])