I have a dataframe containing date, price, and a binary group (1/0 for discount or not discount) as show in the picture:
structure(list(date = structure(c(16443, 16444, 16447, 16448,16449,16457, 16458, 16461, 16462, 16463, 16464, 16465, 16468, 16469, 16470, 16471, 16472, 16475, 16476, 16484, 16437, 16440, 16441,16442, 16450, 16451, 16454, 16455, 16456, 16477, 16478, 16479,16482, 16483, 16492, 16493, 16496, 16497, 16498, 16520), class ="Date"), price = c(8.066, 7.918, 7.856, 7.82, 7.828, 7.37, 7.283,7.299, 7.25, 7.228, 7.219, 7.168, 7.118, 7.031, 7.098,7.11,7.156, 7.189, 7.195, 7.195, 7.859, 7.964, 7.942, 8.117, 7.791, 7.785, 7.789, 7.446, 7.401, 7.412, 7.529, 7.495, 7.456, 7.393, 7.104, 7.062, 7.081, 7.115, 7.289, 7.378), discount = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"),row.names = c(NA, -40L), groups = structure(list( discount = c(0, 1), .rows = structure(list(1:20, 21:40), ptype =integer(0), class = c("vctrs_list_of", "vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
A discount group represents individual period which has various lengths of days (it can be thought as discount week).
I would like to calculate the difference between first and last price in each group of discount.
Any advice would be appreciated.
CodePudding user response:
One way would be the following:
library(dplyr)
df %>%
arrange(discount, date) %>%
group_by(discount) %>%
summarise(diff = first(price) - last(price))
# # A tibble: 2 x 2
# discount diff
# <dbl> <dbl>
# 1 0 0.871
# 2 1 0.481
You can replace summarise(diff = first(price) - last(price))
with summarise(diff = head(price, 1) - tail(price, 1))
, which will generate the same output.
CodePudding user response:
(Assuming dplyr
.) Not assuming that date
is guaranteed to be in order; if it is, then one could also use first(.)
/last(.)
for the same results. I tend to prefer not trusting order ...)
If your discount
is always 0/1 and you are looking to group by contiguous same-values, then
dat %>%
group_by(discountgrp = cumsum(discount != lag(discount, default = discount[1]))) %>%
summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
# discountgrp change
# <int> <dbl>
# 1 0 -0.871
# 2 1 -0.481
If your discount
is instead a categorical value and can exceed 1, then
dat %>%
group_by(discount) %>%
summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
# discount change
# <dbl> <dbl>
# 1 0 -0.871
# 2 1 -0.481
They happen to be the same here, but if the row order were changed such that some of the 1
s occurred in the middle of 0
s (for instance), then the groups would be different.
CodePudding user response:
Using data.table:
library(data.table)
setDT(x)[, grp := rleid(discount)
][order(grp, date), .(priceDiff = head(price, 1) - tail(price, 1)), by = grp ]
# grp priceDiff
# 1: 1 0.871
# 2: 2 0.481
CodePudding user response:
Base R solution
sort the dataframe, just to be sure
df=df[order(df$date),]
create a new variable for grouping
df["grp"]=cumsum(df["discount"]!=rbind(-1,head(df["discount"],-1)))
aggregate and difference
aggregate(
price~grp,
df,
function(x){head(x,1)-tail(x,1)}
)
grp price
1 1 -0.258
2 2 0.238
3 3 0.390
4 4 0.175
5 5 0.019
6 6 0.000
7 7 -0.274
CodePudding user response:
Using tapply
.
res <- with(DF, tapply(price, list(week=I(cumsum(c(0, diff(discount)) != 0) 1)),
\(x) x[length(x)] - x[1]))
res
# week
# 1 2 3 4 5 6 7 8 9 10
# -0.246 -0.529 -0.082 -0.008 0.039 0.258 -0.345 0.094 -0.394 0.297
Or ave
.
DF <- transform(DF, d=ave(price, cumsum(c(0, diff(discount)) != 0),
FUN=\(x) x[length(x)] - x[1]))
DF
# date price discount d
# 1 2015-01-08 8.066 0 -0.246
# 2 2015-01-09 7.918 0 -0.246
# 3 2015-01-12 7.856 0 -0.246
# 4 2015-01-13 7.820 0 -0.246
# 5 2015-01-14 7.828 1 -0.529
# 6 2015-01-22 7.370 1 -0.529
# 7 2015-01-23 7.283 1 -0.529
# 8 2015-01-26 7.299 1 -0.529
# 9 2015-01-27 7.250 0 -0.082
Data
DF <- structure(list(date = structure(c(16443, 16444, 16447, 16448,
16449, 16457, 16458, 16461, 16462, 16463, 16464, 16465, 16468,
16469, 16470, 16471, 16472, 16475, 16476, 16484, 16437, 16440,
16441, 16442, 16450, 16451, 16454, 16455, 16456, 16477, 16478,
16479, 16482, 16483, 16492, 16493, 16496, 16497, 16498, 16520
), class = "Date"), price = c(8.066, 7.918, 7.856, 7.82, 7.828,
7.37, 7.283, 7.299, 7.25, 7.228, 7.219, 7.168, 7.118, 7.031,
7.098, 7.11, 7.156, 7.189, 7.195, 7.195, 7.859, 7.964, 7.942,
8.117, 7.791, 7.785, 7.789, 7.446, 7.401, 7.412, 7.529, 7.495,
7.456, 7.393, 7.104, 7.062, 7.081, 7.115, 7.289, 7.378), discount = c(0L,
0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 1L, 1L, 1L, 1L)), row.names = c(NA, -40L), class = "data.frame")
CodePudding user response:
We may also use which.max
and which.min
library(dplyr)
df1 %>%
group_by(discount) %>%
summarise(diff = price[which.max(date)] - price[which.min(date)])