I got the data from MySQL and I'm trying to visualize it and uncover some answers. Using R for the statistic.
The final product is % discount for reach price change (=row).
Here is an example of my dataset.
itemId pricehis timestamp
1 69295477 1290 2022-04-12 04:42:53
2 69295624 1145 2022-04-12 04:42:53
3 69296136 3609 2022-04-12 04:42:54
4 69296607 855 2022-04-12 04:42:53
5 69295291 1000 2022-04-12 04:42:50
6 69295475 4188 2022-04-12 04:42:52
7 69295614 1145 2022-04-12 04:42:51
8 69295622 1290 2022-04-12 04:42:50
9 69295692 3609 2022-04-12 04:42:49
10 69295917 1725 2022-04-12 04:42:48
11 69296090 2449 2022-04-12 04:42:53
12 69296653 1145 2022-04-12 04:42:51
13 69296657 5638 2022-04-12 04:42:48
14 69296661 1725 2022-04-12 04:42:51
15 69296696 710 2022-04-12 04:42:51
I've been stuck at one part of the calculation - maximum value for each productId in 6 months.
In the dataset there are rows for specific productId with different pricehis values and different timestamps. I need to find the max value for a given row no older than 6 months.
The formula for calculating the desired discount is:
Discount grouped by itemId = 1 - pricehis / max(pricehis in the last 6 months)
At this moment I'm unable to solve the second part - pricehis in the last 6 months. - I need a new column with maximum 'pricehis' in the last 6 months for the itemId. Also could be known as interval maximum.
I can group it by the itemId, but I can't figure out how to add the condition on 6 months max.
Any tips on how to get this?
CodePudding user response:
I like slider::slide_index_dbl
for this sort of thing. Here's some fake data chosen to demonstrate the 6mo window:
data.frame(itemId = rep(1:2, each = 6),
price = floor(100*cos(0:11)^2),
timestamp = as.Date("2000-01-01") 50*(0:11)) -> df
We can group by itemId and then apply the window function. (Note that slider
requires the data to be sorted by date within each group.)
library(dplyr)
library(lubridate) # for `%m-%`, to get sliding months (harder than it sounds!)
df %>%
group_by(itemId) %>%
mutate(max_6mo = slider::slide_index_dbl(price, timestamp, max,
.before = ~.x %m-% months(6))) %>%
mutate(discount = 1 - price / max_6mo) %>%
ungroup()
Result
# A tibble: 12 × 5
itemId price timestamp max_6mo discount
<int> <dbl> <date> <dbl> <dbl>
1 1 100 2000-01-01 100 0
2 1 29 2000-02-20 100 0.71
3 1 17 2000-04-10 100 0.83
4 1 98 2000-05-30 100 0.0200
5 1 42 2000-07-19 98 0.571 # new max since >6mo since 100
6 1 8 2000-09-07 98 0.918
7 2 92 2000-10-27 92 0
8 2 56 2000-12-16 92 0.391
9 2 2 2001-02-04 92 0.978
10 2 83 2001-03-26 92 0.0978
11 2 70 2001-05-15 83 0.157 # new max since >6mo since 92
12 2 0 2001-07-04 83 1