Here the part of my dataset
mydata=structure(list(sales_point_id = c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), calendar_id_operday = c(20210102L,
20210102L, 20210102L, 20210102L, 20210102L, 20210102L, 20210102L,
20210102L, 20210102L, 20210102L, 20210102L, 20210102L, 20210102L,
20210102L, 20210102L, 20210102L, 20210102L, 20210102L, 20210102L,
20210102L, 20210102L, 20210102L, 20210102L, 20210102L, 20210102L,
20210102L, 20210102L, 20210102L, 20210102L, 20210102L, 20210102L,
20210102L, 20210102L, 20210102L, 20210102L, 20210102L, 20210102L,
20210102L, 20210102L, 20210102L, 20210102L, 20210102L, 20210102L,
20210102L, 20210102L, 20210102L, 20210102L), line_fact_amt = c(23749L,
1000L, 3050L, 1550L, 8900L, 1550L, 0L, 300L, 0L, 499L, 5450L,
300L, 0L, 499L, 599L, 599L, 6050L, 300L, 599L, 1400L, 300L, 0L,
2000L, 700L, 0L, 5990L, 8877L, 1999L, 257L, 200L, 361L, 300L,
1990L, 2453L, 3140L, 0L, 0L, 199L, 599L, 10990L, 7990L, 773L,
400L, 6000L, 2269L, 2000L, 1999L)), class = "data.frame", row.names = c(NA,
-47L))
-calendar_id_operday
means day of week 20210102(YYYYMMDD)
.I need aggregate by sum line_fact_amt
for each day of calendar_id_operday
.
And the second ,there is dataset ,with that days but for 2022 year
mydata2=structure(list(sales_point_id = c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), calendar_id_operday = c(20220102L,
20220102L, 20220102L, 20220102L, 20220102L, 20220102L, 20220102L,
20220102L, 20220102L, 20220102L, 20220102L, 20220102L, 20220102L,
20220102L, 20220102L, 20220102L, 20220102L, 20220102L, 20220102L,
20220102L, 20220102L, 20220102L, 20220102L, 20220102L, 20220102L,
20220102L, 20220102L, 20220102L, 20220102L, 20220102L, 20220102L,
20220102L, 20220102L, 20220102L, 20220102L, 20220102L, 20220102L,
20220102L, 20220102L, 20220102L, 20220102L, 20220102L, 20220102L,
20220102L, 20220102L, 20220102L, 20220102L), line_fact_amt = c(38586L,
15837L, 17887L, 16387L, 23737L, 16387L, 14837L, 15137L, 14837L,
15336L, 20287L, 15137L, 14837L, 15336L, 15436L, 15436L, 20887L,
15137L, 15436L, 16237L, 15137L, 14837L, 16837L, 15537L, 14837L,
20827L, 23714L, 16836L, 15094L, 15037L, 15198L, 15137L, 16827L,
17290L, 17977L, 14837L, 14837L, 15036L, 15436L, 25827L, 22827L,
15610L, 15237L, 20837L, 17106L, 16837L, 16836L)), class = "data.frame", row.names = c(NA,
-47L))
also aggregate line_fact_amt
by calendar_id_operday
.
The Greatest Difficulty for me: if the aggregated value of the same day of the next year(2022) is greater than for the same day but of the last year(2021), well, for example, for 20220102 the sum = 815519
, while for the same day but 2021 = 118180
, we have a difference for this day between 2021 and 2022 more than 40%, and if there is a difference for the same day>40%, then replace for the next year with the values of the previous year of this date 10%, i.e. 815519 is replaced by 118180 10% of it (129998).
How to do this procedure for sales_point_id separately?
Thanks for your valuable help.
CodePudding user response:
Based on the description, perhaps this helps
library(lubridate)
library(dplyr)
agg1 <- mydata %>%
group_by(calendar_id_operday) %>%
summarise(line_fact_amt = sum(line_fact_amt, na.rm = TRUE))
agg2 <- mydata2 %>%
group_by(calendar_id_operday) %>%
summarise(line_fact_amt = sum(line_fact_amt, na.rm = TRUE))
agg2 %>%
mutate(calendar_prev = as.integer(format(ymd(calendar_id_operday) -
years(1), '%Y%m%d')) ) %>%
full_join(agg1, ., by = c(calendar_id_operday = "calendar_prev")) %>%
mutate(changed = ifelse((line_fact_amt.y - line_fact_amt.x) >= 0.4 *
line_fact_amt.y, line_fact_amt.x 0.1 * line_fact_amt.x, line_fact_amt.y ))