The minimal sample of my data is;
library(data.table)
set.seed(1)
tbl <- data.table(store = c('A','A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C'),
year = c(2017,2017,2017,2017,2018,2018,2018,2017,2017,2017,2017,2017,2017,2017,2017,2017,2018),
week = c(12,13,15,16,2,3,4,18,19,20,22,24,25,1,2,3,2),
insert = sample(0:1,size = 17,replace = T),
demand = sample(200:250,size = 17,replace=T))
I'd like to create a summary table which is calculated in this way;
to calculate the effect where the insert column is 1, I must divide the demand value of the insert column by the average of the demand value of the nearest weeks within a year, which takes the value of the insert column zero around that week
For example, if I need to calculate for store A at 2017 and 3rd week, it must be:
rate <- 209 / mean(c(206,245))
but if the year has more than one inserts, like for store B at year 2017 it should be means of the rates (for week 19 and 20):
rate <- mean(224 / mean(c(240,236)), 245 / mean(c(240,236)))
and if I can't find two values around it, I have to calculate a ratio by dividing it by the only value I find. for example store C at year 2017:
rate <- 243 / 224
if I can find no value around the inserted row, I need to pass 1.
finally the summary table should look like;
desired_tbl <- data.table(store = c('A','A','B','C','C'),
year = c(2017,2018,2017,2017,2018),
rate = c(0.927,0.97,0.941,1.08,1))
desired_tbl
store year rate
<chr> <dbl> <dbl>
1 A 2017 0.927
2 A 2018 0.97
3 B 2017 0.941
4 C 2017 1.08
5 C 2018 1
I can do all of them by writing for loops and tons of conditions but I'm looking for a vectorised way to accomplish. dplyr solutions are also welcome.
Thanks in advance.
CodePudding user response:
Here is one possible way
- First, identify the store-years, where the
insert=1
rows are consecutive, and collapse them (i.e, take the average of those). In your example, there is only one like this (2017-B, weeks 19/20). Take these collapsed rows, and bind them back to theinsert=0
rows, ordering back tostore
,year
,week
tbl[, `:=`(demand=as.double(demand), insert_id=rleid(insert))]
tbl <- rbind(
tbl[insert==0],
tbl[insert==1][, lapply(.SD, mean, na.rm=T), by=.(store,year,insert_id)]
)[order(store,year,week)]
- Next, get the values before and after the
insert=1
rows, generate the mean (m
) of those values by row, estimate the ratio ofdemand
tom
, and retain theinsert=1
rows.
tbl[, c("v1","v2"):=shift(demand,c(-1,1)), by=.(store, year)]
tbl[, m:=mean(c(v1,v2), na.rm=T), by=1:nrow(tbl)]
tbl[, rate:=demand/m][insert==1,.(rate=mean(fifelse(is.na(rate),1,rate))), by=.(store,year)]
Output:
store year rate
<char> <num> <num>
1: A 2017 0.9268293
2: A 2018 0.9727273
3: B 2017 0.9852941
4: C 2017 1.0848214
5: C 2018 1.0000000
Its not identical as your desired output, because I believe you have miscalculated 2017-B.
CodePudding user response:
Replace demands with insert values of 1 with NA and then fill them in forwards and backwards by group taking the mean of the two directions and dividng into demand. Then keep only the insert 1 rows and aggregate each group using what is left. If any rate is not finite use 1.
library(dplyr)
library(zoo)
tbl %>%
mutate(d = ifelse(insert == 1, NA, demand)) %>%
group_by(store, year) %>%
mutate(rate = demand /
rowMeans(cbind(na.locf0(d), na.locf0(d, fromLast = TRUE)), na.rm = TRUE)) %>%
filter(insert == 1) %>%
summarize(rate = mean(rate, na.rm = TRUE), .groups = "drop") %>%
mutate(rate = ifelse(is.finite(rate), rate, 1))
giving:
# A tibble: 5 x 3
store year rate
<chr> <dbl> <dbl>
1 A 2017 0.927
2 A 2018 0.973
3 B 2017 0.985
4 C 2017 1.08
5 C 2018 1