I have two databases, one with daily metrics df
and one with benchmarked metrics (with a date attached) benchmark_df
. I would like to find the percentage of the daily metrics compared to the max of the benchmark metrics - but only for benchmark dates that are less than or equal to the previous dates.
So a straight forward answer would be using left_join()
and mutate(perc_benchmark = metric_a / benchmark_metric_a)
where benchmark_metric_a
is the the max value less than or equal to the metric_a
date... but I'm not sure if I can do this with a left_join()
or if there is a workaround.
So in the example data frames below. Player B would have a perc_benchmark
of 50%
(0.5/1) from 8-1 to 8-6, 8-7 would be 25%
(0.5/2), and then 8-8 would be 75%
(1.5/2), etc.
df <- tibble(player = rep(LETTERS[1:2], times = 21),
date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
# metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7)))
benchmark_df <- tibble(player = rep(LETTERS[1:2], times = 3),
date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
benchmark_metric_a = c(1,1,0.5,2,1,3))
CodePudding user response:
The dplyr
package is developing support for non-equi *_join()
s.
...using
left_join()
...wherebenchmark_metric_a
is the the max value less than or equal to themetric_a date
...
In the meantime, here's a solution that uses a non-equi join with the data.table
package, whose performance is superior at scale.
Solution
First prepare your datasets as data.table
s.
library(data.table)
# ...
# Code to generate 'df' and 'benchmark_df'.
# ...
# Convert datasets into 'data.table's.
setDT(df)
setDT(benchmark_df)
Then apply the following data.table
"chain", which is analogous to the dplyr
workflow:
# Perform a LEFT JOIN of 'df' to 'benchmark_df'...
result <- benchmark_df[df, .(
# ...using a subset of columns...
player,
bench_date = x.date,
benchmark_metric_a,
metric_date = i.date,
metric_a
),
# ...where players match and benchmark dates are earlier (or concurrent).
on = .(player, date <= date)
# Calculate the max benchmark...
][, .(
benchmark_metric_a = max(benchmark_metric_a),
# ...while preserving the metric...
metric_a = first(metric_a)
),
# ...for each player as of each date.
by = .(player, metric_date)
# Calculate the percentage.
][, `:=`(
perc_benchmark = metric_a / benchmark_metric_a
# Select the desired columns.
)][, .(
player,
date = metric_date,
metric_a,
perc_benchmark
# Restore original ordering from 'df': sort by 'date' and then 'player'.
)][order(
date,
player
)]
Result
Given a df
and benchmark_df
like your examples
library(dplyr)
df <- tibble(
player = rep(LETTERS[1:2], times = 21),
date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
# metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7))
)
benchmark_df <- tibble(
player = rep(LETTERS[1:2], times = 3),
date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
benchmark_metric_a = c(1,1,0.5,2,1,3)
)
this chain should yield the following data.table
for result
, where the df
data is now augmented with the perc_benchmark
column:
player date metric_a perc_benchmark
1: A 2022-08-01 0.5 0.50
2: B 2022-08-01 0.5 0.50
3: A 2022-08-02 0.5 0.50
4: B 2022-08-02 0.5 0.50
5: A 2022-08-03 0.5 0.50
6: B 2022-08-03 0.5 0.50
7: A 2022-08-04 0.5 0.50
8: B 2022-08-04 0.5 0.50
9: A 2022-08-05 0.5 0.50
10: B 2022-08-05 0.5 0.50
11: A 2022-08-06 0.5 0.50
12: B 2022-08-06 0.5 0.50
13: A 2022-08-07 0.5 0.50
14: B 2022-08-07 0.5 0.25
15: A 2022-08-08 0.5 0.50
16: B 2022-08-08 1.5 0.75
17: A 2022-08-09 0.5 0.50
18: B 2022-08-09 1.5 0.75
19: A 2022-08-10 0.5 0.50
20: B 2022-08-10 1.5 0.75
21: A 2022-08-11 0.5 0.50
22: B 2022-08-11 1.5 0.75
23: A 2022-08-12 0.5 0.50
24: B 2022-08-12 1.5 0.75
25: A 2022-08-13 0.5 0.50
26: B 2022-08-13 1.5 0.75
27: A 2022-08-14 0.5 0.50
28: B 2022-08-14 1.5 0.75
29: A 2022-08-15 0.5 0.50
30: B 2022-08-15 3.0 1.50
31: A 2022-08-16 0.5 0.50
32: B 2022-08-16 3.0 1.00
33: A 2022-08-17 0.5 0.50
34: B 2022-08-17 3.0 1.00
35: A 2022-08-18 0.5 0.50
36: B 2022-08-18 3.0 1.00
37: A 2022-08-19 0.5 0.50
38: B 2022-08-19 3.0 1.00
39: A 2022-08-20 0.5 0.50
40: B 2022-08-20 3.0 1.00
41: A 2022-08-21 0.5 0.50
42: B 2022-08-21 3.0 1.00
player date metric_a perc_benchmark
Note
If you want to convert this back into tibble
form, then simply use as_tibble()
on the result
.