I have two databases, one with daily metrics df and one with benchmarked metrics (with a date attached) benchmark_df. I would like to find the percentage of the daily metrics compared to the max of the benchmark metrics - but only for benchmark dates that are less than or equal to the previous dates.

So a straight forward answer would be using left_join() and mutate(perc_benchmark = metric_a / benchmark_metric_a) where benchmark_metric_a is the the max value less than or equal to the metric_a date... but I'm not sure if I can do this with a left_join() or if there is a workaround.

So in the example data frames below. Player B would have a perc_benchmark of 50% (0.5/1) from 8-1 to 8-6, 8-7 would be 25% (0.5/2), and then 8-8 would be 75% (1.5/2), etc.

df <- tibble(player =  rep(LETTERS[1:2], times = 21),
             date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
             # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
             metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7)))

benchmark_df <- tibble(player = rep(LETTERS[1:2], times = 3),
                       date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
                       benchmark_metric_a = c(1,1,0.5,2,1,3))

CodePudding user response：

The dplyr package is developing support for non-equi *_join()s.

...using left_join()...where benchmark_metric_a is the the max value less than or equal to the metric_a date...

In the meantime, here's a solution that uses a non-equi join with the data.table package, whose performance is superior at scale.

Solution

First prepare your datasets as data.tables.

library(data.table)


# ...
# Code to generate 'df' and 'benchmark_df'.
# ...


# Convert datasets into 'data.table's.
setDT(df)
setDT(benchmark_df)

Then apply the following data.table "chain", which is analogous to the dplyr workflow:

# Perform a LEFT JOIN of 'df' to 'benchmark_df'...
result <- benchmark_df[df, .(  
    # ...using a subset of columns...
    player,
    bench_date = x.date,
    benchmark_metric_a,
    metric_date = i.date,
    metric_a
  ),
  # ...where players match and benchmark dates are earlier (or concurrent). 
  on = .(player, date <= date)
  
# Calculate the max benchmark...
][, .(
    benchmark_metric_a = max(benchmark_metric_a),
    # ...while preserving the metric...
    metric_a = first(metric_a)
  ),
  # ...for each player as of each date.
  by = .(player, metric_date)
  
# Calculate the percentage.
][, `:=`(
  perc_benchmark = metric_a / benchmark_metric_a
  
# Select the desired columns.
)][, .(
  player,
  date = metric_date,
  metric_a,
  perc_benchmark
  
# Restore original ordering from 'df': sort by 'date' and then 'player'.
)][order(
  date,
  player
)]

Result

Given a df and benchmark_df like your examples

library(dplyr)

df <- tibble(
  player =  rep(LETTERS[1:2], times = 21),
  date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
  # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
  metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7))
)

benchmark_df <- tibble(
  player = rep(LETTERS[1:2], times = 3),
  date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
  benchmark_metric_a = c(1,1,0.5,2,1,3)
)

this chain should yield the following data.table for result, where the df data is now augmented with the perc_benchmark column:

    player       date metric_a perc_benchmark
 1:      A 2022-08-01      0.5           0.50
 2:      B 2022-08-01      0.5           0.50
 3:      A 2022-08-02      0.5           0.50
 4:      B 2022-08-02      0.5           0.50
 5:      A 2022-08-03      0.5           0.50
 6:      B 2022-08-03      0.5           0.50
 7:      A 2022-08-04      0.5           0.50
 8:      B 2022-08-04      0.5           0.50
 9:      A 2022-08-05      0.5           0.50
10:      B 2022-08-05      0.5           0.50
11:      A 2022-08-06      0.5           0.50
12:      B 2022-08-06      0.5           0.50
13:      A 2022-08-07      0.5           0.50
14:      B 2022-08-07      0.5           0.25
15:      A 2022-08-08      0.5           0.50
16:      B 2022-08-08      1.5           0.75
17:      A 2022-08-09      0.5           0.50
18:      B 2022-08-09      1.5           0.75
19:      A 2022-08-10      0.5           0.50
20:      B 2022-08-10      1.5           0.75
21:      A 2022-08-11      0.5           0.50
22:      B 2022-08-11      1.5           0.75
23:      A 2022-08-12      0.5           0.50
24:      B 2022-08-12      1.5           0.75
25:      A 2022-08-13      0.5           0.50
26:      B 2022-08-13      1.5           0.75
27:      A 2022-08-14      0.5           0.50
28:      B 2022-08-14      1.5           0.75
29:      A 2022-08-15      0.5           0.50
30:      B 2022-08-15      3.0           1.50
31:      A 2022-08-16      0.5           0.50
32:      B 2022-08-16      3.0           1.00
33:      A 2022-08-17      0.5           0.50
34:      B 2022-08-17      3.0           1.00
35:      A 2022-08-18      0.5           0.50
36:      B 2022-08-18      3.0           1.00
37:      A 2022-08-19      0.5           0.50
38:      B 2022-08-19      3.0           1.00
39:      A 2022-08-20      0.5           0.50
40:      B 2022-08-20      3.0           1.00
41:      A 2022-08-21      0.5           0.50
42:      B 2022-08-21      3.0           1.00
    player       date metric_a perc_benchmark

Note

If you want to convert this back into tibble form, then simply use as_tibble() on the result.