Optimizing a Left Join on inequality-CodePudding

I have 2 tables, table 1 with policies and their effective dates, table with 2 with the effective dates and a "factor". Want to pull table 1 policies with their latest factors. All policies should have factors, but in the off chance they don't, a null value should be returned.

Table 1 example

State	Policy	Effective_date
Alabama	P15001	1/4/2021
Alabama	P15002	2/8/2022
Arizona	P15004	3/7/2018

... (1M rows)

Table 2 example

State	Effective_date	Factor
Alabama	1/1/2018	1.345
Alabama	7/1/2020	1.143
Alabama	10/1/2021	1.099
Arizona	1/1/2017	0.899

...

Want:

State	Policy	Policy_Effective_date	Factor	t2_effective_date
Alabama	P15001	1/4/2021	1.143	7/1/2020
Alabama	P15002	2/8/2022	1.099	10/1/2021
Arizona	P15004	3/7/2018	0.899	1/1/2017

column 5 is nice to have but not 100% necessary.

(All these data tables/numbers are made up)

The current query is something like this:

create table want as 

select t1.state, t1. policy, t1.effective_date, max(t2.effective_date)  as t2_effective_date 

from t1 left join t2  

on t1.state = t2.state 

and t1.Effective_date >= t2.Effective_date

group by t1.state, t1.policy

and then the t2_effective_date is joined to t2 to get the factor.

It works but is quite inefficient (takes a long time). I also don't like the inequality in the join, but couldn't come up with anything better myself. Are there better ways than this? Any solution is fine, if creating new helper tables, new columns are needed that's fine. More code is fine, if it makes it more efficient.

I'm using SAS SQL. Thanks in advance!

CodePudding user response：

You do not need "join" the tables. Just interleave them and remember the last time the factor changed.

First let's convert your listing into actual datasets.

data t1 ;
  input State :$20. Policy :$10. Effective_date :mmddyy.;
  format effective_date yymmdd10.;
cards;
Alabama P15001 1/4/2021
Alabama P15002 2/8/2022
Arizona P15004 3/7/2018
;

data t2;
  input State :$20. Effective_date :mmddyy. Factor ;
  format effective_date yymmdd10.;
cards;
Alabama 1/1/2018 1.345
Alabama 7/1/2020 1.143
Alabama 10/1/2021 1.099
Arizona 1/1/2017 0.899
;

Now use SET with BY to interleave the observations by STATE and DATE.

data want ;
  set t2(in=in2) t1(in=in1);
  by state effective_date ;
  retain factor_effective_date factor2;
  format factor_effective_date yymmdd10.;
  if first.state then call missing(of factor_effective_date factor2);
  if in2 then do;
     Factor_effective_date=effective_date;
     Factor2 = factor;
  end;
  if in1 ;
  drop factor ;
  rename factor2=Factor effective_date=Policy_effective_date;
run;

Results:

                   Policy_                 factor_
                  effective_              effective_
Obs     State        date       Policy       date       Factor

 1     Alabama    2021-01-04    P15001    2020-07-01     1.143
 2     Alabama    2022-02-08    P15002    2021-10-01     1.099
 3     Arizona    2018-03-07    P15004    2017-01-01     0.899

CodePudding user response：

Slightly different approach here, modifying table 2 to have a start/end date and then joining using a BETWEEN.

data t1 ;
  input State :$20. Policy :$10. Effective_date :mmddyy.;
  format effective_date yymmdd10.;
cards;
Alabama P15001 1/4/2021
Alabama P15002 2/8/2022
Arizona P15004 3/7/2018
;

data t2;
  input State :$20. Effective_date :mmddyy. Factor ;
  format effective_date yymmdd10.;
cards;
Alabama 1/1/2018 1.345
Alabama 7/1/2020 1.143
Alabama 10/1/2021 1.099
Arizona 1/1/2017 0.899
;

proc sort data=t2;
by state effective_date;
run;

data t2_start_end;
merge t2 t2(firstobs=2 rename=effective_date =end_date drop = factor);
by state ;

end_date = end_date - 1;
if last.state then end_date=today();
rename effective_date = start_date;
run;

proc sql;
create table want as
select t1.*, t2.factor, t2.start_date as policy_factor_start
from t1 
left join t2_start_end as t2
on t1.state=t2.state and t1.effective_date between t2.start_date and t2.end_date
order by 1, 2, 3;
quit;