I have 2 tables, table 1 with policies and their effective dates, table with 2 with the effective dates and a "factor". Want to pull table 1 policies with their latest factors. All policies should have factors, but in the off chance they don't, a null value should be returned.
Table 1 example
State | Policy | Effective_date |
---|---|---|
Alabama | P15001 | 1/4/2021 |
Alabama | P15002 | 2/8/2022 |
Arizona | P15004 | 3/7/2018 |
... (1M rows)
Table 2 example
State | Effective_date | Factor |
---|---|---|
Alabama | 1/1/2018 | 1.345 |
Alabama | 7/1/2020 | 1.143 |
Alabama | 10/1/2021 | 1.099 |
Arizona | 1/1/2017 | 0.899 |
...
Want:
State | Policy | Policy_Effective_date | Factor | t2_effective_date |
---|---|---|---|---|
Alabama | P15001 | 1/4/2021 | 1.143 | 7/1/2020 |
Alabama | P15002 | 2/8/2022 | 1.099 | 10/1/2021 |
Arizona | P15004 | 3/7/2018 | 0.899 | 1/1/2017 |
column 5 is nice to have but not 100% necessary.
(All these data tables/numbers are made up)
The current query is something like this:
create table want as
select t1.state, t1. policy, t1.effective_date, max(t2.effective_date) as t2_effective_date
from t1 left join t2
on t1.state = t2.state
and t1.Effective_date >= t2.Effective_date
group by t1.state, t1.policy
and then the t2_effective_date is joined to t2 to get the factor.
It works but is quite inefficient (takes a long time). I also don't like the inequality in the join, but couldn't come up with anything better myself. Are there better ways than this? Any solution is fine, if creating new helper tables, new columns are needed that's fine. More code is fine, if it makes it more efficient.
I'm using SAS SQL. Thanks in advance!
CodePudding user response:
You do not need "join" the tables. Just interleave them and remember the last time the factor changed.
First let's convert your listing into actual datasets.
data t1 ;
input State :$20. Policy :$10. Effective_date :mmddyy.;
format effective_date yymmdd10.;
cards;
Alabama P15001 1/4/2021
Alabama P15002 2/8/2022
Arizona P15004 3/7/2018
;
data t2;
input State :$20. Effective_date :mmddyy. Factor ;
format effective_date yymmdd10.;
cards;
Alabama 1/1/2018 1.345
Alabama 7/1/2020 1.143
Alabama 10/1/2021 1.099
Arizona 1/1/2017 0.899
;
Now use SET with BY to interleave the observations by STATE and DATE.
data want ;
set t2(in=in2) t1(in=in1);
by state effective_date ;
retain factor_effective_date factor2;
format factor_effective_date yymmdd10.;
if first.state then call missing(of factor_effective_date factor2);
if in2 then do;
Factor_effective_date=effective_date;
Factor2 = factor;
end;
if in1 ;
drop factor ;
rename factor2=Factor effective_date=Policy_effective_date;
run;
Results:
Policy_ factor_
effective_ effective_
Obs State date Policy date Factor
1 Alabama 2021-01-04 P15001 2020-07-01 1.143
2 Alabama 2022-02-08 P15002 2021-10-01 1.099
3 Arizona 2018-03-07 P15004 2017-01-01 0.899
CodePudding user response:
Slightly different approach here, modifying table 2 to have a start/end date and then joining using a BETWEEN.
data t1 ;
input State :$20. Policy :$10. Effective_date :mmddyy.;
format effective_date yymmdd10.;
cards;
Alabama P15001 1/4/2021
Alabama P15002 2/8/2022
Arizona P15004 3/7/2018
;
data t2;
input State :$20. Effective_date :mmddyy. Factor ;
format effective_date yymmdd10.;
cards;
Alabama 1/1/2018 1.345
Alabama 7/1/2020 1.143
Alabama 10/1/2021 1.099
Arizona 1/1/2017 0.899
;
proc sort data=t2;
by state effective_date;
run;
data t2_start_end;
merge t2 t2(firstobs=2 rename=effective_date =end_date drop = factor);
by state ;
end_date = end_date - 1;
if last.state then end_date=today();
rename effective_date = start_date;
run;
proc sql;
create table want as
select t1.*, t2.factor, t2.start_date as policy_factor_start
from t1
left join t2_start_end as t2
on t1.state=t2.state and t1.effective_date between t2.start_date and t2.end_date
order by 1, 2, 3;
quit;