I have table in SAS Enterprise Guide like below.
Data types and meaning:
- ID - numeric - ID of client
- DT - date - date of change
- OFFER_1 - charcter - current offer
- OFFER_2 - character - offer after change
Values in original dataset are not sorted, but it can be sorted if it is important for solution.
ID | DT | OFFER_1 | OFFER_2
-----|-----------|----------|----------
123 | 01MAY2020 | PR | PR
123 | 05MAY2020 | PR | P
123 | 10MAY2020 | P | P
123 | 11MAY2020 | P | P
123 | 20MAY2020 | P | PR
123 | 21MAY2020 | PR | M
123 | 25MAY2020 | M | M
777 | 30MAY2020 | PR | M
223 | 02JAN2020 | PR | PR
223 | 15MAR2020 | PR | PR
402 | 20MAR2020 | M | M
33 | 11AUG2020 | M | PR
11 | 20JAN2020 | PR | M
11 | 05FEB2020 | M | M
And I need to create new column "COL1" where will be information:
- If client change offer from PR to P or M then count how many maximum days he lasted before returning to PR again, or how many days he is already on P or M after the change, if he has not returned to PR
- If he did not change PR to P or M "COL1" = 0
So, as a result I need something like below:
ID | DT | OFFER_1 | OFFER_2 | COL1
-----|-----------|----------|----------|---------
123 | 01MAY2020 | PR | PR | 15
123 | 05MAY2020 | PR | P | 15
123 | 10MAY2020 | P | P | 15
123 | 11MAY2020 | P | P | 15
123 | 20MAY2020 | P | PR | 15
123 | 21MAY2020 | PR | M | 15
123 | 25MAY2020 | M | M | 15
777 | 30MAY2020 | PR | M | 1
223 | 02JAN2020 | PR | PR | 0
223 | 15MAR2020 | PR | PR | 0
402 | 20MAR2020 | M | M | 0
33 | 11AUG2020 | M | PR | 0
11 | 20JAN2020 | PR | M | 16
11 | 05FEB2020 | M | M | 16
Because:
- ID = 123, has 15 in "COL1" - because changed PR to P or M and lasted a maximum of 15 days (from 05MAY2020 till 20MAY2020) before return to PR again, changed PR to P or M 2 times but the 2nd time the change lasted 4 days and 15 > 4
- ID = 777, has 1 in "COL1" - because changed PR to P or M and lasted a maximum of 1 day (30MAY2020)
- ID = 223, has 0 in "COL1" - because he did not change PR to P or M
- ID = 402, has 0 in "COL1" - because he did not change PR to P or M
- ID = 33, has 0 in "COL1" - because he did not change PR to P or M
- ID = 11, has 16 in "COL1" - because changed PR to P or M and lasted a maximum of 16 days (from 20JAN2020 till 05FEB2020) and did not return to PR again
@Stu Sztukowski - Below exaplmes from my real data where your code does not work. I add also how it should be :)
Examples where code works wrong:
ID | DT | OFFER_1 | OFFER_2 | COL1 |
---|---|---|---|---|
1020 | 01SEP2020 | P | P | 1 -> number of days between 01SEP2020 and 02SEP2020 but there was not changes from PR to P or M, so this client does not interes for us, it should be 0 |
1020 | 02SEP2020 | P | P | 1 |
2030 | 29JUL2022 | P | P | 50 -> number of days between 20JUL2022 and 17SEP2022 but there was not changes from PR to P or M, so this client does not interes for us, it should be 0 |
2030 | 15JUL2022 | P | P | 50 |
2030 | 17SEP2022 | P | P | 50 |
How it should be: Because only change from PR to P or M are interested for us, example at the bottom of this post presents it. Below examples are similar like my examples ID = 223 or 402 from the bottom of this post :)
ID | DT | OFFER_1 | OFFER_2 | COL1 |
---|---|---|---|---|
1020 | 01SEP2020 | P | P | 0 |
1020 | 02SEP2020 | P | P | 0 |
2030 | 29JUL2022 | P | P | 0 |
2030 | 15JUL2022 | P | P | 0 |
2030 | 17SEP2022 | P | P | 0 |
How can I do that in SAS Enterprise Guide in PROC SQL or in normal SAS code ?
CodePudding user response:
If offer_1
is P
or M
, then we just need to take the difference between the current date and the previous date. We'll use some rules to calculate the differences based on your rules. From there we can join the max values back with the original data.
proc sort data=have;
by id dt;
run;
data date_difs;
set have;
by id dt;
retain flag_pr_pm start_dt;
if(first.id) then call missing(flag_pr_pm, start_dt);
if(offer_1 = 'PR' AND offer_2 IN('P', 'M') ) then do;
flag_pr_pm = 1;
start_dt = dt;
end;
if( (offer_1 IN('P', 'M') AND offer_2 = 'PR')
OR (flag_pr_pm AND last.id AND NOT first.id)
)
then do;
flag_pr_pm = 0;
total_days = dt - start_dt;
end;
if( first.id
AND last.id
AND offer_1 = 'PR'
AND offer_2 IN ('P', 'M')
)
then total_days = 1;
format start_dt date9.;
run;
Which gets you:
id dt offer_1 offer_2 flag_pr_pm start_dt total_days
11 20JAN2020 PR M 1 20JAN2020 .
11 05FEB2020 M M 0 20JAN2020 16
33 11AUG2020 M PR 0 . .
123 01MAY2020 PR PR . . .
123 05MAY2020 PR P 1 05MAY2020 .
123 10MAY2020 P P 1 05MAY2020 .
123 11MAY2020 P P 1 05MAY2020 .
123 20MAY2020 P PR 0 05MAY2020 15
123 21MAY2020 PR M 1 21MAY2020 .
123 25MAY2020 M M 0 21MAY2020 4
223 02JAN2020 PR PR . . .
223 15MAR2020 PR PR . . .
402 20MAR2020 M M . . .
777 30MAY2020 PR M 1 30MAY2020 1
1020 01SEP2020 P P . . .
1020 02SEP2020 P P . . .
2030 15JUL2022 P P . . .
2030 29JUL2022 P P . . .
2030 17SEP2022 P P . . .
Now join the max value by ID back to the original data:
proc sql;
create table want as
select *, col1
from have as t1
LEFT JOIN
(select id, max(total_days) as col1
from date_difs
group by id
) as t2
ON t1.id = t2.id
;
quit;
This gives you the exact solution within your sample data.