My dataset have in some instances been split into multiple rows, and I need to find these instances and recombine the rows. This requires me to calculate the difference between dates in a dataset within a group and subgroup, and then effectively "merge the rows" if the differences between the Start and End dates are 1 day or below within. My dataset looks like this:
ID | Start | End | Place |
1 | 01-01-2020 | 31-03-2020 | Street 1 |
1 | 01-04-2020 | 31-07-2020 | Street 1 |
1 | 01-08-2020 | 31-12-2020 | Street 1 |
1 | 01-01-2021 | 31-03-2021 | Street 2 |
2 | 01-01-2020 | 31-04-2020 | Street 1 |
2 | 31-04-2020 | 31-08-2020 | Street 1 |
3 | 01-01-2020 | 31-03-2020 | Street 1 |
And I would really like to output this:
ID | Start | End | Place |
1 | 01-01-2020 | 31-12-2020 | Street 1 |
1 | 01-01-2021 | 31-03-2021 | Street 2 |
2 | 01-01-2020 | 31-08-2020 | Street 1 |
3 | 01-01-2020 | 31-03-2020 | Street 1 |
So essentially, within the ID group and Place subgroup, if there is a 1 or smaller difference between the previous End and new Start date, then I would like to "combine" the two rows so the start and end date reflect the entire period at that particular place (for instance entire stay at Street 1 for ID 1).
I have tried creating multiple datasteps where I use the LAG-function a lot, and that seems to deal reasonable well with the instance of ID 2, where there are only two rows that need to be considered. However, for ID 1 (Street 1), where I effectively need to join three rows, I have not been able to find a good solution as of yet. Any suggestions of functions that could be usefull will be much appreciated!
CodePudding user response:
You can use a multiple steps approach.
Thanks to @whymath for spotting the dates issues in the input data.
data stage1;
set have;
by id place notsorted;
if first.place then
group_number 1;
run;
proc sort data=stage1 out=stage2;
by id place group_number start;
run;
data want;
do until (last.place);
set stage2;
by id place group_number;
if first.group_number then
_start=start;
if last.place then
do;
start=_start;
output;
end;
end;
drop _start group_number;
run;
id start end place
1 01-01-2020 31-12-2020 Street1
1 01-01-2021 31-03-2021 Street2
2 01-01-2020 31-08-2020 Street1
3 01-01-2020 31-03-2020 Street1
CodePudding user response:
There some errors in your data, I correct them to get your desired output:
Line5: 31-04-2020
→ 30-04-2020
Line6: 31-04-2020
→ 01-05-2020
data have;
infile cards dlm='|';
input id start: ddmmyy10. end: ddmmyy10. place$;
format start end ddmmyyd10.;
cards;
1 | 01-01-2020 | 31-03-2020 | street 1 |
1 | 01-04-2020 | 31-07-2020 | street 1 |
1 | 01-08-2020 | 31-12-2020 | street 1 |
1 | 01-01-2021 | 31-03-2021 | street 2 |
2 | 01-01-2020 | 30-04-2020 | street 1 |
2 | 01-05-2020 | 31-08-2020 | street 1 |
3 | 01-01-2020 | 31-03-2020 | street 1 |
;
run;
The difficult of your question is update and output data according next row. I would suggest you the double set skill.
proc sort data=have;
by id place start;
run;
data want;
set have;
do i=1 to rec;
set have(rename=(id=tmpid place=tmpplace start=tmpstart end=tmpend))nobs=rec point=i;
if id=tmpid and place=tmpplace then do;
if start=tmpend 1 and _n_=i 1 then used=1;
else if end=tmpstart-1 and _n_<i then end=tmpend;
end;
end;
if used^=1;
drop tmp: used;
run;