Home > database >  SAS: How to calculate date differences between rows in grouped data
SAS: How to calculate date differences between rows in grouped data

Time:04-21

My dataset have in some instances been split into multiple rows, and I need to find these instances and recombine the rows. This requires me to calculate the difference between dates in a dataset within a group and subgroup, and then effectively "merge the rows" if the differences between the Start and End dates are 1 day or below within. My dataset looks like this:

ID  |      Start      |        End         |     Place   |
 1  |   01-01-2020    |     31-03-2020     |   Street 1  |
 1  |   01-04-2020    |     31-07-2020     |   Street 1  |
 1  |   01-08-2020    |     31-12-2020     |   Street 1  |
 1  |   01-01-2021    |     31-03-2021     |   Street 2  |
 2  |   01-01-2020    |     31-04-2020     |   Street 1  |
 2  |   31-04-2020    |     31-08-2020     |   Street 1  |
 3  |   01-01-2020    |     31-03-2020     |   Street 1  |

And I would really like to output this:

ID  |      Start      |        End         |     Place    |
 1  |   01-01-2020    |     31-12-2020     |   Street 1   |
 1  |   01-01-2021    |     31-03-2021     |   Street 2   |
 2  |   01-01-2020    |     31-08-2020     |   Street 1   |
 3  |   01-01-2020    |     31-03-2020     |   Street 1   |

So essentially, within the ID group and Place subgroup, if there is a 1 or smaller difference between the previous End and new Start date, then I would like to "combine" the two rows so the start and end date reflect the entire period at that particular place (for instance entire stay at Street 1 for ID 1).

I have tried creating multiple datasteps where I use the LAG-function a lot, and that seems to deal reasonable well with the instance of ID 2, where there are only two rows that need to be considered. However, for ID 1 (Street 1), where I effectively need to join three rows, I have not been able to find a good solution as of yet. Any suggestions of functions that could be usefull will be much appreciated!

CodePudding user response:

You can use a multiple steps approach.
Thanks to @whymath for spotting the dates issues in the input data.

data stage1;
    set have;
    by id place notsorted;

    if first.place then
        group_number 1;
run;

proc sort data=stage1 out=stage2;
    by id place group_number start;
run;

data want;
    do until (last.place);
        set stage2;
        by id place group_number;

        if first.group_number then
            _start=start;

        if last.place then
            do;
                start=_start;
                output;
            end;
    end;
    drop _start group_number;
run;
 id   start       end     place
 1 01-01-2020 31-12-2020 Street1
 1 01-01-2021 31-03-2021 Street2
 2 01-01-2020 31-08-2020 Street1
 3 01-01-2020 31-03-2020 Street1

CodePudding user response:

There some errors in your data, I correct them to get your desired output:
Line5: 31-04-202030-04-2020
Line6: 31-04-202001-05-2020

data have;
  infile cards dlm='|';
  input id start: ddmmyy10. end: ddmmyy10. place$;
  format start end ddmmyyd10.;
  cards;
 1  |   01-01-2020    |     31-03-2020     |   street 1  |
 1  |   01-04-2020    |     31-07-2020     |   street 1  |
 1  |   01-08-2020    |     31-12-2020     |   street 1  |
 1  |   01-01-2021    |     31-03-2021     |   street 2  |
 2  |   01-01-2020    |     30-04-2020     |   street 1  |
 2  |   01-05-2020    |     31-08-2020     |   street 1  |
 3  |   01-01-2020    |     31-03-2020     |   street 1  |
 ;
run;

The difficult of your question is update and output data according next row. I would suggest you the double set skill.

proc sort data=have;
  by id place start;
run;

data want;
  set have;

  do i=1 to rec;
    set have(rename=(id=tmpid place=tmpplace start=tmpstart end=tmpend))nobs=rec point=i;
    if id=tmpid and place=tmpplace then do;
      if start=tmpend 1 and _n_=i 1 then used=1;
      else if end=tmpstart-1 and _n_<i then end=tmpend;
    end;
  end;

  if used^=1;
  drop tmp: used;
run;
  • Related