Home > Enterprise >  SAS delete and group by
SAS delete and group by

Time:03-22

Simplified version of the dataset I have is:

DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" LMK" 16
3 "LMK" . 29


;RUN;

I am looking to compare match1 and match2, and if anywhere in the ID column match1 does not equal match2, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4) since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the rows where they dont match, which isnt terribly helpful for this application. I assume it would be easier to make it a new data set with some wheres but I am unsure how to begin there. Any ideas / advice?

EDIT: Apologies, I dumbed down my dataset too much and forgot about an important exception. Note in my new dataset (I only added one row to the end). I do NOT want to delete group 3, since match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.

Thanks

CodePudding user response:

There's a few ways to do this. One would be to just construct a dataset of IDs that have non-matching rows, then do a merge or a SQL join and remove anything that matched this list.

However, my preferred option (partly because of speed, but also it's more straightforward once you understand how it works) is the DoW loop.

data want;
  id_nonmatch = 0;
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    if match1 ne match2 then id_nonmatch = 1;   *set the flag to 1 if we find a nonmatch;
  end;
  
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;    
    if id_nonmatch = 0 then output;
  end;
run;

There are two set statements on the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing. The first loop goes over all of the rows for one ID, checks if any violate the constraint, and if none do, the flag variable (id_nonmatch) stays 0. If one does, it becomes a 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement, and goes onto the second - re-pulling those same rows. Now, it outputs only when the flag is a zero.

This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)


Just to show this difference, here is a log showing that there isn't much additional time needed for the second read - when the record is reasonably sized. This benefit is less when the record is very small - I imagine there is more overhead involved. Note that the second read adds only 1/7 of the time of the first read to the total processing time!

 69         data have;
 70           call streaminit(7);
 71           length strvar $1000;
 72           do id = 1 to 100000;
 73             do iter = 1 to 50;
 74               x = rand('Uniform');
 75               output;
 76             end;
 77           end;
 78         run;
 
 NOTE: Variable strvar is uninitialized.
 NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
 NOTE: DATA statement used (Total process time):
       real time           5.20 seconds
       cpu time            5.20 seconds
       
 
 79         
 80         
 81         data _null_;
 82           do _n_ = 1 by 1 until (last.id);
 83             set have;
 84             by id;
 85           end;
 86         run;
 
 NOTE: There were 5000000 observations read from the data set WORK.HAVE.
 NOTE: DATA statement used (Total process time):
       real time           2.37 seconds
       cpu time            2.37 seconds
       
 
 87         
 88         
 89         data _null_;
 90           do _n_ = 1 by 1 until (last.id);
 91             set have;
 92             by id;
 93           end;
 94           do _n_ = 1 by 1 until (last.id);
 95             set have;
 96             by id;
 97           end;
 98         run;
 
 NOTE: There were 5000000 observations read from the data set WORK.HAVE.
 NOTE: There were 5000000 observations read from the data set WORK.HAVE.
 NOTE: DATA statement used (Total process time):
       real time           2.74 seconds
       cpu time            2.73 seconds

CodePudding user response:

It is easy to do this with an SQL query with a GROUP BY and HAVING clause.

proc sql;
create table want as
  select * 
  from have
  group by id
  having max( (match1 ne match2) and not missing(match2))
;
quit;

SAS evaluates boolean expressions as 1/0 for TRUE/FALSE so the MAX() of a series of TRUE/FALSE values will be TRUE if ANY of them are TRUE.

  • Related