I have a list of people, and am trying to pull out those people that come up more frequently than average (we'll start with 3 or more times). Simplified data set looks like this:
Person_ID | other value |
---|---|
1 | irrelevantvalue |
1 | irrelevant_value |
1 | irrelevant_value |
1 | irrelevant_value |
2 | irrelevant_value |
2 | irrelevant_value |
3 | irrelevant_value |
4 | irrelevant_value |
4 | irrelevant_value |
4 | irrelevant_value |
The output table would be the following, keeping only the values where this person appeared 3 or more times:
Person_ID | other value |
---|---|
1 | irrelevantvalue |
1 | irrelevant_value |
1 | irrelevant_value |
1 | irrelevant_value |
4 | irrelevant_value |
4 | irrelevant_value |
4 | irrelevant_value |
Thoughts on how to proceed here? Using SAS EG
CodePudding user response:
You can do this with SQL. Filter only to irrelevant values with a where
statement, then count all groups where the total number of irrelevant values is >= 3.
Example data:
data have;
length group 8. value $15.;
input group value$;
datalines;
1 relevant
1 relevant
1 irrelevant
1 irrelevant
1 irrelevant
2 irrelevant
2 relevant
2 relevant
3 irrelevant
3 relevant
3 relevant
3 irrelevant
3 irrelevant
4 irrelevant
4 irrelevant
4 irrelevant
;
run;
Code:
proc sql;
create table irrelevant_values as
select *
from have
where value = 'irrelevant'
group by id
having count(*) GE 3
;
quit;
Output:
group value
1 irrelevant
1 irrelevant
1 irrelevant
3 irrelevant
3 irrelevant
3 irrelevant
4 irrelevant
4 irrelevant
4 irrelevant
CodePudding user response:
You can select the ids having a frequency greater than the average frequency by using sub-queries.
Example:
data have(keep=id rownum);
input id times;
do _n_ = 1 to times; rownum 1; output; end;
datalines;
1 4
2 2
3 1
4 3
;
proc sql;
create table want as
select
have.id, rownum
from
have
, (select mean(freq) as avg_freq
from (select id, count(*) as freq from have group id)
)
group
have.id
having
count(*) > avg_freq
;