How to join two databases using proc sql using two variables-CodePudding

I am trying to join two datasets based on variables office_id and office_id_flag, each with 50,000 observations.

data1 variables

grp
age
hispanic
id
race
sex
mental_ill
mental_ill_dx
office_id
office_id_flag

data2 variables

er_vis
adm_hr
psych_hosp
psych_vis
region
region_rpc
charges
office_id
office_id_flag

I first tried this in data step merge procedure because I know how that works, and the resulting dataset had 50,000 observations and 17 variables, which makes sense to me. Merge code:

proc sort data=data1; by office_id; run;
proc sort data=data2; by office_id; run;

data work.merge_datastep;
    merge data1 (in=dem) data2 (in=hosp);
    by office_id;
    if dem and hosp;
run;
*53000 observations and 17 variables;

I've tried so many SQL merge types and each one produces a dataset with 56000 observations and 17 variables, which doesn't make sense.

Here's what I've tried in SQL:

proc sql;
create table sql_outerjoin as 
 select * 
 from data1 full outer join data2 on data1.office_id=data2.office_id;
quit;


proc sql;
create table sql_leftjoin as 
 select * from data1 left outer join data2 on data1.office_id=data2.office_id;
quit;

proc sql;
create table work.sql_innerjoin as
select data1.*, data2.*
from work.data1, work.data2
where data1.office_id=data2.office_id;
quit; 

proc sql ;
create table sql_try1 as 
select one.*, two.*
from data1 as one
left join data2 as two
on (one.office_id = two.office_id and one.office_id_flag= two.office_id_flag);
quit;


proc sql;
create table sql_try3 as
select coalesce(a.office_id, b.office_id) as ID
from data1 a
full join data2 b
on a.ID = b.ID;
quit;

proc sql;
  create table sparcs_1 as
  select * 
  from data1, data2
  where data1.office_id=data2.office_id;
quit;


proc sql;
create table work.sql_leftjoin2 as
select s.*, d.*
from work.data1 as s left join work.data2 as d
on s.office_id=d.office_id;
quit;

Additionally, every one of these proc sql tries has resulted in these errors:

WARNING: Variable office_id already exists on file WORK.whatever_table_name.
WARNING: Variable office_id_flag already exists on file WORK.whatever_table_name.

I'd like the final result to be identical to the datastep merge procedure and have all variables and 50,000 observations, matched on office_id and office_id_flag.

I'm out of my depth and am obviously shooting in the dark with this. Does anyone have any suggestions?

CodePudding user response：

If the number of observations is not increasing on data step merge, but is increasing in SQL joins, then you have data that is not unique on the merge/join key. SAS will not increase the number of rows in a merge when there are duplicate rows - it also probably won't do what you want, but it might do something tolerable. SQL, however, will produce (depending on how you do it) extra rows for each combination.

How to solve this? Make sure you have a unique join key where no pair (or more) of rows share the same value in both datasets. Or, consider the right join type for the data you have - it might require summarizing the data.

If the data actually are unique, it's possible you just didn't include the exact combination you wanted.

proc sql;
  create table sparcs_1 as
  select * 
  from data1, data2
  where data1.office_id=data2.office_id
    and data1.office_id_flag=data2.office_id_flag
  ;
quit;

proc sql;
  create table sparcs_1 as
  select * 
  from data1 inner join data2
  on data1.office_id=data2.office_id
    and data1.office_id_flag=data2.office_id_flag
  ;
quit;

Those both require the row to match on both variables and come from both tables. But that's not what your SAS data step merge is doing - it's only depending on office_id, so it might give a different result. Odds are you cannot replicate this join in SQL perfectly - without adding some sort of row numbering - because the way SAS does the merge is just different, and not usually something you'd actually want to do in SQL.

Second, your warnings are because you're doing things like this:

select a.*, b.*

If the same variable is in both datasets, then you're asking for it twice. Hence the warning. It's actually not a problem if the join key is the only variable that overlaps - then the result is what you want, just with the warning - but you can remove it by explicitly listing the variables you want from at least one of the tables. Best practice is to not use * at all, but it's understandable particularly when one table has a lot of variables.

CodePudding user response：

Thanks, Joe! Your comments were very helpful.

I ended up adding a row number based on observation row (from this previously asked question https://communities.sas.com/t5/SAS-Procedures/PROC-SQL-QUESTION-How-to-add-a-row-number-to-a-table-observation/td-p/167770), and testing the resulting dataset in both SAS and proc sql.

Code here:

proc sql;
  create table rows1 as
    select monotonic() as row, *
      from work.data1;
quit;

proc sql; 
    create table rows2 as 
        select monotonic() as row, *
            from work.data2;
quit;

proc sql ;
create table sql_rowmerge1 as 
select rows1.*, row2.er_vis, row2.adm_hr, row2.psych_hosp, row2.psych_vis, row2.region, row2.region_rpc, row2.charges
from rows1
left join rows2
on (rows1.row = rows2.row and rows1.office_id = rows2.office_id and rows1.office_id_flag= rows2.office_id_flag);
quit;

As you suggested, basing the merge on the 3 conditional variables, instead of just the 2 (probably not unique) variables, was what pulled both datasets together in the appropriate way. The resulting dataset now has the correct 50,000 observations.

Thanks again - This was great! I learned a new method!