Home > Mobile >  Pandas Dataframe Merge Where 1 Column Matches, but Another Column's Values are not Present
Pandas Dataframe Merge Where 1 Column Matches, but Another Column's Values are not Present

Time:11-03

I've been trying to solve something using Pandas for a few days now, but I feel like I'm missing something. I have 2 dataframes:

data = {'GMC1':[1, 1, 1, 2, 3, 3, 3],
        'Provider1':[100, 101, 102, np.NaN, 104, 105, 106]}

dataframe1 = pd.DataFrame(data)

dataframe1

output:

GMC1    Provider 1
1       100
1       101
1       102
2       NaN
3       104
3       105
3       106

and

data2 = {'GMC2':[1, 2, 3, 3, 3],
         'Provider2':[101, 100, 104, 105, 107]}

dataframe2 = pd.DataFrame(data2)

dataframe2

output:

GMC2    Provider2
1       101
2       100
3       105
3       104
3       107

I would like a dataframe returned which joins on GMC1 = GMC2, and brings back the Provider2 rows where they are not present in Provider1, for the same values of GMC1 and GMC2.

Expected output:

GMC1    GMC2    Provider 2
2       2       100
3       3       107

I have tried various approaches using joins, like this:

(dataframe1.merge(dataframe2, left_on='GMC1', right_on='GMC2',
                  how='right')
           .query('Provider1 != Provider2')
)

But they don't quite bring back what I want. I'm aware this is a bit wordy, so I'm very happy to elaborate.

Many thanks in advance for your help!

CodePudding user response:

One way would be to create a helper column on both dataframes, let's call it ['GP']:

dataframe2['Provider2'] = dataframe2['Provider2'].astype(float)  # setting this to float so the merge will work
dataframe1['GP'] = dataframe1[['GMC1', 'Provider1']].astype(str).agg(' '.join, axis=1)
dataframe2['GP'] = dataframe2[['GMC2', 'Provider2']].astype(str).agg(' '.join, axis=1)

Then, if we merge on this column and exclude NAs:

dataframe1.merge(dataframe2, on='GP', how='outer').query('GMC1.isna()')

Will return:

    GMC1    Provider1   GP     GMC2 Provider2
7   NaN     NaN      2 100.0    2.0     100.0
8   NaN     NaN      3 107.0    3.0     107.0

From here, you easily can reach your desired solution with a bit of wrangling, running for example:

result = dataframe1.merge(dataframe2, on='GP', how='outer').query('GMC1.isna()')

result['GMC1'] = result['GP'].apply(lambda x: x.split(' ')[0])

result = result.drop(columns=['GP', 'Provider1'])

Will result in your desired dataframe:

    GMC1    GMC2    Provider2
7   2       2.0         100.0
8   3       3.0         107.0

Sorry that the answer is a bit wordy, but hope it helps!

CodePudding user response:

The merge is indeed the way to go. But you must do the query in a groupby, to only compare providers for the same GMC value:

dataframe1.merge(dataframe2, left_on='GMC1', right_on='GMC2').groupby('GMC1').apply(
    lambda df: df[~df['Provider2'].isin(df['Provider1'].values)]
    )[['GMC1', 'GMC2', 'Provider2']].drop_duplicates().reset_index(drop=True)

Gives as expected

   GMC1  GMC2  Provider2
0     2     2        100
1     3     3        107

CodePudding user response:

Use outer join by both columns and indicator parameter and filter by this new column _merge:

df = dataframe1.merge(dataframe2, 
                      left_on=['GMC1', 'Provider1'],
                      right_on=['GMC2', 'Provider2'], 
                      how='right', 
                      indicator=True)

df = (df.loc[df['_merge'].eq('right_only'), ['GMC1','GMC2','Provider2']]
        .assign(GMC1 = lambda x: x['GMC2']))

print (df)
   GMC1  GMC2  Provider2
1     2     2        100
4     3     3        107
  • Related