I've been trying to solve something using Pandas for a few days now, but I feel like I'm missing something. I have 2 dataframes:
data = {'GMC1':[1, 1, 1, 2, 3, 3, 3],
'Provider1':[100, 101, 102, np.NaN, 104, 105, 106]}
dataframe1 = pd.DataFrame(data)
dataframe1
output:
GMC1 Provider 1
1 100
1 101
1 102
2 NaN
3 104
3 105
3 106
and
data2 = {'GMC2':[1, 2, 3, 3, 3],
'Provider2':[101, 100, 104, 105, 107]}
dataframe2 = pd.DataFrame(data2)
dataframe2
output:
GMC2 Provider2
1 101
2 100
3 105
3 104
3 107
I would like a dataframe returned which joins on GMC1 = GMC2, and brings back the Provider2 rows where they are not present in Provider1, for the same values of GMC1 and GMC2.
Expected output:
GMC1 GMC2 Provider 2
2 2 100
3 3 107
I have tried various approaches using joins, like this:
(dataframe1.merge(dataframe2, left_on='GMC1', right_on='GMC2',
how='right')
.query('Provider1 != Provider2')
)
But they don't quite bring back what I want. I'm aware this is a bit wordy, so I'm very happy to elaborate.
Many thanks in advance for your help!
CodePudding user response:
One way would be to create a helper column on both dataframes, let's call it ['GP']
:
dataframe2['Provider2'] = dataframe2['Provider2'].astype(float) # setting this to float so the merge will work
dataframe1['GP'] = dataframe1[['GMC1', 'Provider1']].astype(str).agg(' '.join, axis=1)
dataframe2['GP'] = dataframe2[['GMC2', 'Provider2']].astype(str).agg(' '.join, axis=1)
Then, if we merge on this column and exclude NAs:
dataframe1.merge(dataframe2, on='GP', how='outer').query('GMC1.isna()')
Will return:
GMC1 Provider1 GP GMC2 Provider2
7 NaN NaN 2 100.0 2.0 100.0
8 NaN NaN 3 107.0 3.0 107.0
From here, you easily can reach your desired solution with a bit of wrangling, running for example:
result = dataframe1.merge(dataframe2, on='GP', how='outer').query('GMC1.isna()')
result['GMC1'] = result['GP'].apply(lambda x: x.split(' ')[0])
result = result.drop(columns=['GP', 'Provider1'])
Will result in your desired dataframe:
GMC1 GMC2 Provider2
7 2 2.0 100.0
8 3 3.0 107.0
Sorry that the answer is a bit wordy, but hope it helps!
CodePudding user response:
The merge is indeed the way to go. But you must do the query in a groupby, to only compare providers for the same GMC value:
dataframe1.merge(dataframe2, left_on='GMC1', right_on='GMC2').groupby('GMC1').apply(
lambda df: df[~df['Provider2'].isin(df['Provider1'].values)]
)[['GMC1', 'GMC2', 'Provider2']].drop_duplicates().reset_index(drop=True)
Gives as expected
GMC1 GMC2 Provider2
0 2 2 100
1 3 3 107
CodePudding user response:
Use outer join by both columns and indicator parameter and filter by this new column _merge
:
df = dataframe1.merge(dataframe2,
left_on=['GMC1', 'Provider1'],
right_on=['GMC2', 'Provider2'],
how='right',
indicator=True)
df = (df.loc[df['_merge'].eq('right_only'), ['GMC1','GMC2','Provider2']]
.assign(GMC1 = lambda x: x['GMC2']))
print (df)
GMC1 GMC2 Provider2
1 2 2 100
4 3 3 107