I've had a problem I've been trying to solve using pandas, but it's been vexing me for a couple of days now. I have 2 dataframes:
data = {'GMC1':[1, 3, 5, 5, 8, 9, 8, 13],
'Provider1':[101, 102, 103, 150, 160, 450, 567, 230]}
dataframe1 = pd.DataFrame(data)
data2 = {'GMC2':[1, 5, 7, 22, 11, 13, 9, 8],
'Provider2':[101, 150, 150, 670, 567, 230, 340, np.nan]}
dataframe2 = pd.DataFrame(data2)
This is a little tricky to describe, but I would like a dataframe returned where:
- Provider1 is not in provider2, where the GMC1 and GMC2 match
- If Provider2 is NaN, then count it as not matching Provider1
So, for the example data:
mergeDF = pd.merge(dataframe1, dataframe2, how = 'inner', left_on = 'GMC1', right_on = 'GMC2')
mergeDF = mergeDF[['GMC1', 'GMC2', 'Provider1', 'Provider2']]
GMC1 | GCM2 | Provider1 | Provider2 |
---|---|---|---|
1 | 1 | 101 | 101 |
5 | 5 | 103 | 150 |
5 | 5 | 150 | 150 |
8 | 8 | 160 | NaN |
8 | 8 | 567 | NaN |
9 | 9 | 450 | 340 |
13 | 13 | 230 | 230 |
- For GMC = 1, I would not want anything returned, because both the gmc and the provider values match.
- For GMC = 5, I would only want the row where provider1 = 103, because there is another row where GMC = 5 where both provider1 and provider2 have the value of 150.
- Return rows where GMC = 8 and 9, because there are not corresponding values in Provider2
I have tried merging as above on GMC, and then merging again on the provider1 and provider2 columns, but I can't get the output I want. I also tried using the isin() function with negation (~), but I couldn't get that to work either, so I'm going slightly mad...
Desired output:
GMC1 GMC2 Provider1
5 5 103
8 8 160
8 8 567
9 9 450
I hope this makes sense, but I'm very happy to clarify further if needed.
Many thanks for your help in advance
CodePudding user response:
Merge is correct, but I think you want how="left"
. Then you can query the merge:
(dataframe1.merge(dataframe2, left_on='GMC1', right_on='GMC2',
how='left')
.query('Provider1 != Provider2')
)
Output:
GMC1 Provider1 GMC2 Provider2
1 3 102 NaN NaN
2 5 103 5.0 150.0
4 8 160 8.0 NaN
5 9 450 9.0 340.0
6 8 567 8.0 NaN
CodePudding user response:
You can query the merge like this I think:
mergeDF.loc[~((mergeDF["GMC1"] == mergeDF["GMC2"]) & (mergeDF["Provider1"]==mergeDF["Provider2"]))]
GMC1 GMC2 Provider1 Provider2
1 5 5 103 150.0
3 8 8 160 NaN
4 8 8 567 NaN
5 9 9 450 340.0