I have a dataset and I am looking to see if there is a way to match data based on col values.
col-A col-B
Apple squash
Apple lettuce
Banana Carrot
Banana Carrot
Banana Carrot
dragon turnip
melon potato
melon potato
pear potato
Match
- if col A matches another col a and col b doesn't match
- if col B matches another col B and col a doesn't match
col-A col-B
Apple squash
Apple lettuce
melon potato
melon potato
pear potato
edit fixed typo
edit2 fixed 2nd typo
CodePudding user response:
IIUC, you need to compute two masks to identify which group has a unique match with the other values:
m1 = df.groupby('col-B')['col-A'].transform('nunique').gt(1)
m2 = df.groupby('col-A')['col-B'].transform('nunique').gt(1)
out = df[m1|m2]
Output:
col-A col-B
0 Apple squash
1 Apple lettuce
6 melon potato
7 melon potato
8 pear potato
You can also get the unique/exclusive pairs with:
df[~(m1|m2)]
col-A col-B
2 Banana Carrot
3 Banana Carrot
4 Banana Carrot
5 Pear Cabbage
CodePudding user response:
So, if I understand well, you want to select each rows, such that grouping for colA (resp. colB) then colB (resp. colA) lead to more than one group.
I can advice :
grA = df2.groupby("colA").filter(lambda x : x.groupby("colB").ngroups > 1)
grB = df2.groupby("colB").filter(lambda x : x.groupby("colA").ngroups > 1)
Leading to :
grA
colA colB
0 Apple squash
1 Apple lettuce
and
grB
colA colB
6 melon potato
7 melon potato
8 pear potato
Merging the two dataframes will lead to the desired ouput.