How to use a pandas groupby to filter this dataframe?-CodePudding

Using Python how can you use a group-by to filter this dataset

Start

First     Last     Location      ID1              ID2             First3   Last3
John      Smith    Toronto       JohnToronto      SmithToronto    Joh      Smi
Joh       Smith    Toronto       JohToronto       SmithToronto    Joh      Smi
Steph     Sax      Vancouver     StephVancouver   SaxVancouver    Ste      Sax
Steph     Sa       Vancouver     StephVancouver   SaxeVancouver   Ste      Sax 
Stacy     Lee      Markham       StacyMarkham     LeeMarkham      Sta      Lee
Stac      Lee      Markham       StacMarkham      LeeMarkham      Sta      Lee
Stac      Wong     Aurora        StacAurora       LeeAurora       Sta      Won
Stac      Lee      Newmarket     StacNewmarket    LeeNewmarket    Sta      Lee
Steve     Smith    Toronto       SteveToronto     SmithToronto    Ste      Smi
John      Jones    Toronto       JohnToronto      JonesToronto    Joh      Jon

How can I make it so that where either the two conditions are accepted, filtering everything else that doesn't meet these two criteria

ID1 - Matches another ID1 and the Last3 are the same
ID2 - Matches another ID2 and the First 3 are the same

End

 First     Last     Location      ID1              ID2             First3   Last3
 John      Smith    Toronto       JohnToronto      SmithToronto    Joh      Smi
 Joh       Smith    Toronto       JohToronto       SmithToronto    Joh      Smi
 Steph     Sax      Vancouver     StephVancouver   SaxVancouver    Ste      Sax
 Steph     Sa       Vancouver     StephVancouver   SaxeVancouver   Ste      Sax 
 Stacy     Lee      Markham       StacyMarkham     LeeMarkham      Sta      Lee
 Stac      Lee      Markham       StacMarkham      LeeMarkham      Sta      Lee

CodePudding user response：

Based on comment for clarification of the problem statement -

trying to groupby ID1 or ID2. And then depending which ID filter if Last3 col and First3 Col are the same respectively

Try this approach -

#group by ID1 and check if duplicates in last3. Then extract the index number that satisfies condition
c1 = df.groupby('ID1').apply(pd.DataFrame.duplicated, subset=['Last3'], keep=False)
c1_idx = c1[c1].droplevel(0).index

#group by ID2 and check if duplicates in first3. Then extract the index number that satisfies condition
c2 = df.groupby('ID2').apply(pd.DataFrame.duplicated, subset=['First3'], keep=False)
c2_idx = c2[c2].droplevel(0).index

#take a union of the 2 indexes and then ..
#filter dataframe for the indexes that meet the 2 independent conditions
output = df.iloc[c1_idx.union(c2_idx)]
print(output)

   First   Last   Location             ID1            ID2 First3 Last3
0   John  Smith    Toronto     JohnToronto   SmithToronto    Joh   Smi
1    Joh  Smith    Toronto      JohToronto   SmithToronto    Joh   Smi
2  Steph    Sax  Vancouver  StephVancouver   SaxVancouver    Ste   Sax
3  Steph     Sa  Vancouver  StephVancouver  SaxeVancouver    Ste   Sax
4  Stacy    Lee    Markham    StacyMarkham     LeeMarkham    Sta   Lee
5   Stac    Lee    Markham     StacMarkham     LeeMarkham    Sta   Lee

EDIT: Modifying the above answer provided by @SomeDude, you can run this as 2 independent conditions without a groupby and take an OR between them as well -

m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df.duplicated(subset=['ID2','First3'],keep=False)
df[m1 | m2]

CodePudding user response：

You may use:

df = pd.DataFrame({
    'First':['John', 'Joh', 'Steph', 'Steph', 'Stacy', 'Stac', 'Stac', 'Stac', 'Steve', 'John'],
    'Last':['Smith', 'Smith', 'Sax', 'Sa', 'Lee', 'Lee', 'Wong', 'Lee', 'Smith', 'Jones'],
    'Location':['Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Markham', 
                'Markham', 'Aurora', 'Newmarket', 'Toronto', 'Toronto'],
    'ID1':['JohnToronto', 'JohnToronto', 'StephVancouver', 'StephVancouver', 'StacyMarkham',
          'StacyMarkham','StacAurora', 'StacNewmarket','SteveToronto','JohnToronto'],
    'ID2':['SmithToronto','SmithToronto','SaxVancouver','SaxVancouver',
          'LeeMarkham','LeeMarkham','LeeAurora','LeeNewmarket','SmithToronto','JonesToronto'],
    
    'First3':['Joh','Joh','Ste','Ste','Sta','Sta','Sta','Sta','Ste','Joh'],
    'Last3':['Smi','Smi','Sax','Sax','Lee','Lee','Won','Lee','Smi','Jon']
})
m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df[m1].duplicated(subset=['ID2','First3'],keep=False)
df = df[m1 & m2]