Using Python how can you use a group-by to filter this dataset
Start
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
Stac Wong Aurora StacAurora LeeAurora Sta Won
Stac Lee Newmarket StacNewmarket LeeNewmarket Sta Lee
Steve Smith Toronto SteveToronto SmithToronto Ste Smi
John Jones Toronto JohnToronto JonesToronto Joh Jon
How can I make it so that where either the two conditions are accepted, filtering everything else that doesn't meet these two criteria
- ID1 - Matches another ID1 and the Last3 are the same
- ID2 - Matches another ID2 and the First 3 are the same
End
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
CodePudding user response:
Based on comment for clarification of the problem statement -
trying to groupby ID1 or ID2. And then depending which ID filter if Last3 col and First3 Col are the same respectively
Try this approach -
#group by ID1 and check if duplicates in last3. Then extract the index number that satisfies condition
c1 = df.groupby('ID1').apply(pd.DataFrame.duplicated, subset=['Last3'], keep=False)
c1_idx = c1[c1].droplevel(0).index
#group by ID2 and check if duplicates in first3. Then extract the index number that satisfies condition
c2 = df.groupby('ID2').apply(pd.DataFrame.duplicated, subset=['First3'], keep=False)
c2_idx = c2[c2].droplevel(0).index
#take a union of the 2 indexes and then ..
#filter dataframe for the indexes that meet the 2 independent conditions
output = df.iloc[c1_idx.union(c2_idx)]
print(output)
First Last Location ID1 ID2 First3 Last3
0 John Smith Toronto JohnToronto SmithToronto Joh Smi
1 Joh Smith Toronto JohToronto SmithToronto Joh Smi
2 Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
3 Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
4 Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
5 Stac Lee Markham StacMarkham LeeMarkham Sta Lee
EDIT: Modifying the above answer provided by @SomeDude, you can run this as 2 independent conditions without a groupby and take an OR between them as well -
m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df.duplicated(subset=['ID2','First3'],keep=False)
df[m1 | m2]
CodePudding user response:
You may use:
df = pd.DataFrame({
'First':['John', 'Joh', 'Steph', 'Steph', 'Stacy', 'Stac', 'Stac', 'Stac', 'Steve', 'John'],
'Last':['Smith', 'Smith', 'Sax', 'Sa', 'Lee', 'Lee', 'Wong', 'Lee', 'Smith', 'Jones'],
'Location':['Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Markham',
'Markham', 'Aurora', 'Newmarket', 'Toronto', 'Toronto'],
'ID1':['JohnToronto', 'JohnToronto', 'StephVancouver', 'StephVancouver', 'StacyMarkham',
'StacyMarkham','StacAurora', 'StacNewmarket','SteveToronto','JohnToronto'],
'ID2':['SmithToronto','SmithToronto','SaxVancouver','SaxVancouver',
'LeeMarkham','LeeMarkham','LeeAurora','LeeNewmarket','SmithToronto','JonesToronto'],
'First3':['Joh','Joh','Ste','Ste','Sta','Sta','Sta','Sta','Ste','Joh'],
'Last3':['Smi','Smi','Sax','Sax','Lee','Lee','Won','Lee','Smi','Jon']
})
m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df[m1].duplicated(subset=['ID2','First3'],keep=False)
df = df[m1 & m2]