I have a dataframe as below,
data = [
[ 1, 'AR-123456' ],
[ 1, '123456' ],
[ 2, '345678' ],
[ 3,'Application-12345678901'],
[ 3, '12345678901']
]
df = pd.DataFrame(data, columns=['Case', 'ID'] )
Case | ID |
---|---|
1 | AR-123456 |
1 | 123456 |
2 | 345678 |
3 | Application-12345678901 |
3 | 12345678901 |
So basically I am trying to remove rows where for the same Case, the IDs are digits from AR- or Application- i.e., the final expected output :-
Case | ID |
---|---|
1 | AR-123456 |
2 | 345678 |
3 | Application-12345678901 |
CodePudding user response:
Extract the digits and drop_duplicates
:
df["digits"] = df["ID"].str.extract("(\d )")
output = df.drop_duplicates(["Case","digits"]).drop("digits",axis=1)
>>> output
Case ID
0 1 AR-123456
2 2 345678
3 3 Application-12345678901
CodePudding user response:
You can groupby Case
column and then drop the duplicated ID
rows that is in AR
/Application
item
out = (df.groupby('Case')
.apply(lambda g: g[~g['ID'].isin(g['ID'].str.extract('(AR|Application)-(\d )')[1])])
.reset_index(drop=True))
print(out)
Case ID
0 1 AR-123456
1 2 345678
2 3 Application-12345678901