Find a pattern in middle of multiple sentences-CodePudding

I have a dataframe as below,

data = [
    [ 1, 'AR-123456' ],
    [ 1, '123456' ],
    [ 2, '345678' ],
    [ 3,'Application-12345678901'],
    [ 3, '12345678901']
]
df = pd.DataFrame(data, columns=['Case', 'ID'] )

Case	ID
1	AR-123456
1	123456
2	345678
3	Application-12345678901
3	12345678901

So basically I am trying to remove rows where for the same Case, the IDs are digits from AR- or Application- i.e., the final expected output :-

Case	ID
1	AR-123456
2	345678
3	Application-12345678901

CodePudding user response：

Extract the digits and drop_duplicates:

df["digits"] = df["ID"].str.extract("(\d )")
output = df.drop_duplicates(["Case","digits"]).drop("digits",axis=1)

>>> output
   Case                       ID
0     1                AR-123456
2     2                   345678
3     3  Application-12345678901

CodePudding user response：

You can groupby Case column and then drop the duplicated ID rows that is in AR/Application item

out = (df.groupby('Case')
       .apply(lambda g: g[~g['ID'].isin(g['ID'].str.extract('(AR|Application)-(\d )')[1])])
       .reset_index(drop=True))

print(out)

   Case                       ID
0     1                AR-123456
1     2                   345678
2     3  Application-12345678901