I have a column containing names. I want to remove a name, along with its ; separator, wherever it is marked as (Retired) or (retired). The problem is that the marker does not always appear in the same position. Sometimes the cell has multiple names and only one of them is retired. In other cases the cell has the first name, followed by the retired marker, then the last name.
Dataframe = df
Sample column values - Current state
Owner Name
George (Georgy) (Retired) Clooney
Meghan (retired) Markle
Harry Porter (Retired)
Hermione Granger; Harry Porter (Retired)
Ginny Weasley; Ron Weasley; Harry Porter (retired); Luna Lovegood
Sample column values - Future state
Owner Name
Null
Null
Null
Hermione Granger
Ginny Weasley; Ron Weasley; Luna Lovegood
I thought of using replace with "", but it does not work. I would appreciate any pointers.
CodePudding user response:
split, filter, then join again with groupby.agg:
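# Split each cell on ';', drop the pieces marked (Retired), re-join per row.
# Rows where every name is retired vanish from the result and become NaN
# when the result is aligned back onto df's index by the assignment.
df['Owner Name'] = (df['Owner Name']
    .str.split(r';\s*', expand=True).stack()
    .loc[lambda s: ~s.str.contains(r'\(Retired\)', case=False)]
    .groupby(level=0).agg('; '.join)
)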
Output:
Owner Name
0 NaN
1 NaN
2 NaN
3 Hermione Granger
4 Ginny Weasley; Ron Weasley; Luna Lovegood
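For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same split / filter / join idea. It swaps stack() for explode() and adds an explicit reindex so the all-retired rows are visibly filled with NaN. The DataFrame is rebuilt from the sample in the question, and the variable names parts, kept and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Owner Name": [
    "George (Georgy) (Retired) Clooney",
    "Meghan (retired) Markle",
    "Harry Porter (Retired)",
    "Hermione Granger; Harry Porter (Retired)",
    "Ginny Weasley; Ron Weasley; Harry Porter (retired); Luna Lovegood",
]})

# One row per individual name; the original row label is repeated in the index.
parts = df["Owner Name"].str.split(r";\s*").explode()

# Keep only the names that do not carry the (retired) marker, in any case.
kept = parts[~parts.str.contains(r"\(retired\)", case=False)]

# Re-join the surviving names per original row; rows with no survivors
# are missing here, so reindex puts them back as NaN.
result = kept.groupby(level=0).agg("; ".join).reindex(df.index)

df["Owner Name"] = result
print(df)
```

The reindex step is optional when assigning straight back to the column, since pandas aligns on the index anyway, but it makes the intent explicit if result is used on its own.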
CodePudding user response:
With single regex replacement:
import numpy as np

# Remove every ';'-delimited chunk that contains (retired), then tidy up.
df['Owner Name'] = (df['Owner Name']
    .str.replace(r'[^;]*\(retired\)[^;]*;?', '', regex=True, case=False)
    .str.strip(';').replace('', np.nan)
)
Owner Name
0 NaN
1 NaN
2 NaN
3 Hermione Granger
4 Ginny Weasley; Ron Weasley; Luna Lovegood
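To see what the pattern does outside pandas, here is a small sketch using the standard re module with the same regex. The helper name strip_retired is hypothetical, and it deviates from the snippet above in one way: it strips spaces as well as semicolons, because when the retired name comes first (a case not in the sample data) the remaining text would otherwise keep a leading space.

```python
import re

# Same pattern as in the answer, compiled case-insensitively.
RETIRED = re.compile(r"[^;]*\(retired\)[^;]*;?", flags=re.IGNORECASE)

def strip_retired(cell):
    """Hypothetical helper: drop retired names, return None if nothing is left."""
    # Each match is one ';'-delimited chunk containing (retired),
    # plus the ';' that follows it, if any.
    cleaned = RETIRED.sub("", cell).strip("; ")
    return cleaned or None

samples = [
    "George (Georgy) (Retired) Clooney",
    "Hermione Granger; Harry Porter (Retired)",
    "Ginny Weasley; Ron Weasley; Harry Porter (retired); Luna Lovegood",
    "Harry Porter (Retired); Hermione Granger",  # retired name first (not in the question)
]
for s in samples:
    print(repr(strip_retired(s)))
```

The function can be applied to a column with df['Owner Name'].map(strip_retired) if the vectorised str.replace is not needed.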
Execution time comparison (just in case):
In [364]: %timeit df['Owner Name'].str.replace(r'[^;]*\(retired\)[^;]*;?', "", regex=True, case=False).str.strip(';').replace("", np.nan)
322 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [365]: %timeit df['Owner Name'].str.split(';\s*', expand=True).stack().loc[lambda s: ~s.str.contains('\(Retired\)', case=False)].groupby(level=0).agg('; '.join)
1.19 ms ± 8.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)