I have the following dataframe:
>>> df = pd.DataFrame(['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA'], columns=['P'])
>>> df
P
0 0123_GRP_LE_BNS
1 ABC_GRP_BNS
2 DEF_GRP
3 456A_GRP_SSA
and I want to remove the characters that appear after GRP if they are not '_LE', or otherwise remove the characters after GRP_LE.
The desired output is:
0 0123_GRP_LE
1 ABC_GRP
2 DEF_GRP
3 456A_GRP
I tried the following pattern matching, but the output was not what I expected:
>>> df['P'].replace({r'(.*_GRP)[^_LE].*':r'\1', r'(.*GRP_LE)_.*':r'\1'}, regex=True)
0 0123_GRP_LE
1 ABC_GRP_BNS
2 DEF_GRP
3 456A_GRP_SSA
Name: P, dtype: object
Can someone help diagnose the problem?
CodePudding user response:
Why not make _LE optional?
df['P'].str.replace(r'(GRP(?:_LE)?).*', r'\1', regex=True)
Output:
0 0123_GRP_LE
1 ABC_GRP
2 DEF_GRP
3 456A_GRP
Name: P, dtype: object
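As for the diagnosis: `[^_LE]` is a character class, so it matches one single character that is not `_`, `L`, or `E`. It does not mean "not followed by _LE". In every row the character right after `_GRP` is `_` (or there is nothing at all), so the first pattern never matches and only the second pattern changes anything. A negative lookahead, `(?!_LE)`, is probably closer to what was intended. Below is a minimal sketch using the standard `re` module on plain strings rather than on the DataFrame; the variable names are my own and are not from the question.

```python
import re

samples = ['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA']

# Original first pattern: [^_LE] consumes exactly ONE character that is
# not '_', 'L' or 'E'. After '_GRP' comes '_' (or end of string), so no match.
original = re.compile(r'(.*_GRP)[^_LE].*')
original_hits = [bool(original.search(s)) for s in samples]

# Negative lookahead: '_GRP' must not be followed by the sequence '_LE'.
not_le = re.compile(r'(.*_GRP)(?!_LE).*')
# Second pattern from the question, unchanged.
with_le = re.compile(r'(.*GRP_LE)_.*')

# Apply both patterns in turn to each string.
fixed = [with_le.sub(r'\1', not_le.sub(r'\1', s)) for s in samples]

print(original_hits)
print(fixed)
```

The same two patterns should be usable in the dict passed to `df['P'].replace(..., regex=True)`, although the single optional-group pattern above is shorter.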
CodePudding user response:
I find Python's string operations easier to work with and less error-prone than regex; I think this does what you're looking for:
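def strip_code(code_str):
    # Check the longer marker first so that "_LE" is kept when present.
    if "GRP_LE" in code_str:
        # partition() returns (before, separator, after); keep the first two.
        return "".join(code_str.partition("GRP_LE")[0:2])
    elif "GRP" in code_str:
        return "".join(code_str.partition("GRP")[0:2])
    # No marker found: return the string unchanged.
    return code_str

df.P.apply(strip_code)
Output: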
0 0123_GRP_LE
1 ABC_GRP
2 DEF_GRP
3 456A_GRP
Name: P, dtype: object
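As a quick sanity check, the string-based function and the optional-group regex can be compared on the sample values. This is only a sketch on plain Python strings; `strip_code_regex`, `samples`, and `expected` are names introduced here for illustration.

```python
import re

def strip_code(code_str):
    # Same logic as the string-based answer.
    if "GRP_LE" in code_str:
        return "".join(code_str.partition("GRP_LE")[0:2])
    elif "GRP" in code_str:
        return "".join(code_str.partition("GRP")[0:2])
    return code_str

def strip_code_regex(code_str):
    # Same pattern as the regex answer, applied with re.sub.
    return re.sub(r'(GRP(?:_LE)?).*', r'\1', code_str)

samples = ['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA']
expected = ['0123_GRP_LE', 'ABC_GRP', 'DEF_GRP', '456A_GRP']

by_partition = [strip_code(s) for s in samples]
by_regex = [strip_code_regex(s) for s in samples]

print(by_partition == expected, by_regex == expected)
```

The two approaches agree on the data in the question, but they are not equivalent in general. For a made-up value such as 'GRP_X_GRP_LE_Y', the function cuts at the first "GRP_LE" while the regex cuts at the first "GRP", so it is worth deciding which behaviour is wanted if GRP can occur more than once.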