Pandas regex replace negation

I have the following dataframe:

>>> df = pd.DataFrame(['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA'], columns=['P'])
>>> df
                 P
0  0123_GRP_LE_BNS
1      ABC_GRP_BNS
2          DEF_GRP
3     456A_GRP_SSA

and want to remove the characters that appear after GRP when they are not '_LE'; when '_LE' does follow, I want to remove the characters after GRP_LE instead.

The desired output is:

0     0123_GRP_LE
1         ABC_GRP
2         DEF_GRP
3        456A_GRP

I tried the following pattern matching, but the output was not what I expected:

>>> df['P'].replace({r'(.*_GRP)[^_LE].*':r'\1', r'(.*GRP_LE)_.*':r'\1'}, regex=True)
0     0123_GRP_LE
1     ABC_GRP_BNS
2         DEF_GRP
3    456A_GRP_SSA
Name: P, dtype: object

Can someone help with diagnosis?

CodePudding user response：

Why not make _LE optional?

df['P'].str.replace(r'(GRP(?:_LE)?).*', r'\1', regex=True)

Output:

0    0123_GRP_LE
1        ABC_GRP
2        DEF_GRP
3       456A_GRP
Name: P, dtype: object
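
As for the diagnosis the question asks for, a likely explanation is that [^_LE] is a negated character class: it matches a single character that is not '_', 'L', or 'E', rather than "anything other than the text _LE". The sketch below uses Python's re module instead of pandas to show this, and then tries a negative lookahead, (?!_LE), as one possible repair of the original two-pattern approach. The variable names are illustrative only.

```python
import re

codes = ['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA']

# [^_LE] matches ONE character that is not '_', 'L', or 'E'.
# In 'ABC_GRP_BNS' the character right after '_GRP' is '_', which is
# excluded, so the original pattern cannot match and nothing is replaced.
original = re.compile(r'(.*_GRP)[^_LE].*')
assert original.search('ABC_GRP_BNS') is None

# (?!_LE) is a negative lookahead: it asserts that the literal text '_LE'
# does not start at this position, without consuming any characters.
lookahead = re.compile(r'(.*_GRP)(?!_LE).*')
after_grp = [lookahead.sub(r'\1', s) for s in codes]

# Second pass (the question's second pattern, unchanged): trim whatever
# follows 'GRP_LE'.
after_le = [re.sub(r'(.*GRP_LE)_.*', r'\1', s) for s in after_grp]

print(after_le)
```

The single pattern with the optional (?:_LE)? group above is shorter and avoids the second pass; the lookahead version is shown only to explain why the original attempt left rows 1 and 3 untouched.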

CodePudding user response：

I find Python's string operations easier to work with and less error-prone than regex; I think this does what you're looking for:

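def strip_code(code_str):
    # Check the longer marker first so that "_LE" is kept when present.
    if "GRP_LE" in code_str:
        # partition() returns (before, separator, after); keep the first two.
        return "".join(code_str.partition("GRP_LE")[0:2])
    elif "GRP" in code_str:
        return "".join(code_str.partition("GRP")[0:2])
    # No marker found: return the string unchanged.
    return code_str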


df.P.apply(strip_code)

Output:

0    0123_GRP_LE
1        ABC_GRP
2        DEF_GRP
3       456A_GRP
Name: P, dtype: object
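
For readers unfamiliar with str.partition, here is a small sketch of what it returns and why joining the first two pieces drops the suffix. The sample strings are only for illustration.

```python
code = '0123_GRP_LE_BNS'

# str.partition splits at the FIRST occurrence of the separator and
# returns a 3-tuple: (text before, separator, text after).
head, sep, tail = code.partition('GRP_LE')

# Keeping head + sep discards everything after the separator.
trimmed = head + sep

# When the separator is absent, partition returns (original, '', ''),
# so head + sep would simply give back the original string.
missing = 'XYZ'.partition('GRP')

print(head, sep, tail, trimmed, missing)
```

The membership checks in strip_code are still useful because they decide which marker takes priority: "GRP_LE" is tried before the shorter "GRP".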