Home > Enterprise >  Python - Pandas - String Replacement with regex | (OR)
Python - Pandas - String Replacement with regex | (OR)

Time:10-19

I've got 2 Dataframes :

df_bisID = pd.DataFrame({'A': ['ID1#ID2', 'ID3#ID4'],
                         'B': ['ID5#ID6', 'ID7#ID8'],
                         'C': ['ID9#ID10', 'ID11#ID12'],
                         'D': ['ID13#ID14', 'ID15#ID16']})

    A       B       C           D
0   ID1#ID2 ID5#ID6 ID9#ID10    ID13#ID14
1   ID3#ID4 ID7#ID8 ID11#ID12   ID15#ID16


df_eliminateID = pd.DataFrame({'A': ['ID1', 'ID3', 'ID9', 'ID11'],
                               'B': ['ID5', 'ID7', 'ID13', 'ID15']})

    A    B
0   ID1  ID5
1   ID3  ID7
2   ID9  ID13
3   ID11 ID15

The IDs are complicated, with alphanumerics, and unique ('xxxx-xxxxxxxxxx-xxx' facultative suffix '-xxx-xxx'). I propose an example with simplified IDs.

In df_bisID I need to get rid of one of both ID and #, in each column, where it's respectively contained in columns A or B in df_eliminateID.

So I want to replace 'ID1#ID2' by 'ID2' and 'ID9#ID10' by 'ID10', etc.

Note that it also could be 'ID2#ID1' instead of 'ID1#ID2' and it's complicating the replacement conditions, that's also the cause of this question string replacement.

A logical solution to my problem is :

for i in df_eliminateID['A']:
    df_bisID.replace(i '#|#' i, '')
for i in df_eliminateID['B']:
    df_bisID.replace(i '#|#' i, '')

But at the end, my df_bisID is empty, that's why I ask your help with the use of | regex in my string replacement, I want to replace IDX# or #IDX by '' to conserve only one of the IDs. The OR is | in regex, maybe I didn't used it the right way (is there an exclusif or in regex ?)

In advance, thanks for your help.

CodePudding user response:

I highly recommend checking out the string ops available to pandas objects, especially the str.replace function, which I think is exactly what you want here:

for i in df_eliminateID['A']:    
    df_bisID = df_bisID.str.replace(f'{i}#|#{i}', '')
for i in df_eliminateID['B']:
    df_bisID = df_bisID.str.replace(f'{i}#|#{i}', '')

CodePudding user response:

You can rework your columns index, stack, remove the trailing part of the string and unstack:

(df_bisID.set_axis(pd.MultiIndex.from_product([[0,1],
                                               df_bisID.columns[:2]]),
                   axis=1)
         .stack(level=[0,1])
         .str.split('#').str[0]
         .unstack()
         .reset_index(drop=True)
) 

output:

      A     B
0   ID1   ID5
1   ID9  ID13
2   ID3   ID7
3  ID11  ID15
  • Related