I've got 2 Dataframes :
df_bisID = pd.DataFrame({'A': ['ID1#ID2', 'ID3#ID4'],
'B': ['ID5#ID6', 'ID7#ID8'],
'C': ['ID9#ID10', 'ID11#ID12'],
'D': ['ID13#ID14', 'ID15#ID16']})
A B C D
0 ID1#ID2 ID5#ID6 ID9#ID10 ID13#ID14
1 ID3#ID4 ID7#ID8 ID11#ID12 ID15#ID16
df_eliminateID = pd.DataFrame({'A': ['ID1', 'ID3', 'ID9', 'ID11'],
'B': ['ID5', 'ID7', 'ID13', 'ID15']})
A B
0 ID1 ID5
1 ID3 ID7
2 ID9 ID13
3 ID11 ID15
The IDs are complicated, with alphanumerics, and unique ('xxxx-xxxxxxxxxx-xxx' facultative suffix '-xxx-xxx'). I propose an example with simplified IDs.
In df_bisID I need to get rid of one of both ID and #, in each column, where it's respectively contained in columns A or B in df_eliminateID.
So I want to replace 'ID1#ID2' by 'ID2' and 'ID9#ID10' by 'ID10', etc.
Note that it also could be 'ID2#ID1' instead of 'ID1#ID2' and it's complicating the replacement conditions, that's also the cause of this question string replacement.
A logical solution to my problem is :
for i in df_eliminateID['A']:
df_bisID.replace(i '#|#' i, '')
for i in df_eliminateID['B']:
df_bisID.replace(i '#|#' i, '')
But at the end, my df_bisID is empty, that's why I ask your help with the use of | regex in my string replacement, I want to replace IDX# or #IDX by '' to conserve only one of the IDs. The OR is | in regex, maybe I didn't used it the right way (is there an exclusif or in regex ?)
In advance, thanks for your help.
CodePudding user response:
I highly recommend checking out the string ops available to pandas objects, especially the str.replace function, which I think is exactly what you want here:
for i in df_eliminateID['A']:
df_bisID = df_bisID.str.replace(f'{i}#|#{i}', '')
for i in df_eliminateID['B']:
df_bisID = df_bisID.str.replace(f'{i}#|#{i}', '')
CodePudding user response:
You can rework your columns index, stack
, remove the trailing part of the string and unstack
:
(df_bisID.set_axis(pd.MultiIndex.from_product([[0,1],
df_bisID.columns[:2]]),
axis=1)
.stack(level=[0,1])
.str.split('#').str[0]
.unstack()
.reset_index(drop=True)
)
output:
A B
0 ID1 ID5
1 ID9 ID13
2 ID3 ID7
3 ID11 ID15