Duplicate Rows in a Pandas DataFrame and replacing values by multiple other values

Time:10-30

First of all, hi everyone! This is the first time I am actually posting a question on Stack Overflow, so if I am too specific or too general, I would appreciate some advice :).

I have a pandas DataFrame containing some SAP authorization data, in which a column can contain "placeholder values" that need to be resolved to their corresponding values. At this point I have really run out of ideas.

For example, in the DataFrame shown below I have two roles, each containing the (authorization) object F_BKPF_BUK with the fields ACTVT and ABC. The field ABC has the placeholder value $EKGRP in the LOW column.

                   ROLE        OBJECT    FIELD     LOW HIGH
0  D:AS:MY_FANCY_ROLE_A    F_BKPF_BUK    ACTVT      03  NaN
1  D:AS:MY_FANCY_ROLE_A    F_BKPF_BUK      ABC  $EKGRP  NaN
2  D:AS:MY_FANCY_ROLE_B    F_BKPF_BUK    ACTVT      03  NaN
3  D:AS:MY_FANCY_ROLE_B    F_BKPF_BUK      ABC  $EKGRP  NaN

The tricky part is that the placeholder $EKGRP usually resolves to multiple values, and these values depend on the role (!). The DataFrame holding the values for $EKGRP looks as follows:

                   ROLE   VARBL  LOW HIGH
0  D:AS:MY_FANCY_ROLE_A  $EKGRP  U01  U99
1  D:AS:MY_FANCY_ROLE_A  $EKGRP  P01  P99
2  D:AS:MY_FANCY_ROLE_A  $EKGRP  P01  P29
3  D:AS:MY_FANCY_ROLE_B  $EKGRP  P01  P00
4  D:AS:MY_FANCY_ROLE_B  $EKGRP  N01  N99
5  D:AS:MY_FANCY_ROLE_B  $EKGRP  I01  I99
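
To make the example reproducible, the two frames above could be built roughly like this. The variable names df1 (authorization data) and df2 (placeholder values) are only labels chosen for illustration, and all values are stored as strings so that "03" keeps its leading zero:

```python
import numpy as np
import pandas as pd

# df1: authorization data; LOW may hold a placeholder such as "$EKGRP"
df1 = pd.DataFrame({
    "ROLE": ["D:AS:MY_FANCY_ROLE_A", "D:AS:MY_FANCY_ROLE_A",
             "D:AS:MY_FANCY_ROLE_B", "D:AS:MY_FANCY_ROLE_B"],
    "OBJECT": ["F_BKPF_BUK"] * 4,
    "FIELD": ["ACTVT", "ABC", "ACTVT", "ABC"],
    "LOW": ["03", "$EKGRP", "03", "$EKGRP"],
    "HIGH": [np.nan] * 4,
})

# df2: role-dependent values that each placeholder (VARBL) resolves to
df2 = pd.DataFrame({
    "ROLE": ["D:AS:MY_FANCY_ROLE_A"] * 3 + ["D:AS:MY_FANCY_ROLE_B"] * 3,
    "VARBL": ["$EKGRP"] * 6,
    "LOW": ["U01", "P01", "P01", "P01", "N01", "I01"],
    "HIGH": ["U99", "P99", "P29", "P00", "N99", "I99"],
})
```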

The final result I would like to achieve is that every row containing a placeholder is replaced by one row per corresponding value, with both LOW and HIGH substituted:

                   ROLE        OBJECT    FIELD     LOW HIGH
0  D:AS:MY_FANCY_ROLE_A    F_BKPF_BUK    ACTVT      03  NaN
1  D:AS:MY_FANCY_ROLE_A    F_BKPF_BUK      ABC     U01  U99
2  D:AS:MY_FANCY_ROLE_A    F_BKPF_BUK      ABC     P01  P99
3  D:AS:MY_FANCY_ROLE_A    F_BKPF_BUK      ABC     P01  P29
4  D:AS:MY_FANCY_ROLE_B    F_BKPF_BUK    ACTVT      03  NaN
5  D:AS:MY_FANCY_ROLE_B    F_BKPF_BUK      ABC     P01  P00
6  D:AS:MY_FANCY_ROLE_B    F_BKPF_BUK      ABC     N01  N99
7  D:AS:MY_FANCY_ROLE_B    F_BKPF_BUK      ABC     I01  I99

Having started using pandas only a few weeks ago, I soon ran out of ideas for this particular problem. My latest guess was to use df.apply(...) to check for a placeholder, but that would not solve the real issue: once a placeholder is found, the original row has to be duplicated several times, with the LOW and HIGH values of each copy changed to the corresponding values. Which pandas function would you recommend I take a closer look at? I would like to avoid row-by-row iteration as far as possible and learn the "best practices" for this kind of problem.

Answer:

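If possible, use an outer join between the authorization DataFrame (df1) and the placeholder DataFrame (df2). First copy LOW to a new VARBL column with DataFrame.assign and merge on ROLE and VARBL. Then fill the missing values with DataFrame.fillna (the trailing _ has to be removed from the column names so that they match), and finally drop the columns that are no longer needed: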

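# Copy LOW into VARBL so that placeholder rows can be matched against df2 per ROLE
df = (df1.assign(VARBL=df1['LOW'])
         .merge(df2, on=['ROLE', 'VARBL'], how='outer', suffixes=('_', '')))

# Rows without a match in df2 get their original values back (kept in LOW_/HIGH_)
df[['LOW', 'HIGH']] = df[['LOW', 'HIGH']].fillna(
    df[['LOW_', 'HIGH_']].rename(columns=lambda x: x.rstrip('_')))

# Drop the helper columns
df = df.drop(columns=['LOW_', 'HIGH_', 'VARBL'])
print(df)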
                   ROLE      OBJECT  FIELD  LOW HIGH
0  D:AS:MY_FANCY_ROLE_A  F_BKPF_BUK  ACTVT   03  NaN
1  D:AS:MY_FANCY_ROLE_A  F_BKPF_BUK    ABC  U01  U99
2  D:AS:MY_FANCY_ROLE_A  F_BKPF_BUK    ABC  P01  P99
3  D:AS:MY_FANCY_ROLE_A  F_BKPF_BUK    ABC  P01  P29
4  D:AS:MY_FANCY_ROLE_B  F_BKPF_BUK  ACTVT   03  NaN
5  D:AS:MY_FANCY_ROLE_B  F_BKPF_BUK    ABC  P01  P00
6  D:AS:MY_FANCY_ROLE_B  F_BKPF_BUK    ABC  N01  N99
7  D:AS:MY_FANCY_ROLE_B  F_BKPF_BUK    ABC  I01  I99
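
A variation on the same idea, in case the original row order matters or df2 contains placeholders that df1 never uses: a left join keeps the rows of df1 in their original order and does not add unmatched rows from df2, whereas an outer join may sort the result by the join keys depending on the pandas version. The sketch below is self-contained. The frame names and the helper column names LOW_new and HIGH_new are chosen here purely for illustration, and missing HIGH values are stored as None:

```python
import pandas as pd

# Sample data from the question (HIGH is missing for every row of df1)
df1 = pd.DataFrame({
    "ROLE": ["D:AS:MY_FANCY_ROLE_A", "D:AS:MY_FANCY_ROLE_A",
             "D:AS:MY_FANCY_ROLE_B", "D:AS:MY_FANCY_ROLE_B"],
    "OBJECT": ["F_BKPF_BUK"] * 4,
    "FIELD": ["ACTVT", "ABC", "ACTVT", "ABC"],
    "LOW": ["03", "$EKGRP", "03", "$EKGRP"],
    "HIGH": [None] * 4,
})
df2 = pd.DataFrame({
    "ROLE": ["D:AS:MY_FANCY_ROLE_A"] * 3 + ["D:AS:MY_FANCY_ROLE_B"] * 3,
    "VARBL": ["$EKGRP"] * 6,
    "LOW": ["U01", "P01", "P01", "P01", "N01", "I01"],
    "HIGH": ["U99", "P99", "P29", "P00", "N99", "I99"],
})

# Rename df2 so that its placeholder column lines up with df1's LOW column
# and its value columns get distinct (hypothetical) helper names.
lookup = df2.rename(columns={"VARBL": "LOW", "LOW": "LOW_new", "HIGH": "HIGH_new"})

# Left join: each placeholder row is repeated once per matching value,
# rows without a placeholder stay as a single row with NaN in the helper columns.
merged = df1.merge(lookup, on=["ROLE", "LOW"], how="left")

# Take the resolved values where a match was found, else keep the original ones
resolved = merged["LOW_new"].notna()
merged["LOW"] = merged["LOW_new"].where(resolved, merged["LOW"])
merged["HIGH"] = merged["HIGH_new"].where(resolved, merged["HIGH"])

result = merged.drop(columns=["LOW_new", "HIGH_new"])
```

Printing result should show the eight rows in the order the question asks for, with the ACTVT row of each role followed by its resolved ABC rows.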