Advanced lookups in Pandas Dataframe-CodePudding

I have a dataframe with 2 columns level1 and level2. Each account number in level1 is linked to a ParentID found in column level2.

df = pd.DataFrame([[7854568400,489],
                    [9632588400,126],
                    [3699633691,189],
                    [9876543697,987],
                    [7854568409,396],
                    [7854567893,897],      
                    [9632588409,147]],
columns = ['level1','level2'])
df

Output:

level1  level2
0   7854568400  489
1   9632588400  126
2   3699633691  189
3   9876543697  987
4   7854568409  396
5   7854567893  897
6   9632588409  147

For accounts ending in "8409" in column "level1" they are mapped to the wrong ParentID in level2. To find its correct ParentID, you need to search in level1 where you replace all accounts that end in"8409" with "8400". This will then find its equivalent account in the same column. Where a match is found, copy what is in column "level2" and replace it under the column for the accounts ending in "8409".

In the desired output below, account "7854568409" had its level2 changed from 396 to 489 (taken from row 0), and account "9632588409" had its level2 changed from 147 to 126 (taken from row 1). Note that nothing gets edited in column "level1" only in "level2".

Desired Output:

level1  level2
0   7854568400  489
1   9632588400  126
2   3699633691  189
3   9876543697  987
4   7854568409  489
5   7854567893  897
6   9632588409  126

Any thoughts on how to achieve this would be great.

CodePudding user response：

You can update the level2 column from a self-merge of df with df where in the latter (right side of merge) the level1 column is updated to the correct account numbers:

import pandas as pd

df = pd.DataFrame([[7854568400,489],
                    [9632588400,126],
                    [3699633691,189],
                    [9876543697,987],
                    [7854568409,396],
                    [7854567893,897],      
                    [9632588409,147]],
                  columns = ['level1','level2'])

df.level2 = df.merge(df.assign(level1 = df.level1.astype(str).str.replace('8409$', '8400', regex=True).astype('int64')), on='level1', how='right')['level2_x']

Result:

       level1  level2
0  7854568400     489
1  9632588400     126
2  3699633691     189
3  9876543697     987
4  7854568409     489
5  7854567893     897
6  9632588409     126

CodePudding user response：

You probably want to start by identifying which level1 values end in 8409. You can do this with the pd.Series.str() command.

ends_with_8409 = df["level1"].astype( str ).str[-4:] == "8409"

This will create a boolean series of the same length as df["level1"] and now you know which rows need a replacement in level2.

Now for getting these rows' appropriate level2 value, you probably want to do a merge (especially if your dataset is large this will be a lot faster than the map or apply alternatives).

df["temp_level1"] = df["level1"].where( ~ends_with_8409, df["level1"].astype( str ).str[0:-4]   "8400" ).astype( int )
final_df = df[ ["temp_level1"] ].merge( df[ ["level1","level2"] ], left_on = "temp_level1", right_on = "level1", how="left" ).drop( columns="temp_level1" )

Here, we create a column that replaces each level1 value ending with "8409" with the same value but ending in "8400". Then, merge that onto the original dataframe to get the appropriate level2 matches.

final_df[ "level1" ] = final_df[ "level1" ].where( ~ends_with_8409, final_df.astype( str ).str[0:-4]   "8409" ).astype( int )

Finally, using the ends_with_8409 boolean mask again, replace the relevant level1 values so that they end with "8409" again.

>>> final_df
           level1  level2
0  7854568400     489
1  9632588400     126
2  3699633691     189
3  9876543697     987
4  7854568409     489
5  7854567893     897
6  9632588409     126

Unfortunately, I don't think you can get around using astype so many times. pd.Series.str() only works if this series is string type and the merge will only work if the merged columns are the same type. Here is the documentation on pd.DataFrame.where().