I have a dataframe with 2 columns level1 and level2. Each account number in level1 is linked to a ParentID found in column level2.
df = pd.DataFrame([[7854568400,489],
[9632588400,126],
[3699633691,189],
[9876543697,987],
[7854568409,396],
[7854567893,897],
[9632588409,147]],
columns = ['level1','level2'])
df
Output:
level1 level2
0 7854568400 489
1 9632588400 126
2 3699633691 189
3 9876543697 987
4 7854568409 396
5 7854567893 897
6 9632588409 147
For accounts ending in "8409" in column "level1" they are mapped to the wrong ParentID in level2. To find its correct ParentID, you need to search in level1 where you replace all accounts that end in"8409" with "8400". This will then find its equivalent account in the same column. Where a match is found, copy what is in column "level2" and replace it under the column for the accounts ending in "8409".
In the desired output below, account "7854568409" had its level2 changed from 396 to 489 (taken from row 0), and account "9632588409" had its level2 changed from 147 to 126 (taken from row 1). Note that nothing gets edited in column "level1" only in "level2".
Desired Output:
level1 level2
0 7854568400 489
1 9632588400 126
2 3699633691 189
3 9876543697 987
4 7854568409 489
5 7854567893 897
6 9632588409 126
Any thoughts on how to achieve this would be great.
CodePudding user response:
You can update the level2
column from a self-merge of df
with df
where in the latter (right side of merge) the level1
column is updated to the correct account numbers:
import pandas as pd
df = pd.DataFrame([[7854568400,489],
[9632588400,126],
[3699633691,189],
[9876543697,987],
[7854568409,396],
[7854567893,897],
[9632588409,147]],
columns = ['level1','level2'])
df.level2 = df.merge(df.assign(level1 = df.level1.astype(str).str.replace('8409$', '8400', regex=True).astype('int64')), on='level1', how='right')['level2_x']
Result:
level1 level2
0 7854568400 489
1 9632588400 126
2 3699633691 189
3 9876543697 987
4 7854568409 489
5 7854567893 897
6 9632588409 126
CodePudding user response:
You probably want to start by identifying which level1 values end in 8409. You can do this with the pd.Series.str() command.
ends_with_8409 = df["level1"].astype( str ).str[-4:] == "8409"
This will create a boolean series of the same length as df["level1"]
and now you know which rows need a replacement in level2.
Now for getting these rows' appropriate level2 value, you probably want to do a merge (especially if your dataset is large this will be a lot faster than the map or apply alternatives).
df["temp_level1"] = df["level1"].where( ~ends_with_8409, df["level1"].astype( str ).str[0:-4] "8400" ).astype( int )
final_df = df[ ["temp_level1"] ].merge( df[ ["level1","level2"] ], left_on = "temp_level1", right_on = "level1", how="left" ).drop( columns="temp_level1" )
Here, we create a column that replaces each level1 value ending with "8409" with the same value but ending in "8400". Then, merge that onto the original dataframe to get the appropriate level2 matches.
final_df[ "level1" ] = final_df[ "level1" ].where( ~ends_with_8409, final_df.astype( str ).str[0:-4] "8409" ).astype( int )
Finally, using the ends_with_8409
boolean mask again, replace the relevant level1 values so that they end with "8409" again.
>>> final_df
level1 level2
0 7854568400 489
1 9632588400 126
2 3699633691 189
3 9876543697 987
4 7854568409 489
5 7854567893 897
6 9632588409 126
Unfortunately, I don't think you can get around using astype so many times. pd.Series.str() only works if this series is string type and the merge will only work if the merged columns are the same type. Here is the documentation on pd.DataFrame.where().