Home > Net >  Advanced Querying in Pandas Dataframe
Advanced Querying in Pandas Dataframe

Time:12-02

I have a dataframe with 2 columns level1 and level2. Each account number in level1 is linked to a ParentID found in column level2. For accounts ending in "8409" in column "level1" some of them are mapped to the wrong ParentID in level2. To find its correct ParentID, you need to search in level1 where you replace all accounts that end in"8409" with "8400". This will then find its equivalent account in the same column. Where a match is found, copy what is in column "level2" and replace it under the column for the accounts ending in "8409".

import pandas as pd
import numpy as np
df = pd.DataFrame([[7854568400,489],
                    [9632588400,126],
                    [3699633691,189],
                    [9876543697,987],
                    [1111118409,987],
                    [7854568409,396],
                    [7854567893,897],      
                    [9632588409,147]],
columns = ['level1','level2'])
df

The solution below allowed to create a new column "new_level2" to solve the above problem.

maps = df.set_index('level1')['level2']
s = df['level1'].astype(str).str.replace('8409$', '8400', regex=True).astype('int64')
df['new_level2'] = s.map(maps).combine_first(df['level2']).convert_dtypes()

In the output below, account "7854568409" had its level2 changed from 396 to 489 (taken from row 0), and account "9632588409" had its level2 changed from 147 to 126 (taken from row 1).

level1  level2  new_level2
0   7854568400  489 489
1   9632588400  126 126
2   3699633691  189 189
3   9876543697  987 987
4   1111118409  987 987
5   7854568409  396 489
6   7854567893  897 897
7   9632588409  147 126

However, when I apply the above solution with to additional variables this is where I run into problems mainly when I add the currency to the dataframe. The replacing of level2 values ONLY applies to USd only, all other currencies need to retain its current value in column level2.

Old DF

df = pd.DataFrame([['USD',7854568400,489],
                    ['USD',9632588400,126],
                    ['USD',3699633691,189],
                    ['USD',9876543697,987],
                    ['EUR',1111118409,987],
                    ['USD',1111118409,987],
                    ['USD',7854568409,396],
                    ['USD',7854567893,897],      
                    ['USD',9632588409,147]],
columns = ['cur','level1','level2'])
df

Revised DF

df = pd.DataFrame([['USD',7854568400,489],
                    ['USD',9632588400,126],
                    ['USD',3699633691,189],
                    ['USD',9876543697,987],
                    ['EUR',1111118400,120],
                    ['EUR',1111118409,987],
                    ['USD',1111118409,987],
                    ['USD',7854568409,396],
                    ['USD',7854567893,897],      
                    ['USD',9632588409,147]],
columns = ['cur','level1','level2'])

I get the following error when i attempt to apply the above solution to the new dataframe that contains the currency. InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Its worth noting that you can have the same account number across different currencies.

Desired output is below. Only 2 accounts (9632588409 & 7854568409) had their level2 changed. Index 4 and 5 should retain its original level2 value because they are euro and not in scope and Index 6 because there was no corresponding match found for this account, therefore it retains it original value.

   cur  level1     level2   new_level2
0   USD 7854568400  489     489
1   USD 9632588400  126     126
2   USD 3699633691  189     189
3   USD 9876543697  987     987
4   EUR 1111118400  120     120
5   EUR 1111118409  987     987
6   USD 1111118409  987     987
7   USD 7854568409  396     489
8   USD 7854567893  897     897
9   USD 9632588409  147     126

Any help is greatly appreciated.

CodePudding user response:

df = pd.DataFrame([['USD',7854568400,489],
                    ['USD',9632588400,126],
                    ['USD',3699633691,189],
                    ['USD',9876543697,987],
                    ['EUR',1111118400,120],
                    ['EUR',1111118409,987],
                    ['USD',1111118409,987],
                    ['USD',7854568409,396],
                    ['USD',7854567893,897],      
                    ['USD',9632588409,147]],
columns = ['cur','level1','level2'])

df['level1'] = df['level1'].astype(str).str.replace('8409$', '8400', regex=True).astype('int64')

df['new_col'] = df.where(df['cur'] == 'USD').groupby(['level1', 'cur'])['level2']\
.transform('first').fillna(df['level2']).astype(int)

print(df)

   cur      level1  level2  new_col
0  USD  7854568400     489      489
1  USD  9632588400     126      126
2  USD  3699633691     189      189
3  USD  9876543697     987      987
4  EUR  1111118400     120      120
5  EUR  1111118400     987      987
6  USD  1111118400     987      987
7  USD  7854568400     396      489
8  USD  7854567893     897      897
9  USD  9632588400     147      126
  • Related