Home > Software engineering >  *Not Iterate* Rows in Dataframe and insert column based on values in list of dictionaries
*Not Iterate* Rows in Dataframe and insert column based on values in list of dictionaries

Time:06-30

I have a DataFrame and a list of dictionaries. The DataFrame has 84k rows. Each row is an account for a specific client.

Each dict in the list belongs to a specific client. They can have up to 50 keys and as few as 2 keys. The dictionaries also need to be applied in the order they are listed. The first key/value in each dict shows the name of the client the dict belongs to. The second key/value is the name of the rule.

List of Dict Example:


0 {'client': 'client 1', 'Billing Code': 'TNL', 'Valuations': '0', 'Account Number': '>99999'}
1 {'client': 'client 1', 'Billing Code': 'MF', 'User': 'BP', 'Flag': 'S'}
...
13 {'client': 'client 2', 'Billing Code': 'TNL', 'Acct Desc': '*test*}

length: 427, dtype: object

DataFrame has these column names

df.columns = ['Source.Name','User Bank','Bank','Account Number','Account Description','Valuation Date',
              'Preschedule','MF Flag','Load Flag','Global Flag','Money Market Flag','Days Prior to Valuation',
              'Number of Holdings','Total Assets','Unit Value/NAV','MCS Field','From Date','Valuations',
              '# Sweeps','NASDAQ','TLA','Account Type','Fund Group','Master Account Text','Master Feeder Flag',
              'Acct Flag 2','Acct Field 4','Securities At Value','Net Assets','Acct Field 1','Acct Field 2',
              'Group Account Indicator','Group Account Number','Region','Account Status','SMS Billing Code',
             'Translation Date','Portfolio Manager 1','Acct Flag 1','Dual Flag','Securities At Value Base',
             'Net Assets Base','Total Assets Base','Dual OEIC']

Input DataFrame Dataframe containing data directly from client files

Client Short Name Source.Name User Bank Bank Account Number Account Description
Client #1 C#1.txt AA 01 1 Test Account
Client #1 C#1.txt AC 01 2 MY ACCOUNT
Client #1 C#1.txt AC 01 3 SUPER FUND
Client #1 C#1.txt AY 01 4 S&P INDEX
Client #1 C#1.txt AY 01 5 Test Account
Client #1 C#1.txt AA 01 6 INDEX
Client #1 C#1.txt AA 01 7 Test Account
Client #1 C#1.txt AA 01 8 RYAN'S Account
Client #2 C#2.txt BA 01 1 Test Account
Client #2 C#2.txt BB 01 33 INDEX
Client #2 C#2.txt BB 01 92 Test Account
Client #2 C#2.txt BZ 01 123123 INDEX
Client #3 C#3.txt BB 01 1657 Test Account
Client #3 C#3.txt BP 01 15454 Test Account
Client #4 C#4.txt GH 01 100 Test Account
Client #4 C#4.txt GH 01 19875 INDEX
Client #4 C#4.txt GY 01 13579 Test Account
Client #4 C#4.txt GE 01 2 INDEX
Client #4 C#4.txt GE 01 72 Test Account
Client #4 C#4.txt GP 01 96 GREEN Account

Desired Output Output should be the dataframe with a new column ['Billing Code'] based on the criteria from one of the 427 dictionaries.

Client Short Name Source.Name User Bank Bank Account Number Account Description Billing Code
Client #1 C#1.txt AA 01 1 Test Account TNL
Client #1 C#1.txt AC 01 2 MY ACCOUNT MF
Client #1 C#1.txt AC 01 3 SUPER FUND HF
Client #1 C#1.txt AY 01 4 S&P INDEX Index
Client #1 C#1.txt AY 01 5 Test Account TNL
Client #1 C#1.txt AA 01 6 INDEX Index
Client #1 C#1.txt AA 01 7 Test Account TNL
Client #1 C#1.txt AA 01 8 RYAN'S Account HF
Client #2 C#2.txt BA 01 1 Test Account TNL
Client #2 C#2.txt BB 01 33 INDEX Index
Client #2 C#2.txt BB 01 92 Test Account TNL
Client #2 C#2.txt BZ 01 123123 INDEX Index
Client #3 C#3.txt BB 01 1657 Test Account TNL
Client #3 C#3.txt BP 01 15454 Test Account TNL
Client #4 C#4.txt GH 01 100 Test Account TNL
Client #4 C#4.txt GH 01 19875 INDEX Index
Client #4 C#4.txt GY 01 13579 Test Account TNL
Client #4 C#4.txt GE 01 2 INDEX Index
Client #4 C#4.txt GE 01 72 Test Account TNL
Client #4 C#4.txt GP 01 96 GREEN Account MF

Column names match keys.

I basically need to iterate through each row of the data and determine if it meets the criteria in the first dict. If it does then df['Billing Code'] = that specific dict['Billing Code'] if that makes sense. If not then move on to the next billing code.

Iteration could take a very long time to run through all of this hence the "Not Iterate" in the title. Not sure if this is something list comprehension can do.

Thank you for any help anyone can provide!

CodePudding user response:

EDIT:

Based on your comments, I'd first create a mapping ClientID -> List of Dictionaries:

lst = [
    {
        "client": "Client #1",
        "Billing Code": "TNL",
        "Bank": 1,
        "Account Number": 1,
    },
    {
        "client": "Client #1",
        "Billing Code": "MF",
        "User Bank": "AY",
        "Bank": 1,
    },
    {
        "Billing Code": "TNL",
        "client": "Client #2",
        "User Bank": "BB",
    },
]

# create a mapping client no. -> list of dictionaries
m = {}
for d in lst:
    m.setdefault(d["client"], []).append(d)
    d.pop("client")

Then I'd use df.groupby by Client ID and apply custom function:

def fn(x):
    dictionaries = m.get(x.name, [])

    out = []
    for _, row in x.iterrows():
        for d in dictionaries:
            if all(row[k] == v for k, v in d.items() if k != "Billing Code"):
                out.append(d["Billing Code"])
                break
        else:
            out.append("Unclassified")

    x["Billng Code"] = out
    return x


df = df.groupby("Client Short Name").apply(fn)
print(df)

The result is:

   Client Short Name Source.Name User Bank  Bank  Account Number Account Description   Billng Code
0          Client #1     C#1.txt        AA     1               1        Test Account           TNL
1          Client #1     C#1.txt        AC     1               2          MY ACCOUNT  Unclassified
2          Client #1     C#1.txt        AC     1               3          SUPER FUND  Unclassified
3          Client #1     C#1.txt        AY     1               4           S&P INDEX            MF
4          Client #1     C#1.txt        AY     1               5        Test Account            MF
5          Client #1     C#1.txt        AA     1               6               INDEX  Unclassified
6          Client #1     C#1.txt        AA     1               7        Test Account  Unclassified
7          Client #1     C#1.txt        AA     1               8      RYAN'S Account  Unclassified
8          Client #2     C#2.txt        BA     1               1        Test Account  Unclassified
9          Client #2     C#2.txt        BB     1              33               INDEX           TNL
10         Client #2     C#2.txt        BB     1              92        Test Account           TNL
11         Client #2     C#2.txt        BZ     1          123123               INDEX  Unclassified
12         Client #3     C#3.txt        BB     1            1657        Test Account  Unclassified
13         Client #3     C#3.txt        BP     1           15454        Test Account  Unclassified
14         Client #4     C#4.txt        GH     1             100        Test Account  Unclassified
15         Client #4     C#4.txt        GH     1           19875               INDEX  Unclassified
16         Client #4     C#4.txt        GY     1           13579        Test Account  Unclassified
17         Client #4     C#4.txt        GE     1               2               INDEX  Unclassified
18         Client #4     C#4.txt        GE     1              72        Test Account  Unclassified
19         Client #4     C#4.txt        GP     1              96       GREEN Account  Unclassified
  • Related