Home > Back-end >  How to loop through a 3-D array and multiple dataframes?
How to loop through a 3-D array and multiple dataframes?

Time:05-27

Back ground info

multi_data is a 3d array , (10,5,5) array. For this example multi_data = np.arange(250).reshape(10,5,5) Each of the 10 matrices have 5X5 states (A-E). Each of the matrices are in order and represent time in years in increments of 1. Starting from matrices[0] which contains the matrix values for year 1, up to matrices[9] year 10.

Example of multi_data at year 1 multi_data[0]

array([[[  0,   1,   2,   3,   4],
        [  5,   6,   7,   8,   9],
        [ 10,  11,  12,  13,  14],
        [ 15,  16,  17,  18,  19],
        [ 20,  21,  22,  23,  24]],

Customers usually make a purchase within a few years (not immediately on signup), for example this customer made a purchase in year 3. Hence the matrix calculation for this customer starts at year 3. Each user has a current_state (A-E) and I need to transform the user data so that I can multiply it by the matrices. For example user customer1 has a current state B, hence the amount is the second element in the array customer1= np.array([0, 1000, 0, 0, 0])

dataframe 1 (customers)

cust_id|state|amount|year|
1      |   B | 1000 | 3
2      |   D | 500  | 2


multi_data = np.arange(250).reshape(10,5,5)
customer1= np.array([0, 1000, 0, 0, 0])
output = customer1
results = []
for arr in multi_data[3:4]: #customer purchases at year 3 hence I am multiplying customer1 by matrix at year 3
    output = output@arr
    results.append(output)

example of output results = [array([80000, 81000, 82000, 83000, 84000])]

I then need to multiply the results by dataframe 2

dataframe_2

| year  | lim %
|   1   |  0.19
|   2   |  0.11
|   3   |  0.02
|   10  |  0.23

so I multiply the results by lim % at year 3.

dataframe2=dataframe2.loc[dataframe2['year'] == 3]

results=dataframe2['LimitPerc'].values * results

example output results

[array([1600,1620,1640,1660,1680])]

I then need to multiply these results by matrix year 4 and then by lim% year 4 , until year 10 is reached.

like this:

customer1= [array([1600,1620,1640,1660,1680])]
output = customer1
results = []
for arr in data[4:5]: #multiplying by year 4 matrix (multi_data)
    output = output@arr
    results.append(output)

dataframe2=dataframe2.loc[dataframe2['year'] == 4]

results=dataframe2['LimitPerc'].values * results

Is there an easier way to do this that is less manual?. I need to continue this calculation until year 10, for each customer.I need to save the results for each customer, after every calculation.

Additional info: I am currently running through all customers years like below, but my problem is I have a lot of vlookup type calculations like dataframe2 that needs to be calculated inbetween each year for each customer, and I have to save the results for each customer after each calculation.

results_dict = {}
for _id, c, y in zip(cust_id ,cust_amt, year):
    results = []
    for m in multi_data[y:]:
        c = c @ m
        results.append(c)
    results_dict[_id] = results

CodePudding user response:

Unfortunately, since you need all the intermediate results, I don't think it's possible to optimise this much, so you will need the loop. If you didn't need the intermediate results, you could precompute the matrix product for each year up to year 10. However, here it is not useful.

To integrate the look-ups in your loop, you could just put all the dataframes in a list and use the DataFrame index to query the values. Also, you can convert the state to an integer index. Note that you don't need to create the customer1 vector. Since it's non-zero only in one position, you can directly extract the relevant row of the matrix and multiply it by amount.

Sample data:

import pandas as pd
import numpy as np

customer_data = pd.DataFrame({"cust_id": [1, 2, 3, 4, 5, 6, 7, 8],
                              "state": ['B', 'E', 'D', 'A', 'B', 'E', 'C', 'A'],
                              "cust_amt": [1000,300, 500, 200, 400, 600, 200, 300],
                              "year":[3, 3, 4, 3, 4, 2, 2, 4],
                              "group":[10, 25, 30, 40, 55, 60, 70, 85]})

state_list = ['A','B','C','D','E']

# All lookups should be dataframes with the year and/or group and the value like these.
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                       'lim %': 0.1})
lookup2 = pd.concat([pd.DataFrame({'group':g, 'lookup_val': 0.1, 'year':range(1, 11)} 
                                  for g in customer_data['group'].unique())]).explode('year')

multi_data = np.arange(250).reshape(10,5,5)

Preprocessing:

# Put all lookups in order of calculation in this list.
lookups = [lookup1, lookup2]

# Preprocessing.
# Transform the state to categorical code to use it as array index.
customer_data['state'] = pd.Categorical(customer_data['state'], 
                                        categories=state_list, 
                                        ordered=True).codes

# Set index on lookups.
for i in range(len(lookups)):
    if 'group' in lookups[i].columns:
        lookups[i] = lookups[i].set_index(['year', 'group'])
    else:
        lookups[i] = lookups[i].set_index(['year'])

Calculating results:

results = {}
for customer, state, amount, start, group in customer_data.itertuples(name=None, index=False):
    for year in range(start, len(multi_data) 1):
        if year == start:
            results[customer] = [[amount * multi_data[year-1, state, :]]]
        else:
            results[customer].append([results[customer][-1][-1] @ multi_data[year-1]])
                
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])

Accessing the results:

# Here are examples of how you obtain the results from the dictionary:

# Customer 1 at start year, first result.
results[1][0][0]

# Customer 1 at start year, second result (which would be after lookup1 here).
results[1][0][1]

# Customer 1 at start year, third result (which would be after lookup2 here).
results[1][0][2]

# Customer 1 at year start 1, first result.
results[1][1][0]

# ...

# Customer c at year start y, result k 1.
results[c][y][k]
  • Related