Optimizing Memory Allocations of Pandas Code to Process Rows Using Explicit Loops with Numba Optimization


Assume I have data in the following form (as a Pandas DataFrame):

Index  ID  Value  Div Factor  Weighted Sum
    1   1      2           1
    2   1      3           2
    3   2      6           1
    4   1      1           3
    5   2      3           2
    6   2      9           3
    7   2      8           4
    8   3      5           1
    9   3      6           2
   10   1      8           4
   11   3      2           3
   12   3      7           4

I want to calculate the column Weighted Sum as follows (for the $i$-th row):

  1. Look at all values from row 1 to i.
  2. Sum the values by group according to each row's ID. This gives k sums, where k is the number of unique ID values in rows 1 to i.
  3. Divide each of the k group sums by the number of elements in its group.
  4. Sum those k group averages and divide by k (the average of the averages); see the formula after this list.
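
In formula form (my compact restatement of the steps above), with $G_i$ the set of unique IDs appearing in rows $1$ to $i$ and $k_i = |G_i|$:

$$\text{Weighted Sum}_i = \frac{1}{k_i} \sum_{g \in G_i} \frac{\sum_{j \le i,\, \mathrm{ID}_j = g} \mathrm{Value}_j}{\left|\{\, j \le i : \mathrm{ID}_j = g \,\}\right|}$$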

For example, let's do rows 1, 7 and 12:

Row 1

For i = 1 we have a single value, hence the sum is 2, the average of the single group is 2, and the average over all groups is 2.

Row 7

For i = 7 there are only 2 unique ID values in rows 1 to 7: 1 and 2.
For the group of ID = 1 we have: (1 + 3 + 2) / 3 = 2.
For the group of ID = 2 we have: (8 + 9 + 3 + 6) / 4 = 6.5.
Then the average of averages is (2 + 6.5) / 2 = 4.25.

Row 12

For i = 12 there are 3 unique ID values in rows 1 to 12.
For the group of ID = 1 we have: (8 + 1 + 3 + 2) / 4 = 3.5.
For the group of ID = 2 we have: (8 + 9 + 3 + 6) / 4 = 6.5.
For the group of ID = 3 we have: (7 + 2 + 6 + 5) / 4 = 5.
Then the average of averages is (3.5 + 6.5 + 5) / 3 = 5.
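
The examples above can be reproduced with a short, unoptimized reference implementation (a sketch of mine, not part of the original question; far too slow for the sizes in the remark below, but handy for checking correctness):

import pandas as pd

def weighted_sum_naive(df):
    # For each row i: group rows 1..i by ID, average each group,
    # then average those group means (steps 1-4 above).
    out = []
    for i in range(len(df)):
        head = df.iloc[: i + 1]
        group_means = head.groupby('ID')['Value'].mean()
        out.append(group_means.mean())
    return pd.Series(out, index=df.index, name='Weighted Sum')

For the 12-row example this should reproduce the values worked out above: 2.0 for row 1, 4.25 for row 7 and 5.0 for row 12.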

Remark: The method should be feasible for ~1e7 rows and ~1e6 unique IDs.

This is a follow-up to Apply a Function per Row of Sub Groups of the Data **Above** the Current Row, where I got a good answer, yet it allocates a lot of memory when both the number of rows and the number of unique IDs are high.

I was wondering if there is a way to use much smaller auxiliary data by processing rows with explicit loops, accelerated with Numba.

The idea is to have similar or better performance while reducing the memory footprint considerably.

CodePudding user response:

Here's a benchmark showing the performance of pandas, numpy and numba/numpy solutions at various row counts (12 to 36,000) and unique ID counts (3 to 9,000):

rows 12, unique ID values: 3
Timeit results:
foo_1 (pandas) ran in 0.003095399937592447 seconds using 1 iterations
foo_2 (numpy) ran in 0.0003358999965712428 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00018770003225654364 seconds using 1 iterations

rows 120, unique ID values: 30
Timeit results:
foo_1 (pandas) ran in 0.0024368000449612737 seconds using 1 iterations
foo_2 (numpy) ran in 0.001127400086261332 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00029390002600848675 seconds using 1 iterations

rows 1200, unique ID values: 300
Timeit results:
foo_1 (pandas) ran in 0.01624089991673827 seconds using 1 iterations
foo_2 (numpy) ran in 0.009926999919116497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.002144100028090179 seconds using 1 iterations

rows 12000, unique ID values: 3000
Timeit results:
foo_1 (pandas) ran in 2.391147599904798 seconds using 1 iterations
foo_2 (numpy) ran in 0.2884287000633776 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.1226186000276357 seconds using 1 iterations

rows 36000, unique ID values: 9000
Timeit results:
foo_1 (pandas) ran in 44.33448620000854 seconds using 1 iterations
foo_2 (numpy) ran in 3.0259654000401497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 1.6273660999722779 seconds using 1 iterations

The pandas solution creates an intermediate dataframe that is num unique IDs x num rows in size. The numpy and numpy/numba solutions calculate results one ID (one column of the conceptual pivot) at a time, so they only create a handful of intermediate 1-D arrays of length num rows. In these runs, the numpy/numba solution is consistently 2-5 times faster than numpy, and pandas is roughly 2-15 times slower than numpy.

Upping the size a bit more gives the following result (here the pandas solution is effectively skipped by having foo_1 return None immediately, which is why it reports only microseconds):

rows 120000, unique ID values: 30000
Timeit results:
foo_1 (pandas) ran in 6.00004568696022e-06 seconds using 1 iterations
foo_2 (numpy) ran in 28.882483799941838 seconds using 1 iterations
foo_3 (numpy_numba) ran in 38.77682559995446 seconds using 1 iterations

So it appears that there is a threshold above which the numpy/numba solution loses ground to regular numpy.


Full test code:

import pandas as pd
# insert code to initialize dfInit here
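# (a reconstruction from the example table in the question, so the script runs end-to-end)
dfInit = pd.DataFrame({
    'Index':      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'ID':         [1, 1, 2, 1, 2, 2, 2, 3, 3, 1, 3, 3],
    'Value':      [2, 3, 6, 1, 3, 9, 8, 5, 6, 8, 2, 7],
    'Div Factor': [1, 2, 1, 3, 2, 3, 4, 1, 2, 4, 3, 4],
})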
print(dfInit)
'''
Index   ID  Value   Div Factor
1   1   2   1   
2   1   3   2   
3   2   6   1   
4   1   1   3   
5   2   3   2   
6   2   9   3   
7   2   8   4   
8   3   5   1   
9   3   6   2   
10  1   8   4   
11  3   2   3   
12  3   7   4
'''

def initDf(colMult=1):
    df = dfInit.copy()
    dfMult = pd.concat([df.assign(ID=dfInit.ID + 3*i, Index=df.Index + len(df)*i) for i in range(colMult)], axis=0).reset_index(drop=True)
    print(f'\nrows {len(dfMult)}, unique ID values: {len(dfMult.ID.unique())}')
    return dfMult
df = initDf()

def pd_foo_1(df):
    # Pivot Value by ID (num rows x num unique IDs), then divide cumulative
    # group sums by cumulative group counts and average across the ID columns.
    df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
    df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
    df['Weighted Sum'] = df2.mean(axis=1)
    return df
def foo_1(df):
    #return None  # uncomment to skip the pandas solution at large sizes
    try:
        return pd_foo_1(df)
    except ValueError:
        print('overflow encountered')
        return None

import numpy as np

def foo_2(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    uniqIds = df.ID.unique()
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)

    for id in uniqIds:
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        # running group mean for this ID (guarding against division by zero
        # before the ID first appears), accumulated across IDs
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    df['Weighted Sum'] = weightedSum
    return df
from numba import njit
@njit
def np_foo_3(values, ids):
    # Same per-ID loop as foo_2, compiled with Numba and operating on plain NumPy arrays
    uniqIds = np.unique(ids)
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)
    for id in uniqIds:
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    return weightedSum
def foo_3(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    weightedSum = np_foo_3(values, ids)
    df['Weighted Sum'] = weightedSum
    return df

foo_count = 3
foo_names=['foo_' + str(i + 1) for i in range(foo_count)]
foo_labels=['pandas', 'numpy', 'numpy_numba']
exec("foo_funcs=[" + ','.join(f"foo_{str(i + 1)}" for i in range(foo_count)) + "]")
for foo in foo_names:
    print(f'{foo} output:')
    #print(eval(f"{foo}(df)"))
    eval(f"{foo}(df)"); print("... output suppressed.")


# ===================== BENCHMARK with timeit:
from timeit import timeit
n = 1
for colMult in [1,10,100,1000,3000,10000]:
    df = initDf(colMult)
    print(f'Timeit results:')
    for i, foo in enumerate(foo_names):
        t = timeit(f"{foo}(df)", setup=f"from __main__ import df, {foo}", number=n) / n
        print(f'{foo} ({foo_labels[i]}) ran in {t} seconds using {n} iterations')
# ===================== ... END BENCHMARK with timeit.

Space used:

The memory used by a pandas solution that pivots on ID is proportional to num rows x num unique IDs. By comparison, a solution that loops over IDs, processing one copy of the Value column at a time, uses memory proportional to num rows.

This means that for 10^7 rows x 10^6 unique IDs, the pivot table holds about 10^13 values of 4-8 bytes each (call it 10^14 bytes, or 100,000 GB), so storing it in program memory with pandas is not feasible.

However, the looping solutions (numpy or numpy/numba above) use at most about 10 1-D arrays of 10^7 8-byte values each, on the order of 10^9 bytes or 1 GB of program memory.
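
As a quick back-of-envelope check of these two estimates (assuming 8-byte float64 values throughout; the variable names are only for illustration):

rows, uniq_ids = 10**7, 10**6
pivot_bytes = rows * uniq_ids * 8   # pandas pivot: one cell per (row, unique ID)
loop_bytes = 10 * rows * 8          # ~10 temporary 1-D arrays of length num rows
print(f'pivot: ~{pivot_bytes / 1e9:,.0f} GB, looping: ~{loop_bytes / 1e9:,.1f} GB')
# pivot: ~80,000 GB, looping: ~0.8 GB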

Note that adding a nested loop over chunks of a fixed number of rows would let us cap the memory used by most of the roughly 10 1-D arrays mentioned above, but we still need at least one full-length array (the result), so this can never give more than about a further factor of 10 reduction in memory footprint.
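
To illustrate what such a further reduction could look like, here is a sketch (mine, not part of the answer above) of an explicit per-row loop under Numba that keeps only a running sum and count per ID, so the auxiliary memory is proportional to the number of unique IDs plus the single result array. The function name np_weighted_sum_rows and the use of np.searchsorted are my own choices, not from the original code:

from numba import njit
import numpy as np

@njit
def np_weighted_sum_rows(values, ids):
    # Map each row's ID to a dense index 0..k-1 via the sorted unique IDs.
    uniqIds = np.unique(ids)
    idIdx = np.searchsorted(uniqIds, ids)
    k = uniqIds.shape[0]
    runSum = np.zeros(k)     # running sum of Value per ID
    runCnt = np.zeros(k)     # running count of rows per ID
    sumOfMeans = 0.0         # running sum of the per-ID means
    nSeen = 0                # number of distinct IDs seen so far
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        j = idIdx[i]
        if runCnt[j] == 0:
            nSeen += 1                            # first appearance of this ID
        else:
            sumOfMeans -= runSum[j] / runCnt[j]   # retract this ID's old mean
        runSum[j] += values[i]
        runCnt[j] += 1
        sumOfMeans += runSum[j] / runCnt[j]       # add this ID's updated mean
        out[i] = sumOfMeans / nSeen
    return out

It would be called the same way as np_foo_3, e.g. df['Weighted Sum'] = np_weighted_sum_rows(df.Value.to_numpy(), df.ID.to_numpy()). Whether it is also faster than the per-ID loop would have to be benchmarked the same way as above.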
