Assume I have data in the following form (as a pandas DataFrame):
Index | ID | Value | Div Factor | Weighted Sum |
---|---|---|---|---|
1 | 1 | 2 | 1 | |
2 | 1 | 3 | 2 | |
3 | 2 | 6 | 1 | |
4 | 1 | 1 | 3 | |
5 | 2 | 3 | 2 | |
6 | 2 | 9 | 3 | |
7 | 2 | 8 | 4 | |
8 | 3 | 5 | 1 | |
9 | 3 | 6 | 2 | |
10 | 1 | 8 | 4 | |
11 | 3 | 2 | 3 | |
12 | 3 | 7 | 4 | |
I want to calculate the column Weighted Sum as follows (for the $i$-th row):
- Look at all values from row 1 to $i$.
- Sum the values in groups according to the `ID` value of each row, so we have $k$ sums, where $k$ is the number of unique `ID` values among rows 1 to $i$.
- Divide each of the $k$ sums by the number of elements in its group.
- Sum those $k$ averages and divide by $k$ (the average of the averages); the formula below summarizes this.
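Equivalently, writing $k_i$ for the number of unique `ID` values among rows $1..i$ and $G_{i,g}$ for the set of rows $j \le i$ whose `ID` equals $g$:

$$\text{Weighted Sum}_i = \frac{1}{k_i} \sum_{g} \frac{1}{\lvert G_{i,g} \rvert} \sum_{j \in G_{i,g}} \text{Value}_j,$$

where the outer sum runs over the $k_i$ distinct `ID` values seen so far.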
For example, let's do rows 1, 7 and 12:
**Row 1**
For $i = 1$ we have a single value, hence the sum is 2, the average of the single group is 2, and the average over all groups is 2.
**Row 7**
For $i = 7$ we have only 2 unique `ID` values up to it: 1 and 2.
For the group of `ID = 1` we have: (1 + 3 + 2) / 3 = 2.
For the group of `ID = 2` we have: (8 + 9 + 3 + 6) / 4 = 6.5.
Then the average of averages is (2 + 6.5) / 2 = 4.25.
**Row 12**
For $i = 12$ we have 3 unique `ID` values in rows 1:12.
For the group of `ID = 1` we have: (8 + 1 + 3 + 2) / 4 = 3.5.
For the group of `ID = 2` we have: (8 + 9 + 3 + 6) / 4 = 6.5.
For the group of `ID = 3` we have: (7 + 2 + 6 + 5) / 4 = 5.
Then the average of averages is (3.5 + 6.5 + 5) / 3 = 5.
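A brute-force reference implementation of this definition (far too slow for the target sizes, but handy for checking the examples above; the function name is just for illustration) could look like this:

import pandas as pd

def weighted_sum_reference(df):
    # For each row i, average the per-ID means over rows 1..i (O(n^2) overall).
    out = []
    for i in range(len(df)):
        head = df.iloc[: i + 1]
        out.append(head.groupby('ID')['Value'].mean().mean())
    return pd.Series(out, index=df.index)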
Remark: The method should be feasible for ~1e7 rows and ~1e6 unique IDs.
As a follow-up to Apply a Function per Row of Sub Groups of the Data **Above** the Current Row: I got a good answer there, yet it allocates a lot of memory when the number of rows and the number of unique `ID` values are high.
I was wondering if there is a way to build a much smaller auxiliary data structure using explicit loops, accelerating them with Numba.
The idea is to get similar or better performance while reducing the memory footprint considerably.
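The kind of loop I have in mind would look roughly like this (an untested sketch; it assumes the IDs are first factorized to the range 0..k-1, e.g. with pd.factorize, and the function name is just a placeholder):

import numpy as np
from numba import njit

@njit
def weighted_sum_running(codes, values, n_ids):
    # Single pass over the rows, keeping only a running sum and count per ID,
    # so the auxiliary memory is O(number of unique IDs) rather than O(rows x IDs).
    sums = np.zeros(n_ids)            # running sum of Value per ID
    counts = np.zeros(n_ids)          # running element count per ID
    avg_total = 0.0                   # running sum of the per-ID averages
    k = 0                             # number of distinct IDs seen so far
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        c = codes[i]
        old_avg = sums[c] / counts[c] if counts[c] > 0 else 0.0
        if counts[c] == 0:
            k += 1
        sums[c] += values[i]
        counts[c] += 1.0
        avg_total += sums[c] / counts[c] - old_avg
        out[i] = avg_total / k
    return out

# Usage (assuming columns 'ID' and 'Value'):
# codes, uniques = pd.factorize(df['ID'])
# df['Weighted Sum'] = weighted_sum_running(codes, df['Value'].to_numpy(np.float64), len(uniques))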
CodePudding user response:
Here's a benchmark showing the performance of pandas, numpy and numba/numpy solutions at various row counts (12 to 36,000) and unique ID counts (3 to 9,000):
rows 12, unique ID values: 3
Timeit results:
foo_1 (pandas) ran in 0.003095399937592447 seconds using 1 iterations
foo_2 (numpy) ran in 0.0003358999965712428 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00018770003225654364 seconds using 1 iterations
rows 120, unique ID values: 30
Timeit results:
foo_1 (pandas) ran in 0.0024368000449612737 seconds using 1 iterations
foo_2 (numpy) ran in 0.001127400086261332 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00029390002600848675 seconds using 1 iterations
rows 1200, unique ID values: 300
Timeit results:
foo_1 (pandas) ran in 0.01624089991673827 seconds using 1 iterations
foo_2 (numpy) ran in 0.009926999919116497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.002144100028090179 seconds using 1 iterations
rows 12000, unique ID values: 3000
Timeit results:
foo_1 (pandas) ran in 2.391147599904798 seconds using 1 iterations
foo_2 (numpy) ran in 0.2884287000633776 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.1226186000276357 seconds using 1 iterations
rows 36000, unique ID values: 9000
Timeit results:
foo_1 (pandas) ran in 44.33448620000854 seconds using 1 iterations
foo_2 (numpy) ran in 3.0259654000401497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 1.6273660999722779 seconds using 1 iterations
The pandas solution creates an intermediate dataframe that is num IDs x num rows in size. The numpy and numpy/numba solutions calculate results column by column, so they create a handful of intermediate 1-D arrays of length num rows. The numpy/numba solution is consistently 2-5 times faster than numpy, and pandas is 2-10 times slower than numpy.
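To make those intermediate sizes concrete, here is an illustration for the 12-row example (assuming df is that frame; the pivot is the same one pd_foo_1 builds in the code below):

import numpy as np
# pandas pivot intermediate: one column per unique ID
pivot = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
print(pivot.shape)                              # (12, 3): num rows x num unique IDs
# numpy / numba intermediates: a handful of 1-D arrays of length num rows
print(np.cumsum(df.ID.to_numpy() == 1).shape)   # (12,)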
Upping the size a bit more gives the following result (where the pandas solution is commented out):
rows 120000, unique ID values: 30000
Timeit results:
foo_1 (pandas) ran in 6.00004568696022e-06 seconds using 1 iterations
foo_2 (numpy) ran in 28.882483799941838 seconds using 1 iterations
foo_3 (numpy_numba) ran in 38.77682559995446 seconds using 1 iterations
So it appears that there is a threshold above which the numpy/numba solution loses ground to regular numpy.
Full test code:
import pandas as pd
# insert code to initialize dfInit here
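# For example (assumed from the table in the question; the original
# initialization code was not shown), dfInit can be built as:
dfInit = pd.DataFrame({
    'Index': range(1, 13),
    'ID':         [1, 1, 2, 1, 2, 2, 2, 3, 3, 1, 3, 3],
    'Value':      [2, 3, 6, 1, 3, 9, 8, 5, 6, 8, 2, 7],
    'Div Factor': [1, 2, 1, 3, 2, 3, 4, 1, 2, 4, 3, 4],
})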
print(dfInit)
'''
Index ID Value Div Factor
1 1 2 1
2 1 3 2
3 2 6 1
4 1 1 3
5 2 3 2
6 2 9 3
7 2 8 4
8 3 5 1
9 3 6 2
10 1 8 4
11 3 2 3
12 3 7 4
'''
def initDf(colMult=1):
    # Replicate the 12-row frame colMult times, shifting ID and Index so that
    # each copy contributes new unique IDs.
    df = dfInit.copy()
    dfMult = pd.concat([df.assign(ID=dfInit.ID + 3*i, Index=df.Index + len(df)*i) for i in range(colMult)], axis=0).reset_index(drop=True)
    print(f'\nrows {len(dfMult)}, unique ID values: {len(dfMult.ID.unique())}')
    return dfMult
df = initDf()
def pd_foo_1(df):
    # Pivot Value into one column per ID, take cumulative per-group means,
    # then average them across the ID columns (NaN groups are skipped).
    df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
    df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
    df['Weighted Sum'] = df2.mean(axis=1)
    return df
def foo_1(df):
    #return None
    try:
        return pd_foo_1(df)
    except ValueError:
        print('overflow encountered')
        return None
import numpy as np
def foo_2(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    uniqIds = df.ID.unique()
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)
    for id in uniqIds:
        # Cumulative count and cumulative sum of this ID's values at each row
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        # Accumulate this ID's running average (0 until it first appears)
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    df['Weighted Sum'] = weightedSum
    return df
from numba import njit
@njit
def np_foo_3(values, ids):
    # Same per-ID cumulative logic as foo_2, compiled with numba
    uniqIds = np.unique(ids)
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)
    for id in uniqIds:
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    return weightedSum
def foo_3(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    weightedSum = np_foo_3(values, ids)
    df['Weighted Sum'] = weightedSum
    return df
foo_count = 3
foo_names = ['foo_' + str(i + 1) for i in range(foo_count)]
foo_labels = ['pandas', 'numpy', 'numpy_numba']
exec("foo_funcs=[" + ','.join(f"foo_{str(i + 1)}" for i in range(foo_count)) + "]")
for foo in foo_names:
    print(f'{foo} output:')
    #print(eval(f"{foo}(df)"))
    eval(f"{foo}(df)"); print("... output suppressed.")
# ===================== BENCHMARK with timeit:
from timeit import timeit
n = 1
for colMult in [1, 10, 100, 1000, 3000, 10000]:
    df = initDf(colMult)
    print(f'Timeit results:')
    for i, foo in enumerate(foo_names):
        t = timeit(f"{foo}(df)", setup=f"from __main__ import df, {foo}", number=n) / n
        print(f'{foo} ({foo_labels[i]}) ran in {t} seconds using {n} iterations')
# ===================== ... END BENCHMARK with timeit.
Space used:
The memory of a pandas solution which pivots IDs is proportional to num rows x num unique IDs. By comparison, a solution that loops over IDs, processing one copy of the Value column at a time, uses memory proportional to num rows.
This means that 10^7 rows x 10^6 unique IDs, or about 10^13 values of 4-8 bytes each (call it 10^14 bytes, or 100,000 GB), makes storing the pivot table in program memory infeasible for pandas.
However, roughly 10 one-dimensional arrays of 10^7 8-byte values use on the order of 10^9 bytes, or 1 GB, of program memory in the looping solutions (numpy or numpy/numba above).
Note that adding a nested loop over chunks of a fixed number of rows would let us cap the memory usage of most of those roughly 10 1-D arrays, but we still need to compute at least one full-length array (the result), so this can never give more than about a factor-of-10 reduction in memory footprint; a rough sketch of that variant follows.
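An untested sketch of that chunked variant, keeping the structure of foo_2 above and carrying the per-ID running totals between chunks (foo_2_chunked is a hypothetical name):

import numpy as np

def foo_2_chunked(values, ids, chunk=100_000):
    # Same per-ID cumulative logic as foo_2, but the O(num rows) temporaries are
    # replaced by O(chunk) temporaries; only the output array is full length.
    uniqIds = np.unique(ids)
    runSums = {id: 0.0 for id in uniqIds}   # per-ID totals carried across chunks
    runCnts = {id: 0 for id in uniqIds}
    out = np.empty(len(values))
    for start in range(0, len(values), chunk):
        stop = min(start + chunk, len(values))
        v, g = values[start:stop], ids[start:stop]
        aggSums = np.zeros(stop - start)
        aggCnts = np.zeros(stop - start)
        for id in uniqIds:
            mask = (g == id)
            cumCnt = runCnts[id] + np.cumsum(mask)
            cumSum = runSums[id] + np.cumsum(np.where(mask, v, 0.0))
            aggSums += cumSum / (cumCnt + (cumCnt == 0))
            aggCnts += (cumCnt > 0)
            runCnts[id], runSums[id] = cumCnt[-1], cumSum[-1]
        out[start:stop] = aggSums / aggCnts
    return out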