apply() a custom function to all columns: how to increase efficiency

I apply this function

def calculate_recency_for_one_column(column: pd.Series) -> int:
    """Returns the inverse position of the last non-zero value in a pd.Series of numerics.
    If the last value is non-zero, returns 1. If all values are zero, returns 0."""
    non_zero_values_of_col = column[column.astype(bool)]
    if non_zero_values_of_col.empty:
        return 0
    return len(column) - non_zero_values_of_col.index[-1]

to all columns of this example dataframe

df = pd.DataFrame(np.random.binomial(n=1, p=0.001, size=[1000000]).reshape((1000,1000)))

by using

df.apply(lambda column: calculate_recency_for_one_column(column),axis=0)

The result is:

0      436
1        0
2      624
3        0
      ... 
996    155
997    715
998    442
999    163
Length: 1000, dtype: int64

Everything works fine, but my program has to do this operation often, so I need a more efficient alternative. Does anybody have an idea how to make this faster? I think calculate_recency_for_one_column() is efficient enough and df.apply() has the most potential for improvement. Here is a benchmark (100 reps):

>> timeit.timeit(lambda: df.apply(lambda column: calculate_recency_for_one_column(column),axis=0), number=100)
14.700050864834338

Update

Mustafa's answer:

>> timeit.timeit(lambda: pd.Series(np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())), number=100)
0.8847485752776265

padu's answer:

>> timeit.timeit(lambda: df.apply(calculate_recency_for_one_column_numpy, raw=True, axis=0), number=100)
0.8892530500888824

CodePudding user response：

You can treat the columns as NumPy arrays instead of Series objects. To do this, pass raw=True to the apply method. The original function also needs to be changed slightly.

import time

import numpy as np
import pandas as pd


def calculate_recency_for_one_column(column: np.ndarray) -> int:
    """Returns the inverse position of the last non-zero value in a np.ndarray of numerics.
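    If the last value is non-zero, returns 1. If all values are zero, returns 0."""
    non_zero_values_of_col = np.nonzero(column)[0]
    # np.nonzero returns positions, so test for emptiness with .size:
    # .any() would be False when the only non-zero value sits at position 0.
    if non_zero_values_of_col.size == 0: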
        return 0
    return len(column) - non_zero_values_of_col[-1]


df = pd.DataFrame(np.random.binomial(n=1, p=0.001, size=[1000000]).reshape((1000,1000)))


start = time.perf_counter()
res = df.apply(calculate_recency_for_one_column, raw=True)
print(f'time took {time.perf_counter() - start:.3f} s.')

Out:
    0.005 s.
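A quick way to gain confidence in the raw=True variant is to compare it with the Series-based function on a tiny frame that can be checked by hand. The sketch below is only an illustration: the names recency_series, recency_raw and check are made up for this example, and it assumes the default RangeIndex. It includes a column whose only non-zero value sits in row 0, which is the case where an emptiness test on the np.nonzero result has to use .size (.any() is False for the single position 0).

```python
import numpy as np
import pandas as pd


def recency_series(column: pd.Series) -> int:
    """Series-based variant, as in the question (assumes a default RangeIndex)."""
    non_zero = column[column.astype(bool)]
    if non_zero.empty:
        return 0
    return len(column) - non_zero.index[-1]


def recency_raw(column: np.ndarray) -> int:
    """ndarray-based variant, intended for df.apply(..., raw=True)."""
    positions = np.nonzero(column)[0]  # positions of the non-zero entries
    if positions.size == 0:            # no non-zero entry at all
        return 0
    return len(column) - positions[-1]


# Hypothetical hand-checkable data, 4 rows per column.
check = pd.DataFrame({
    "only_first_row": [1, 0, 0, 0],  # last non-zero at row 0 -> 4 - 0 = 4
    "all_zero": [0, 0, 0, 0],        # no non-zero value      -> 0
    "last_row": [0, 0, 1, 1],        # last non-zero at row 3 -> 4 - 3 = 1
})

via_series = check.apply(recency_series)
via_raw = check.apply(recency_raw, raw=True)
print(via_raw.tolist())  # [4, 0, 1]
```

Both variants should agree on every column; if they do not, the emptiness check is the first place to look.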

CodePudding user response：

np.where is a vectorized if-else, so:

np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())

For a given column, are all of its values equal to 0?
- if so, put 0 in the result
- else, get the index of the last 1 (hence the reversal with [::-1]) and subtract it from len(df)
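To see the individual steps, here is a small sketch on a frame that is easy to verify by hand. The frame small and the intermediate names all_zero and last_one are made up for this illustration, and the approach assumes the default RangeIndex, because idxmax returns index labels rather than positions.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 data, 4 rows per column.
small = pd.DataFrame({
    "a": [0, 1, 0, 0],  # last 1 at row 1 -> 4 - 1 = 3
    "b": [0, 0, 0, 0],  # all zeros       -> 0
    "c": [1, 0, 0, 1],  # last 1 at row 3 -> 4 - 3 = 1
})

# Step 1: which columns consist only of zeros?
all_zero = small.eq(0).all()

# Step 2: reverse the rows so that idxmax (first maximum) finds the last 1.
# For the all-zero column this label is meaningless; np.where masks it out.
last_one = small[::-1].idxmax()

# Step 3: vectorized if-else over the columns.
recency = np.where(all_zero, 0, len(small) - last_one)
print(recency)  # [3 0 1]
```

Note that np.where returns a plain NumPy array, so wrap it in pd.Series(..., index=df.columns) if the column labels are needed.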

A timing comparison:

In [261]: %timeit np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())
10.6 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [262]: %timeit df.apply(calculate_recency_for_one_column)
180 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And a sanity check:

In [263]: (np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())
                 == df.apply(calculate_recency_for_one_column)).all()
Out[263]: True