I'd like to sum the values of dataframe A over axis=1, masked on the qcut bins of dataframe B computed over axis=1. A and B have identical indices and columns. The values of q could be different to those given in the example. Dataframe B can have NaNs.
Example
import pandas as pd
import numpy as np
A = pd.DataFrame(
{'a': [4, 8, -2, -6, 7, -3, 4],
'b': [-8, 8, 6, 6, 5, -3, -2],
'e': [1, 34, -13, 5, 0, 8, -76],
'c': [1, 2, 6, 3, 8, -4, -20],
'd': [4, 0, 3, -4, 8, 4, 2]})
B = pd.DataFrame(
{'a': [34, 78, -2, -96, np.nan, -34, 44],
'b': [-82, 28, 96, 46, np.nan, -3, -20],
'e': [12, 324, -123, 56, np.nan, 8, -876],
'c': [np.nan, 28, 96, np.nan, 8, -34, -20],
'd': [42, -40, 23, -40, -50, 10, 97]})
q = [0, 0.33, 0.66, 1]
out =
0 1 2
0 -8.0 1.0 8.0
1 10.0 0.0 42.0
2 -15.0 3.0 12.0
3 -6.0 -4.0 11.0
4 8.0 0.0 8.0
5 -7.0 -3.0 12.0
6 -98.0 0.0 6.0
Approach
One possible approach that seems to work is:
bins = B.apply(lambda x: pd.qcut(x, q, labels=False), axis=1)
out = pd.concat(
[A.mask(bins!=x, np.nan).sum(axis=1) for x in range(len(q)-1)],
axis=1)
However, this is very slow; the majority of the time is spent applying the qcuts.
Timing
My dataframes are approximately 3000 rows by 500 columns. Applying the qcuts takes 2.7 seconds, with a further 130 milliseconds spent in summing.
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
A = pd.DataFrame(np.random.rand(D,C), index=dt_index)
B = pd.DataFrame(np.random.rand(D,C), index=dt_index)
q = [0, 0.33, 0.66, 1]
%timeit bins = B.apply(lambda x: pd.qcut(x, q, labels=False), axis=1)
2.74 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit out = pd.concat([A.mask(bins!=x, np.nan).sum(axis=1) for x in range(len(q)-1)], axis=1)
131 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Is there a significantly faster way to achieve the same result?
CodePudding user response:
Overview and result
Interesting question. pd.qcut() is a nice function. If you are looking for something faster, you might want to take a look at a numpy implementation. Unfortunately, numpy does not seem to have an exact equivalent. A combination of np.nanpercentile() and np.digitize() can work, but it handles NaNs and bin edges a little differently than pandas, so you need some extra (ugly) handling to reproduce the original result exactly (a small illustration follows the list below). I tested three implementations:
- Your implementation (variable bins): 1460 ms
- Implementation with a numpy loop (variable bins2): 590 ms
- Implementation with a numpy list comprehension (variable bins3): 325 ms
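As a minimal sketch of that extra handling (using made-up toy values, not the original data): pd.qcut puts the minimum in bin 0 and leaves NaN as NaN, while the raw np.digitize output needs the two corrections used in makeBins2/makeBins3 below.
import numpy as np
import pandas as pd
row = np.array([1.0, 5.0, 9.0, np.nan])
q = [0, 0.33, 0.66, 1]
# pandas: the minimum lands in bin 0 and NaN stays NaN
print(pd.qcut(row, q, labels=False))               # -> 0, 1, 2, nan
# numpy: same quantile edges, then digitize (1-based, so subtract 1)
edges = np.nanpercentile(row, np.array(q) * 100)
raw = np.digitize(row, edges, right=True) - 1
print(raw)                                         # -> -1, 1, 2, 3 (minimum falls below bin 0, NaN lands past the last bin)
# the two corrections applied in makeBins2/makeBins3
fixed = np.where(raw > (len(q) - 2), np.nan, raw)  # map NaNs back to NaN
fixed = np.where(fixed < 0, 0, fixed)              # clamp the minimum into bin 0
print(fixed)                                       # -> 0, 1, 2, nan, matching pd.qcut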
To make sure the end result is the same, you can cast the resulting dataframes to a nullable Int64 type and use the df.equals() function.
Numpy loop:
def makeBins2(df, q):
    # Use nanpercentile as there are NaNs in the original dataframe
    numpy_quantiles = np.nanpercentile(df, np.array(q) * 100, axis=1).T
    df_nparray = df.to_numpy()
    myList = []
    for i in range(len(df_nparray)):
        row = df_nparray[i]
        np_cut = numpy_quantiles[i]
        # Correct for 1-based vs 0-based bin indices
        result = np.digitize(row, np_cut, right=True) - 1
        # Correct for NaNs and the left edge
        result = np.where(result > (len(q) - 2), np.nan, result)
        result = np.where(result < 0, 0, result)
        myList.append(result)
    df_out = pd.DataFrame(myList, columns=list(df))
    df_out.index = df.index
    return df_out
Numpy list comprehension:
def row_function(row, q):
    data = row[:-len(q)]
    np_cut = row[-len(q):]
    # Correct for 1-based vs 0-based bin indices
    result = np.digitize(data, np_cut, right=True) - 1
    # Correct for NaNs and the left edge
    result = np.where(result > (len(q) - 2), np.nan, result)
    result = np.where(result < 0, 0, result)
    return result
def makeBins3(df, q):
    # Use nanpercentile as there are NaNs in the original dataframe
    numpy_quantiles = np.nanpercentile(df, np.array(q) * 100, axis=1).T
    npArray = df.to_numpy()
    # Make a single matrix (data columns followed by quantile edges) for the row comprehension
    npAll = np.concatenate((npArray, numpy_quantiles), axis=1)
    # List comprehension over rows
    myList = np.array([row_function(x, q) for x in npAll])
    df_out = pd.DataFrame(myList, columns=list(df))
    df_out.index = df.index
    return df_out
Check result:
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
A = pd.DataFrame(np.random.rand(D,C), index=dt_index)
B = pd.DataFrame(np.random.rand(D,C), index=dt_index)
q = [0, 0.33, 0.66, 1]
bins = B.apply(lambda x: pd.qcut(x, q, labels=False), axis=1)
bins2 = makeBins2(B,q)
bins3 = makeBins3(B,q)
print(bins.astype(pd.Int64Dtype()).equals(bins2.astype(pd.Int64Dtype())))
print(bins.astype(pd.Int64Dtype()).equals(bins3.astype(pd.Int64Dtype())))
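For completeness, the faster bins drop straight into the summation step from the question, e.g. with the list-comprehension variant:
out = pd.concat(
    [A.mask(bins3 != x, np.nan).sum(axis=1) for x in range(len(q) - 1)],
    axis=1)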