I'd like to sum the values of dataframe A over axis=1, masked on the qcut bins of dataframe B computed over axis=1. A and B have identical indices and columns. The values of q could be different to those given in the example. Dataframe B can have NaNs.
Example
import pandas as pd
import numpy as np
A = pd.DataFrame(
{'a': [4, 8, -2, -6, 7, -3, 4],
'b': [-8, 8, 6, 6, 5, -3, -2],
'e': [1, 34, -13, 5, 0, 8, -76],
'c': [1, 2, 6, 3, 8, -4, -20],
'd': [4, 0, 3, -4, 8, 4, 2]})
B = pd.DataFrame(
{'a': [34, 78, -2, -96, np.nan, -34, 44],
'b': [-82, 28, 96, 46, np.nan, -3, -20],
'e': [12, 324, -123, 56, np.nan, 8, -876],
'c': [np.nan, 28, 96, np.nan, 8, -34, -20],
'd': [42, -40, 23, -40, -50, 10, 97]})
q = [0, 0.33, 0.66, 1]
out =
0 1 2
0 -8.0 1.0 8.0
1 10.0 0.0 42.0
2 -15.0 3.0 12.0
3 -6.0 -4.0 11.0
4 8.0 0.0 8.0
5 -7.0 -3.0 12.0
6 -98.0 0.0 6.0
Approach
One possible approach that seems to work is:
bins = B.apply(lambda x: pd.qcut(x, q, labels=False), axis=1)
out = pd.concat(
[A.mask(bins!=x, np.nan).sum(axis=1) for x in range(len(q)-1)],
axis=1)
However, this is very slow; the majority of the time is spent applying the qcuts.
Timing
My dataframes are approximately 3000 rows by 500 columns. Applying the qcuts takes 2.7 seconds, with a further 130 milliseconds spent in summing.
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
A = pd.DataFrame(np.random.rand(D,C), index=dt_index)
B = pd.DataFrame(np.random.rand(D,C), index=dt_index)
q = [0, 0.33, 0.66, 1]
%timeit bins = B.apply(lambda x: pd.qcut(x, q, labels=False), axis=1)
2.74 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit out = pd.concat([A.mask(bins!=x, np.nan).sum(axis=1) for x in range(len(q)-1)], axis=1)
131 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Is there a significantly faster way to achieve the same result?
CodePudding user response:
Overview and result
Interesting question. pd.qcut() is a nice function. If you are looking for something faster, you might want to take a look at a numpy implementation. Unfortunately, numpy does not seem to have an exact equivalent. A combination of np.nanpercentile() and np.digitize() can work, but it handles NaNs and bin edges a little differently than pandas, so you need some extra (ugly) handling to reproduce the original result exactly (a small illustration follows the list below). I tested three implementations:
- Your implementation (variable bins): 1460 ms
- Implementation with a numpy loop (variable bins2): 590 ms
- Implementation with a numpy list comprehension (variable bins3): 325 ms
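As a minimal sketch of that extra handling (using made-up toy values, not the original data): pd.qcut puts the minimum in bin 0 and leaves NaN as NaN, while the raw np.digitize output needs the two corrections used in makeBins2/makeBins3 below.
import numpy as np
import pandas as pd
row = np.array([1.0, 5.0, 9.0, np.nan])
q = [0, 0.33, 0.66, 1]
# pandas: the minimum lands in bin 0 and NaN stays NaN
print(pd.qcut(row, q, labels=False))               # -> 0, 1, 2, nan
# numpy: same quantile edges, then digitize (1-based, so subtract 1)
edges = np.nanpercentile(row, np.array(q) * 100)
raw = np.digitize(row, edges, right=True) - 1
print(raw)                                         # -> -1, 1, 2, 3 (minimum falls below bin 0, NaN lands past the last bin)
# the two corrections applied in makeBins2/makeBins3
fixed = np.where(raw > (len(q) - 2), np.nan, raw)  # map NaNs back to NaN
fixed = np.where(fixed < 0, 0, fixed)              # clamp the minimum into bin 0
print(fixed)                                       # -> 0, 1, 2, nan, matching pd.qcut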
To make sure the end result is the same, you can cast the resulting dataframes to a nullable Int64 type and use the df.equals() function.
Numpy loop:
def makeBins2(df, q):
    # Use nanpercentile as there are NaNs in the original dataframe
    numpy_quantiles = np.nanpercentile(df, np.array(q) * 100, axis=1).T
    df_nparray = df.to_numpy()
    myList = []
    for i in range(len(df_nparray)):
        row = df_nparray[i]
        np_cut = numpy_quantiles[i]
        # Correct for 1-based vs 0-based bin indices
        result = np.digitize(row, np_cut, right=True) - 1
        # Correct for NaNs and the left edge
        result = np.where(result > (len(q) - 2), np.nan, result)
        result = np.where(result < 0, 0, result)
        myList.append(result)
    df_out = pd.DataFrame(myList, columns=list(df))
    df_out.index = df.index
    return df_out
Numpy list comprehension:
def row_function(row, q):
    data = row[:-len(q)]
    np_cut = row[-len(q):]
    # Correct for 1-based vs 0-based bin indices
    result = np.digitize(data, np_cut, right=True) - 1
    # Correct for NaNs and the left edge
    result = np.where(result > (len(q) - 2), np.nan, result)
    result = np.where(result < 0, 0, result)
    return result
def makeBins3(df, q):
    # Use nanpercentile as there are NaNs in the original dataframe
    numpy_quantiles = np.nanpercentile(df, np.array(q) * 100, axis=1).T
    npArray = df.to_numpy()
    # Make a single matrix (data columns followed by quantile edges) for the row comprehension
    npAll = np.concatenate((npArray, numpy_quantiles), axis=1)
    # List comprehension over rows
    myList = np.array([row_function(x, q) for x in npAll])
    df_out = pd.DataFrame(myList, columns=list(df))
    df_out.index = df.index
    return df_out
Check result:
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
A = pd.DataFrame(np.random.rand(D,C), index=dt_index)
B = pd.DataFrame(np.random.rand(D,C), index=dt_index)
q = [0, 0.33, 0.66, 1]
bins = B.apply(lambda x: pd.qcut(x, q, labels=False), axis=1)
bins2 = makeBins2(B,q)
bins3 = makeBins3(B,q)
print(bins.astype(pd.Int64Dtype()).equals(bins2.astype(pd.Int64Dtype())))
print(bins.astype(pd.Int64Dtype()).equals(bins3.astype(pd.Int64Dtype())))
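For completeness, the faster bins drop straight into the summation step from the question, e.g. with the list-comprehension variant:
out = pd.concat(
    [A.mask(bins3 != x, np.nan).sum(axis=1) for x in range(len(q) - 1)],
    axis=1)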