Suppose we have a DataFrame df_example with two columns (colA and colB) and 3 rows, as shown in the following code:
import numpy as np
import pandas as pd
df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())
Output:
   colA  colB
0    10    40
1    20    50
2    30    60
I need to retrieve multiple subsets from colA based on boolean masks.
For example, assume we want to extract 5 subsets from colA. Then we have 5 boolean masks, each containing 3 boolean elements (because colA contains 3 values/rows). I store the masks in a matrix called mask_matrix, where each mask is stored as a row.
mask_matrix = np.array([
    # we need to get 5 subsets (so we have 5 rows in the mask_matrix)
    [True,  False, True ],  # 1st subset: get the 1st and 3rd value from colA
    [False, False, True ],  # 2nd subset: get the 3rd value from colA
    [True,  True,  True ],  # 3rd subset: get all values from colA
    [False, False, False],  # 4th subset: get no values from colA
    [False, True,  False],  # 5th subset: get the 2nd value from colA
])
I need to apply each mask (each row of mask_matrix) to colA and store the 5 results in a numpy array with dtype='object' (because the returned results have different lengths).
I can do this task sequentially with the following code:
# I append the subsets here sequentially (this needs to be changed)
result = []
# I need to make this loop parallel
# And if possible, I need to get the result as a numpy array directly (not as a list)
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)
# converting the result from a list to a numpy array (probably we won't need this in a parallel solution?)
result = np.array(result, dtype='object')
Then I print the result:
# printing the result (just for clarification)
print('result:', type(result))
for x in result:
    print(' ', x)
print()
print('result.dtype:', result.dtype)
print('result.shape:', result.shape)
The output looks like this:
result: <class 'numpy.ndarray'>
[10 30]
[30]
[10 20 30]
[]
[20]
result.dtype: object
result.shape: (5,)
The output is correct. However, I want to make this code run faster. Instead of the sequential for loop (for row in mask_matrix:), I want to vectorize the process so that it runs like NumPy's vectorized operations rather than one mask at a time. My example uses very small data, but in practice I will run this code on large data with a large number of masks.
Is there a way to vectorize the operations performed by this for loop? I would prefer a solution that uses only NumPy and/or pandas, without any external libraries (if possible). I would appreciate any help.
CodePudding user response:
To speed up the process, I did the following:
import pandas as pd
import numpy as np
import time
df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())
mask_matrix = np.array([
    # we need to get 5 subsets (so we have 5 rows in the mask_matrix)
    [True,  False, True ],  # 1st subset: get the 1st and 3rd value from colA
    [False, False, True ],  # 2nd subset: get the 3rd value from colA
    [True,  True,  True ],  # 3rd subset: get all values from colA
    [False, False, False],  # 4th subset: get no values from colA
    [False, True,  False],  # 5th subset: get the 2nd value from colA
])
# Baseline: the sequential loop from the question
result = []
start = time.process_time()
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)
print("Baseline: {}".format(time.process_time() - start))
#====================================
# NEW CODE HERE
# Convert the DataFrame to a NumPy array with one row per column
df_matrix = df_example.to_numpy().T
# Loop through all columns if desired
for column_idx in range(df_matrix.shape[0]):
    start = time.process_time()
    # Broadcasting multiplies the column (shape (3,)) with every mask row (shape (5, 3))
    values = np.multiply(df_matrix[column_idx], mask_matrix)
    print("Vectorized: {}".format(time.process_time() - start))
    break  # only colA (the first column) is needed here
print(values.shape)
print(values)
Apart from the df_example.head() printout, this prints the following:
Baseline: 0.0007300000000000084
Vectorized: 4.500000000007276e-05
(5, 3)
[[10  0 30]
 [ 0  0 30]
 [10 20 30]
 [ 0  0  0]
 [ 0 20  0]]
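Note that the result is a padded (5, 3) matrix in which masked-out positions are 0, so a masked-out value cannot be told apart from a genuine 0 in the data. From here there are a few approaches. For example, instead of using False in the Boolean matrix, you could use np.nan (with 1.0 in place of True), which yields the following: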
[[10. nan 30.]
 [nan nan 30.]
 [10. 20. 30.]
 [nan nan nan]
 [nan 20. nan]]
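As a minimal sketch of one way to get this NaN-filled matrix without building a separate float mask, np.where can choose between the column values and np.nan directly. The names col_a and values_nan are chosen here for illustration and are not part of the code above.

```python
import numpy as np

col_a = np.array([10, 20, 30])
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

# Broadcast the 1-D column against the 2-D mask: keep the value where the
# mask is True and put NaN elsewhere. The result is float because NaN is a float.
values_nan = np.where(mask_matrix, col_a, np.nan)
print(values_nan)
```

The result is still a padded (5, 3) matrix, and integer data is converted to float along the way.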
If you wanted to remove the NaNs to get the variable-length subsets from the question, the straightforward way is to loop through the rows, but I believe this would be inefficient.
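A possible alternative that does the masking without a per-mask loop is sketched below. It relies on np.nonzero returning the True positions in row-major order, then cuts the flat selection into pieces with np.split. The names col_a, flat_values, counts and subsets are illustrative, and this has not been benchmarked against the loop, so treat any speedup on large data as something to measure rather than assume.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

col_a = df_example['colA'].to_numpy()

# np.nonzero on a 2-D array returns (row_indices, column_indices) of the True
# entries in row-major order, so the selected values come out grouped by mask.
_, col_idx = np.nonzero(mask_matrix)
flat_values = col_a[col_idx]

# Number of True entries per mask; the cumulative sum gives the split points.
counts = mask_matrix.sum(axis=1)
subsets = np.split(flat_values, np.cumsum(counts)[:-1])

# Pack into an object array. Filling a pre-allocated array element by element
# avoids NumPy building a 2-D array when all subsets have the same length.
result = np.empty(len(subsets), dtype=object)
for i, subset in enumerate(subsets):
    result[i] = subset

for x in result:
    print(' ', x)
```

The remaining loop only stores references to the already-computed pieces; the selection itself is done by a single fancy-indexing operation.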