How to select multiple subsets from a pandas DataFrame column using multiple boolean masks in parall

Time:10-13

Suppose we have a DataFrame df_example with two columns (colA and colB), and 3 rows as shown in the following code:

import numpy as np
import pandas as pd
df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())

output:

   colA  colB
0    10    40
1    20    50
2    30    60

I need to retrieve multiple subsets from colA based on boolean masks.

For example, assume we want to extract 5 subsets from colA. Then we have 5 boolean masks, each containing 3 boolean elements (because colA has 3 values/rows). I store the masks in a matrix called mask_matrix, with one mask per row.

mask_matrix = np.array([
    # we need to get 5 subsets (so we have 5 rows in the mask_matrix)
    [True,  False, True ], # 1st subset: get the 1st and 3rd value from colA,
    [False, False, True ], # 2nd subset: get the 3rd value from colA,
    [True,  True,  True ], # 3rd subset: get all values from colA,
    [False, False, False], # 4th subset: get no values from colA,
    [False, True,  False]  # 5th subset: get the 2nd value from colA,
])

I need to apply each mask (each row of mask_matrix) to colA and store the 5 results in a numpy array with dtype='object' (because the results have different lengths).

I can do this task sequentially with the following code:

# I append the subsets here sequentially (this needs to be changed)
result = []

# I need to make this loop parallel
# And if possible, I need to get the result as a numpy array directly (not as a list)
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)

# converting the result from a list to a numpy array (probably we won't need this in parallel solution?)
result = np.array(result, dtype='object')
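One caveat about the conversion step above, shown as a small sketch rather than a definitive fix: if every subset happens to have the same length, np.array(..., dtype='object') stacks them into a 2-D array instead of a 1-D array of arrays. The helper name to_object_array and the sample data are hypothetical and only serve the illustration.

```python
import numpy as np

def to_object_array(subsets):
    """Pack a list of 1-D arrays into a 1-D object array, one subset per slot."""
    out = np.empty(len(subsets), dtype=object)
    for i, subset in enumerate(subsets):
        out[i] = subset  # store the whole array as a single element
    return out

# Hypothetical case: two subsets that happen to have the same length
same_length = [np.array([10, 30]), np.array([20, 30])]

stacked = np.array(same_length, dtype=object)  # numpy stacks these into 2-D
packed = to_object_array(same_length)          # stays 1-D, one array per slot

print(stacked.shape)  # (2, 2)
print(packed.shape)   # (2,)
```

Pre-allocating with np.empty keeps result.shape equal to (number of masks,) regardless of how the subset lengths turn out.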

And then print the result:

# printing the result (just for clarification)
print('result:', type(result))
for x in result:
    print('  ', x)
    
print()
print('result.dtype:', result.dtype)
print('result.shape:', result.shape)

The output looks like this:

result: <class 'numpy.ndarray'>
   [10 30]
   [30]
   [10 20 30]
   []
   [20]

result.dtype: object
result.shape: (5,)

The output is correct. However, I want to make this code run faster. Instead of the sequential for loop (for row in mask_matrix:), I want to vectorize the process and make it run in parallel (like numpy vectorized operations). My example works on very small data, but in practice I will run this code on large data with a large number of masks.

Is there a way to vectorize the operations performed by the for loop I mentioned? I prefer a way that uses numpy and/or pandas without any external libraries (if possible). I will appreciate any help.

CodePudding user response:

So in order to speed up the process I did the following:

import pandas as pd
import numpy as np
import time
df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())

mask_matrix = np.array([
    # we need to get 5 subsets (so we have 5 rows in the mask_matrix)
    [True,  False, True ], # 1st subset: get the 1st and 3rd value from colA,
    [False, False, True ], # 2nd subset: get the 3rd value from colA,
    [True,  True,  True ], # 3rd subset: get all values from colA,
    [False, False, False], # 4th subset: get no values from colA,
    [False, True,  False]  # 5th subset: get the 2nd value from colA,
])

# I append the subsets here sequentially (this needs to be changed)
result = []

# I need to make this loop parallel
# And if possible, I need to get the result as a numpy array directly (not as a list)
start = time.process_time()
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)
print("Baseline: {}".format( time.process_time() - start))


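# ====================================
# New code starts here
# Convert the DataFrame to a numpy array; after .T each row holds one column of df_example
df_matrix = df_example.to_numpy().T
# Loop through all columns if desired (the break stops after the first column, colA)
for column_idx in range(df_matrix.shape[0]):
    start = time.process_time()
    # Broadcasting multiplies the column by every mask row; masked-out entries become 0
    values = np.multiply(df_matrix[column_idx], mask_matrix)
    print("Vectorized: {}".format(time.process_time() - start))
    break
print(values.shape)
print(values)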

This prints the following:

Baseline: 0.0007300000000000084
Vectorized: 4.500000000007276e-05
(5, 3)
[[10  0 30]
 [ 0  0 30]
 [10 20 30]
 [ 0  0  0]
 [ 0 20  0]]
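With only 3 rows, timings like these mostly reflect fixed overhead, so they say little about large inputs. The following is a rough benchmarking sketch on larger random data. The sizes, the seed, and the names best_of, loop_version and multiply_version are all assumptions made for illustration, and the loop here indexes a plain numpy array instead of the DataFrame. Note that the two functions do not compute the same thing: the loop returns variable-length subsets, while the multiplication returns a full matrix with zeros at masked-out positions.

```python
import time
import numpy as np

def best_of(fn, repeats=5):
    """Return the fastest wall-clock time (in seconds) over several runs of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

rng = np.random.default_rng(0)              # fixed seed, arbitrary choice
col_a = rng.integers(0, 100, size=5_000)    # hypothetical stand-in for colA
big_masks = rng.random((200, 5_000)) < 0.5  # hypothetical mask matrix, 200 masks

def loop_version():
    # One boolean-indexing call per mask, as in the sequential baseline
    return [col_a[row] for row in big_masks]

def multiply_version():
    # Broadcast multiply: keeps the full (masks, rows) shape, zeros where masked out
    return np.multiply(col_a, big_masks)

print("loop:     {:.6f} s".format(best_of(loop_version)))
print("multiply: {:.6f} s".format(best_of(multiply_version)))
```

time.perf_counter measures wall-clock time at high resolution, and taking the minimum over several runs reduces noise from other processes.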

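Note that the masked-out positions are filled with 0 rather than removed, so every row still has 3 entries. From here there are a few approaches. For example, instead of False in the Boolean matrix you could use np.nan (numpy then stores the mask as floats, with True becoming 1.0), and the multiplication yields the following: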

[[10. nan 30.]
 [nan nan 30.]
 [10. 20. 30.]
 [nan nan  nan]
 [nan 20. nan]]
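A similar NaN-filled matrix can also be produced without editing the mask itself. This is a minimal sketch assuming the column is available as a plain numpy array (col_a and masked are illustrative names); it uses np.where, which broadcasts the column against every mask row.

```python
import numpy as np

col_a = np.array([10, 20, 30])
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

# Keep the value where the mask is True, put NaN elsewhere (result is float64)
masked = np.where(mask_matrix, col_a, np.nan)
print(masked)

# NaN-aware reductions then work per mask without unpacking the subsets
kept_counts = np.count_nonzero(~np.isnan(masked), axis=1)
row_sums = np.nansum(masked, axis=1)
print(kept_counts)  # [2 1 3 0 1]
print(row_sums)     # [40. 30. 60.  0. 20.]
```

If only per-subset aggregates (counts, sums, and so on) are needed, this form may be enough and the variable-length arrays never have to be built.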

If you wanted to remove the NaNs to get subsets of different lengths, you would have to loop through the rows, which I believe would be inefficient.
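One possible way to avoid writing that per-row loop explicitly is to gather all selected values in a single indexing step and then cut the flat result at the mask boundaries. This is a sketch under the assumption that the column is a plain numpy array; the variable names are illustrative. It relies on np.nonzero returning indices in row-major order, so values from the same mask stay contiguous.

```python
import numpy as np

col_a = np.array([10, 20, 30])
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

# Column index of every True entry, ordered mask by mask (row-major)
_, col_idx = np.nonzero(mask_matrix)

# All selected values in one flat array: [10 30 30 10 20 30 20]
flat_values = col_a[col_idx]

# Number of selected values per mask, and the boundaries between masks
counts = mask_matrix.sum(axis=1)
split_points = np.cumsum(counts)[:-1]

# Cut the flat array into one piece per mask (empty masks give empty arrays)
subsets = np.split(flat_values, split_points)
for subset in subsets:
    print(subset)
```

The selection itself is a single vectorized operation. np.split still creates one sub-array per mask, so whether this beats the plain loop on large inputs is something to measure rather than assume.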
