Efficient element-wise comparison between a pandas DataFrame and a list

Suppose that I have 2 objects:

A is a list of names
B is a pandas DataFrame with 3 columns, 'name', 'friend1', and 'friend2', which give a person's name and the names of their 2 best friends

For my application, I would like to know, for each person in A, which people in B have that person among their 2 best friends. To be specific, for each person in A, I would like a boolean mask my_bool (one entry per row of B) that can be computed as follows:

for current_name in A:
    my_bool = (B['friend1'] == current_name) | (B['friend2'] == current_name)
    # ... other computation using my_bool ...

The computation works, but I'm trying to improve on its efficiency. For example, when A has length 15k and B has 50k rows, the computation time is very long.

My intuition is that it's inefficient for the loop to scan through all 50k rows of B for each person in A. Is there a way to vectorize the computation to create, say, a 15k x 50k matrix all_bools in one shot (without a loop), then read off my_bool (as a row of all_bools) later as needed? I could implement this idea in another language, but I'm unable to do it in Python. If this idea is garbage too, please feel free to suggest something else.

CodePudding user response：

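You can use the pd.Series.isin method, which builds a hash table from the list so that each membership check is a fast look-up instead of a scan. Note that this gives a single mask covering every name in A at once, not a separate mask per name.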

my_bool = B['friend1'].isin(A) | B['friend2'].isin(A)
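
If you need a separate mask for each name rather than one combined mask, the same hash look-up idea can be applied by hand. The following is only a minimal sketch, not part of the answer above: it makes a single pass over B to record the row positions where each friend name appears, then builds each mask from that dictionary on demand. The small sample data and the names rows_by_friend and mask_for are made up for illustration.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Small illustrative data (assumption: same column layout as in the question)
A = ['Bob', 'Becky', 'Mark', 'Joe', 'Zeke']
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

# One pass over both friend columns: friend name -> positions of rows listing it
rows_by_friend = defaultdict(list)
for col in ('friend1', 'friend2'):
    for row, friend in enumerate(B[col]):
        rows_by_friend[friend].append(row)


def mask_for(name):
    """Return the boolean mask over the rows of B for one name (hypothetical helper)."""
    mask = np.zeros(len(B), dtype=bool)
    rows = np.asarray(rows_by_friend.get(name, ()), dtype=np.intp)
    mask[rows] = True
    return mask


for current_name in A:
    my_bool = mask_for(current_name)
    # ... other computation using my_bool ...
```

With this layout B is scanned once in total instead of once per name, and only one mask exists in memory at a time.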

CodePudding user response：

You can try this:

import numpy as np
import pandas as pd 

A = np.array(['Bob', 'Becky', 'Mark', 'Joe', 'Zeke'])
B = pd.DataFrame([['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']], columns=['name', 'friend1', 'friend2'])

# resulting shape is (len(A), len(B.friend1))
friend1 = np.equal(A.reshape(-1, 1), B.friend1.values)
friend2 = np.equal(A.reshape(-1, 1), B.friend2.values)

# your final all_bools for later reference
all_bools = friend1 | friend2

# processing one at a time:
for i in range(all_bools.shape[0]):
    my_bool = all_bools[i]
    in_friends = B.loc[my_bool, 'name'].values
    if in_friends.size > 0:
        print(f"My name is {A[i]} and I'm listed as a best friend by {in_friends}")

Because the comparison is done with NumPy broadcasting, there is no explicit Python loop over A. The downside of creating all_bools in one go is memory: at 15k x 50k, each boolean array holds 750 million one-byte entries (roughly 750 MB), and friend1, friend2, and all_bools all exist at the same time.
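
One way to keep the broadcasting approach while limiting memory is to build the matrix a block of rows at a time. This is a rough sketch under my own assumptions rather than a tested solution for your data: iter_friend_masks and chunk_size are hypothetical names, and the chunk size of 1000 is an arbitrary starting point that you would tune.

```python
import numpy as np
import pandas as pd


def iter_friend_masks(names, frame, chunk_size=1000):
    """Yield (name, mask) pairs, holding at most chunk_size rows of the matrix at once."""
    names = np.asarray(names, dtype=object)
    f1 = frame['friend1'].to_numpy(dtype=object)
    f2 = frame['friend2'].to_numpy(dtype=object)
    for start in range(0, len(names), chunk_size):
        # Column vector of names, broadcast against the row vectors f1 and f2
        chunk = names[start:start + chunk_size].reshape(-1, 1)
        block = (chunk == f1) | (chunk == f2)  # shape: (len(chunk), len(frame))
        for name, mask in zip(chunk.ravel(), block):
            yield name, mask


A = ['Bob', 'Becky', 'Mark', 'Joe', 'Zeke']
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

for current_name, my_bool in iter_friend_masks(A, B, chunk_size=2):
    in_friends = B.loc[my_bool, 'name'].to_numpy()
    # ... other computation using my_bool ...
```

Each block has at most chunk_size x len(B) entries, so with 50k rows and a chunk size of 1000 a block is about 50 MB instead of 750 MB.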