Efficient element wise comparison between a pandas frame and a list

Time:09-21

Suppose that I have 2 objects:

  • A is a list of names
  • B is a pandas DataFrame with 3 columns ('name', 'friend1', 'friend2') that give each person's name and the names of their 2 best friends

For my application, I would like to know, for each person in A, which people in B count that person among their 2 best friends. To be specific, for each person in A, I would like a list my_bool of booleans (one per row of B) that can be computed as follows:

for current_name in A:
    my_bool = (B['friend1'] == current_name) | (B['friend2'] == current_name)
    # ... other computation using my_bool ...

The computation works, but I'm trying to improve on its efficiency. For example, when A has length 15k and B has 50k rows, the computation time is very long.

My intuition is that it's inefficient for the loop to scan through all 50k rows of B for each person in A. Is there a way to vectorize the computation to create, say, a 15k x 50k matrix all_bools in one shot (without a loop), then read off my_bool (as the rows of all_bools) later as needed? In another language I could implement this idea, but I'm unable to do it in Python. If this idea is garbage too, please feel free to put forth your own suggestion.

CodePudding user response:

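You can use the pd.Series.isin method, which builds a hash table from the list internally, so each membership test takes roughly constant time instead of a scan over A. Note that the result is a single mask over all names in A at once (True where a row's friend is anyone in A), not one mask per name.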

my_bool = B['friend1'].isin(A) | B['friend2'].isin(A)
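If a separate mask per name is still needed, one alternative to scanning B for every name is to index the two friend columns once and look names up afterwards. The following is only a sketch of that idea, not part of the answer above: the sample data is made up for illustration, and rows_by_friend and friend_mask are hypothetical names introduced here.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Small illustrative data set (assumed; substitute your own A and B).
A = ['Bob', 'Becky', 'Mark', 'Joe', 'Zeke']
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

# One pass over both friend columns: friend name -> row positions in B.
rows_by_friend = defaultdict(list)
for col in ('friend1', 'friend2'):
    for pos, friend in enumerate(B[col].to_numpy()):
        rows_by_friend[friend].append(pos)


def friend_mask(name):
    """Boolean mask over the rows of B that list `name` as a friend."""
    mask = np.zeros(len(B), dtype=bool)
    positions = np.array(rows_by_friend.get(name, []), dtype=np.intp)
    mask[positions] = True
    return mask


for current_name in A:
    my_bool = friend_mask(current_name)
    # ... other computation using my_bool ...
```

Building the index touches each friend entry once, and each later lookup only visits the rows that actually mention the name. If the downstream computation can work with row positions directly, rows_by_friend[current_name] may be enough and the mask can be skipped.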

CodePudding user response:

You can try this:

import numpy as np
import pandas as pd 

A = np.array(['Bob', 'Becky', 'Mark', 'Joe', 'Zeke'])
B = pd.DataFrame([['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']], columns=['name', 'friend1', 'friend2'])

# resulting shape is (len(A), len(B.friend1))
friend1 = np.equal(A.reshape(-1, 1), B.friend1.values)
friend2 = np.equal(A.reshape(-1, 1), B.friend2.values)

# your final all_bools for later reference
all_bools = friend1 | friend2

# processing one at a time:
for i in range(all_bools.shape[0]):
    my_bool = all_bools[i]
    in_friends = B.loc[my_bool, 'name'].values
    # in_friends holds names (strings), so test for emptiness via its size
    if in_friends.size > 0:
        print(f"My name is {A[i]} and I'm friends with {in_friends}")

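Because the comparison uses NumPy broadcasting, the loop over names runs inside NumPy rather than in Python, although every one of the len(A) x len(B) comparisons per column is still performed. The downside of creating all_bools in one go is memory: with 15k names and 50k rows, each boolean array has 750 million one-byte entries (about 750 MB), and friend1, friend2 and all_bools all exist at the same time.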
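One way to keep the broadcasting idea while bounding memory is to process A in chunks, so that only a slice of the full matrix exists at any time. This is a sketch under assumptions, not a tuned implementation: iter_bool_chunks and chunk_size are hypothetical names, the data is the small example from above, and the arrays are held as object dtype so that the string comparison behaves the same regardless of how pandas stores the columns.

```python
import numpy as np
import pandas as pd

A = np.array(['Bob', 'Becky', 'Mark', 'Joe', 'Zeke'], dtype=object)
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

# Pull the friend columns out once as plain object arrays.
friend1 = B['friend1'].to_numpy(dtype=object)
friend2 = B['friend2'].to_numpy(dtype=object)


def iter_bool_chunks(names, chunk_size=1000):
    """Yield (start, block); block[i] is the mask for names[start + i]."""
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size].reshape(-1, 1)
        # Broadcasting: (chunk, 1) against (len(B),) gives (chunk, len(B)).
        yield start, (chunk == friend1) | (chunk == friend2)


# A tiny chunk_size is used here only to exercise several chunks.
for start, block in iter_bool_chunks(A, chunk_size=2):
    for offset, my_bool in enumerate(block):
        in_friends = B.loc[my_bool, 'name'].to_numpy()
        if in_friends.size > 0:
            print(f"{A[start + offset]} is listed as a friend by {in_friends}")
```

With chunk_size=1000 and 50k rows in B, each boolean block holds 50 million one-byte entries, about 50 MB, instead of the roughly 750 MB needed for the full 15k x 50k matrix.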
