Suppose that I have 2 objects:
A is a list of names
B is a pandas frame with 3 columns: 'name', 'friend1', 'friend2', which list a person's name and the names of their 2 best friends
For my application, I would like to know: for each person in A, a list of people in B for which the person in A is among the 2 best friends. To be specific, for each person in A, I would like a list my_bool of booleans that can be computed as follows:
for current_name in A:
    my_bool = (B['friend1'] == current_name) | (B['friend2'] == current_name)
    [ ... other computation using my_bool ... ]
The computation works, but I'm trying to improve on its efficiency. For example, when A has length 15k and B has 50k rows, the computation time is very long.
My intuition is that it's inefficient for the loop to scan through all 50k rows of B for each person in A. Is there a way to vectorize the computation to create, say, a 15k x 50k matrix all_bools in 1 shot (without a loop), then read off my_bool (as the rows of all_bools) later as needed? In another language, I can implement this idea, but I'm unable to do it in Python. If this idea is garbage too, please feel free to put forth your suggestion.
CodePudding user response:
You can use the pd.Series.isin method, which uses hash-based lookups instead of comparing each row against every name in the list.
my_bool = B['friend1'].isin(A) | B['friend2'].isin(A)
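One thing worth checking on a small example: the isin expression gives a single mask over the rows of B (True where either friend appears anywhere in A), not one mask per name. If a separate my_bool is needed for each name, one possible sketch is to index B once by friend name and build each mask from that index. The sample data and the names rows_by_friend and mask_for are made up for illustration and are not part of the answer above.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
A = ['Bob', 'Becky', 'Mark', 'Joe', 'Zeke']
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

# One mask over the rows of B: True where either friend appears anywhere in A
any_in_A = B['friend1'].isin(A) | B['friend2'].isin(A)

# Single pass over B: map each friend name to the row positions that list it
rows_by_friend = defaultdict(list)
for col in ('friend1', 'friend2'):
    for row, friend in enumerate(B[col].to_numpy()):
        rows_by_friend[friend].append(row)


def mask_for(name):
    """Build my_bool for one name without rescanning B (hypothetical helper)."""
    my_bool = np.zeros(len(B), dtype=bool)
    idx = np.asarray(rows_by_friend.get(name, []), dtype=np.intp)
    my_bool[idx] = True
    return my_bool


for current_name in A:
    my_bool = mask_for(current_name)
    print(current_name, B.loc[my_bool, 'name'].tolist())
```

With this layout B is scanned once up front, and each per-name mask costs only as much as the number of rows that actually list that name.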
CodePudding user response:
You can try this:
import numpy as np
import pandas as pd

A = np.array(['Bob', 'Becky', 'Mark', 'Joe', 'Zeke'])
B = pd.DataFrame([['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']], columns=['name', 'friend1', 'friend2'])

# broadcasting a (len(A), 1) column against a (len(B),) row;
# the resulting shape is (len(A), len(B.friend1))
friend1 = np.equal(A.reshape(-1, 1), B.friend1.values)
friend2 = np.equal(A.reshape(-1, 1), B.friend2.values)

# your final all_bools for later reference
all_bools = friend1 | friend2

# processing one at a time:
for i in range(all_bools.shape[0]):
    my_bool = all_bools[i]
    in_friends = B.loc[my_bool, 'name'].values
    # check for a non-empty result explicitly rather than calling .any() on strings
    if in_friends.size > 0:
        print(f"My name is {A[i]} and I'm friends with {in_friends}")
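Because the comparison is done with numpy broadcasting, there is no Python-level loop over A, which is usually much faster than comparing one name at a time.
However... the downside to creating the array of all_bools all in one go is that there is a very good chance it will consume a lot of memory to store it. For 15k names and 50k rows, each boolean array has 750 million entries, which is roughly 750 MB at one byte per entry, and friend1, friend2 and all_bools all exist at the same time.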
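If memory turns out to be the limiting factor, one possible compromise is to keep the broadcasting but only build a slice of all_bools at a time. The sketch below is an assumption on my part rather than something I have benchmarked: iter_bool_chunks and chunk_size are made-up names, and the sample data is the same toy data as above. With chunk_size=1000 and 50k rows, each boolean block would have 50 million entries, about 50 MB.

```python
import numpy as np
import pandas as pd


def iter_bool_chunks(A, B, chunk_size=1000):
    """Yield (start, block), where block[i] is my_bool for A[start + i]."""
    # Use object arrays so names are compared as plain Python strings
    names_all = np.asarray(A, dtype=object)
    f1 = B['friend1'].to_numpy(dtype=object)
    f2 = B['friend2'].to_numpy(dtype=object)
    for start in range(0, len(names_all), chunk_size):
        names = names_all[start:start + chunk_size].reshape(-1, 1)
        # block has shape (len(names), len(B)); one block is built per iteration
        yield start, (names == f1) | (names == f2)


# Hypothetical sample data, for illustration only
A = np.array(['Bob', 'Becky', 'Mark', 'Joe', 'Zeke'])
B = pd.DataFrame(
    [['Joe', 'Mark', 'Bob'], ['Becky', 'Joe', 'Bob'], ['Mark', 'Tom', 'Trisha']],
    columns=['name', 'friend1', 'friend2'],
)

for start, block in iter_bool_chunks(A, B, chunk_size=2):
    for offset, my_bool in enumerate(block):
        in_friends = B.loc[my_bool, 'name'].to_numpy()
        if in_friends.size > 0:
            print(A[start + offset], in_friends.tolist())
```

The chunk size trades memory for the number of passes: larger chunks mean fewer iterations but bigger temporary arrays.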