In my actual dataset data is 35 million rows by 20 columns and data2 is 4000 rows by 10 columns. Although this code would work it would take so long that my system would time out. So I am looking for an alternate solution to run faster.
import pandas as pd
data = pd.DataFrame({'variable1':[1,2,3,4,0,6,7], 'variable2':[1,2,3,4,5,6,7], 'variable3':[1,200,3,4,50,6,7], 'variable4':[1,2,3,4,5,6,7]})
data2 = pd.DataFrame({'variable1':[2,0], 'variable2':[2,5], 'variable3':[200,50], 'variable4':[17,20]})
target = []
for i in range(len(data)):
for j in range(len(data2)):
if (data['variable1'].iloc[i] == data2['variable1'].iloc[j]) and (data['variable2'].iloc[i] == data2['variable2'].iloc[j]):
target.append("Yes")
else: target.append("No")
Proper output would be:
[[1,1,1,1,"No"],
[2,2,200,2,"Yes"],
[3,3,3,3,"No"],
[4,4,4,4,"No"],
[0,5,50,5,"Yes"],
[6,6,6,6,"No"],
[7,7,7,7,"No"]]
CodePudding user response:
You want to test too many combinations (35M * 4K is 140 Billion, which is too big for pandas).
You can try with numpy, although it is still a lot of combinations and would require a lot of memory (hundreds of GB):
a1 = data['variable1'].to_numpy()
a2 = data['variable2'].to_numpy()
b1 = data2['variable1'].to_numpy()
b2 = data2['variable2'].to_numpy()
out = ((a1==b1[:,None])&(a2==b2[:,None])).ravel()
output:
array([False, True, False, False, False, False, False, False, False,
False, False, True, False, False])
CodePudding user response:
MultiIndex.isin
c = ['variable1', 'variable2']
data['match'] = data.set_index(c).index.isin(data2.set_index(c).index)
variable1 variable2 variable3 variable4 match
0 1 1 1 1 False
1 2 2 200 2 True
2 3 3 3 3 False
3 4 4 4 4 False
4 0 5 50 5 True
5 6 6 6 6 False
6 7 7 7 7 False
CodePudding user response:
df2=data.merge(data2,
left_on=['variable1', 'variable2'],
right_on=['variable1','variable2'],
suffixes=(None,'_y'),a
how='left'
).drop(columns='variable4_y').rename(columns={'variable3_y':'match'})
df2['match'] = np.where(df2['match'].isnull(), 'No', 'Yes')
df2.values.tolist()
[[1, 1, 1, 1, 'No'],
[2, 2, 200, 2, 'Yes'],
[3, 3, 3, 3, 'No'],
[4, 4, 4, 4, 'No'],
[0, 5, 50, 5, 'Yes'],
[6, 6, 6, 6, 'No'],
[7, 7, 7, 7, 'No']]