How to create a list based on column matches within 2 data frames? Python-CodePudding

In my actual dataset data is 35 million rows by 20 columns and data2 is 4000 rows by 10 columns. Although this code would work it would take so long that my system would time out. So I am looking for an alternate solution to run faster.

import pandas as pd

data = pd.DataFrame({'variable1':[1,2,3,4,0,6,7], 'variable2':[1,2,3,4,5,6,7], 'variable3':[1,200,3,4,50,6,7], 'variable4':[1,2,3,4,5,6,7]})

data2 = pd.DataFrame({'variable1':[2,0], 'variable2':[2,5], 'variable3':[200,50], 'variable4':[17,20]})


target = []

for i in range(len(data)):
    for j in range(len(data2)):
        if (data['variable1'].iloc[i] == data2['variable1'].iloc[j]) and (data['variable2'].iloc[i] == data2['variable2'].iloc[j]):
            target.append("Yes")
        else: target.append("No")

Proper output would be:
[[1,1,1,1,"No"],
[2,2,200,2,"Yes"],
[3,3,3,3,"No"],
[4,4,4,4,"No"],
[0,5,50,5,"Yes"],
[6,6,6,6,"No"],
[7,7,7,7,"No"]]

CodePudding user response：

You want to test too many combinations (35M * 4K is 140 Billion, which is too big for pandas).

You can try with numpy, although it is still a lot of combinations and would require a lot of memory (hundreds of GB):

a1 = data['variable1'].to_numpy()
a2 = data['variable2'].to_numpy()

b1 = data2['variable1'].to_numpy()
b2 = data2['variable2'].to_numpy()

out = ((a1==b1[:,None])&(a2==b2[:,None])).ravel()

output:

array([False,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False])

CodePudding user response：

`MultiIndex.isin`

c = ['variable1', 'variable2']
data['match'] = data.set_index(c).index.isin(data2.set_index(c).index)

   variable1  variable2  variable3  variable4  match
0          1          1          1          1  False
1          2          2        200          2   True
2          3          3          3          3  False
3          4          4          4          4  False
4          0          5         50          5   True
5          6          6          6          6  False
6          7          7          7          7  False

CodePudding user response：

df2=data.merge(data2, 
           left_on=['variable1', 'variable2'],
           right_on=['variable1','variable2'],
           suffixes=(None,'_y'),a
           how='left'
          ).drop(columns='variable4_y').rename(columns={'variable3_y':'match'})
df2['match'] = np.where(df2['match'].isnull(), 'No', 'Yes')
df2.values.tolist()

[[1, 1, 1, 1, 'No'],
 [2, 2, 200, 2, 'Yes'],
 [3, 3, 3, 3, 'No'],
 [4, 4, 4, 4, 'No'],
 [0, 5, 50, 5, 'Yes'],
 [6, 6, 6, 6, 'No'],
 [7, 7, 7, 7, 'No']]