I have train with 3756 rows and test with 500 rows, after join I had 798974 rows.
code for join:
test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
Use of drop duplicates is works, but required a lot of time and memory.
CodePudding user response:
Reason is duplicated values of column link_1
in test
and train
, so for each duplicated values get all combinations between:
train = pd.DataFrame({"link_1": [0, 0, 0, 0, 1, 1, 1, 1],
'claps_link_1_mean': range(8)})
test = pd.DataFrame({"link_1": [0, 1, 1, 1]})
df = test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
print (df)
link_1 claps_link_1_mean
0 0 0
0 0 1
0 0 2
0 0 3
1 1 4
1 1 5
1 1 6
1 1 7
2 1 4
2 1 5
2 1 6
2 1 7
3 1 4
3 1 5
3 1 6
3 1 7
If remove duplicates in one of them before join
all working well:
test.join(train.drop_duplicates('link_1').set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')