Home > Software engineering >  Pandas - Why there are duplicates after join?
Pandas - Why there are duplicates after join?

Time:04-19

I have train with 3756 rows and test with 500 rows, after join I had 798974 rows.

code for join:

test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')

Use of drop duplicates is works, but required a lot of time and memory.

CodePudding user response:

Reason is duplicated values of column link_1 in test and train, so for each duplicated values get all combinations between:

train = pd.DataFrame({"link_1": [0, 0, 0, 0, 1, 1, 1, 1],
                      'claps_link_1_mean': range(8)})
test = pd.DataFrame({"link_1": [0, 1, 1, 1]})

df = test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
print (df)
   link_1  claps_link_1_mean
0       0                  0
0       0                  1
0       0                  2
0       0                  3
1       1                  4
1       1                  5
1       1                  6
1       1                  7
2       1                  4
2       1                  5
2       1                  6
2       1                  7
3       1                  4
3       1                  5
3       1                  6
3       1                  7

If remove duplicates in one of them before join all working well:

test.join(train.drop_duplicates('link_1').set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
  • Related