pandas merge using multiple keys create unwanted rows-CodePudding

I want to merge two dataframe based on shared keys with different values for comparing. But I get multiplied strange rows as a result.

I have two data frames like below.

df1

     module_name  latency  index
0   crc32_pclmul      588      0
1   crc32_pclmul      790   4096
2   crc32_pclmul      604   8192
3   crc32_pclmul      584  12288
4   crc32_pclmul      572      0
5   crc32_pclmul     1012   4096
6   crc32_pclmul      552   8192
7   crc32_pclmul      440  12288
8   crc32_pclmul      588      0
9   crc32_pclmul      644   4096

df2

     module_name  latency2  index
0   crc32_pclmul       440      0
1   crc32_pclmul       472   4096
2   crc32_pclmul       448   8192
3   crc32_pclmul       456  12288
4   crc32_pclmul       440      0
5   crc32_pclmul       444   4096
6   crc32_pclmul       564   8192
7   crc32_pclmul       476  12288
8   crc32_pclmul       444      0
9   crc32_pclmul       584   4096

I utilized below python code for merge two dataframes

df3 = pd.merge(df1, df2, on=['module_name', 'index'])

I expected results like below.

     module_name  latecy  index    latency2
0   crc32_pclmul      588      0        440
1   crc32_pclmul      790   4096        472
2   crc32_pclmul      604   8192        448
3   crc32_pclmul      584  12288        456
4   crc32_pclmul      572      0        440
5   crc32_pclmul     1012   4096        444
6   crc32_pclmul      552   8192        564
7   crc32_pclmul      440  12288        476
8   crc32_pclmul      588      0        444
9   crc32_pclmul      644   4096        584

The purpose of the data is to compare two different results (latency, latency2) for further analysis. But I get strange results like below..

      module_name  latency  index  latency2
0    crc32_pclmul      588      0       440
1    crc32_pclmul      588      0       440
2    crc32_pclmul      588      0       444
3    crc32_pclmul      588      0       476
4    crc32_pclmul      588      0       432
5    crc32_pclmul      588      0       452
6    crc32_pclmul      588      0       432
7    crc32_pclmul      588      0       748
8    crc32_pclmul      588      0       448
9    crc32_pclmul      588      0       440
10   crc32_pclmul      572      0       440
11   crc32_pclmul      572      0       440
12   crc32_pclmul      572      0       444
13   crc32_pclmul      572      0       476
14   crc32_pclmul      572      0       432
15   crc32_pclmul      572      0       452
16   crc32_pclmul      572      0       432
17   crc32_pclmul      572      0       748
18   crc32_pclmul      572      0       448
19   crc32_pclmul      572      0       440
...
100  crc32_pclmul      644   4096       584

my data is duplicated due to the multiple probing and I read it all for comparing (For instance, I have 10 latency data per module_name, index)

I don't know how to avoid unwanted row multiplexing

CodePudding user response：

Example

data1 = [['crc32_pclmul', 588, 0], ['crc32_pclmul', 790, 4096], ['crc32_pclmul', 604, 8192],
         ['crc32_pclmul', 584, 12288], ['crc32_pclmul', 572, 0], ['crc32_pclmul', 1012, 4096],
         ['crc32_pclmul', 552, 8192], ['crc32_pclmul', 440, 12288], ['crc32_pclmul', 588, 0], ['crc32_pclmul', 644, 4096]]
data2 = [['crc32_pclmul', 440, 0], ['crc32_pclmul', 472, 4096], ['crc32_pclmul', 448, 8192],
         ['crc32_pclmul', 456, 12288], ['crc32_pclmul', 440, 0], ['crc32_pclmul', 444, 4096],
         ['crc32_pclmul', 564, 8192], ['crc32_pclmul', 476, 12288], ['crc32_pclmul', 444, 0], ['crc32_pclmul', 584, 4096]]
col1 = ['module_name', 'latency', 'index']
col2 = ['module_name', 'latency2', 'index']
df1 = pd.DataFrame(data1, columns=col1)
df2 = pd.DataFrame(data2, columns=col2)

Code

if you must use merge, make key for merge:

df1['key'] = df1.groupby(['module_name', 'index']).cumcount()
df2['key'] = df1.groupby(['module_name', 'index']).cumcount()
out = df1.merge(df2, how='left').drop('key', axis=1)

out

    module_name     latency index   latency2
0   crc32_pclmul    588     0       440
1   crc32_pclmul    790     4096    472
2   crc32_pclmul    604     8192    448
3   crc32_pclmul    584     12288   456
4   crc32_pclmul    572     0       440
5   crc32_pclmul    1012    4096    444
6   crc32_pclmul    552     8192    564
7   crc32_pclmul    440     12288   476
8   crc32_pclmul    588     0       444
9   crc32_pclmul    644     4096    584