I want to merge two dataframe based on shared keys with different values for comparing. But I get multiplied strange rows as a result.
I have two data frames like below.
df1
module_name latency index
0 crc32_pclmul 588 0
1 crc32_pclmul 790 4096
2 crc32_pclmul 604 8192
3 crc32_pclmul 584 12288
4 crc32_pclmul 572 0
5 crc32_pclmul 1012 4096
6 crc32_pclmul 552 8192
7 crc32_pclmul 440 12288
8 crc32_pclmul 588 0
9 crc32_pclmul 644 4096
df2
module_name latency2 index
0 crc32_pclmul 440 0
1 crc32_pclmul 472 4096
2 crc32_pclmul 448 8192
3 crc32_pclmul 456 12288
4 crc32_pclmul 440 0
5 crc32_pclmul 444 4096
6 crc32_pclmul 564 8192
7 crc32_pclmul 476 12288
8 crc32_pclmul 444 0
9 crc32_pclmul 584 4096
I utilized below python code for merge two dataframes
df3 = pd.merge(df1, df2, on=['module_name', 'index'])
I expected results like below.
module_name latecy index latency2
0 crc32_pclmul 588 0 440
1 crc32_pclmul 790 4096 472
2 crc32_pclmul 604 8192 448
3 crc32_pclmul 584 12288 456
4 crc32_pclmul 572 0 440
5 crc32_pclmul 1012 4096 444
6 crc32_pclmul 552 8192 564
7 crc32_pclmul 440 12288 476
8 crc32_pclmul 588 0 444
9 crc32_pclmul 644 4096 584
The purpose of the data is to compare two different results (latency, latency2) for further analysis. But I get strange results like below..
module_name latency index latency2
0 crc32_pclmul 588 0 440
1 crc32_pclmul 588 0 440
2 crc32_pclmul 588 0 444
3 crc32_pclmul 588 0 476
4 crc32_pclmul 588 0 432
5 crc32_pclmul 588 0 452
6 crc32_pclmul 588 0 432
7 crc32_pclmul 588 0 748
8 crc32_pclmul 588 0 448
9 crc32_pclmul 588 0 440
10 crc32_pclmul 572 0 440
11 crc32_pclmul 572 0 440
12 crc32_pclmul 572 0 444
13 crc32_pclmul 572 0 476
14 crc32_pclmul 572 0 432
15 crc32_pclmul 572 0 452
16 crc32_pclmul 572 0 432
17 crc32_pclmul 572 0 748
18 crc32_pclmul 572 0 448
19 crc32_pclmul 572 0 440
...
100 crc32_pclmul 644 4096 584
my data is duplicated due to the multiple probing and I read it all for comparing (For instance, I have 10 latency data per module_name, index)
I don't know how to avoid unwanted row multiplexing
CodePudding user response:
Example
data1 = [['crc32_pclmul', 588, 0], ['crc32_pclmul', 790, 4096], ['crc32_pclmul', 604, 8192],
['crc32_pclmul', 584, 12288], ['crc32_pclmul', 572, 0], ['crc32_pclmul', 1012, 4096],
['crc32_pclmul', 552, 8192], ['crc32_pclmul', 440, 12288], ['crc32_pclmul', 588, 0], ['crc32_pclmul', 644, 4096]]
data2 = [['crc32_pclmul', 440, 0], ['crc32_pclmul', 472, 4096], ['crc32_pclmul', 448, 8192],
['crc32_pclmul', 456, 12288], ['crc32_pclmul', 440, 0], ['crc32_pclmul', 444, 4096],
['crc32_pclmul', 564, 8192], ['crc32_pclmul', 476, 12288], ['crc32_pclmul', 444, 0], ['crc32_pclmul', 584, 4096]]
col1 = ['module_name', 'latency', 'index']
col2 = ['module_name', 'latency2', 'index']
df1 = pd.DataFrame(data1, columns=col1)
df2 = pd.DataFrame(data2, columns=col2)
Code
if you must use merge
, make key for merge:
df1['key'] = df1.groupby(['module_name', 'index']).cumcount()
df2['key'] = df1.groupby(['module_name', 'index']).cumcount()
out = df1.merge(df2, how='left').drop('key', axis=1)
out
module_name latency index latency2
0 crc32_pclmul 588 0 440
1 crc32_pclmul 790 4096 472
2 crc32_pclmul 604 8192 448
3 crc32_pclmul 584 12288 456
4 crc32_pclmul 572 0 440
5 crc32_pclmul 1012 4096 444
6 crc32_pclmul 552 8192 564
7 crc32_pclmul 440 12288 476
8 crc32_pclmul 588 0 444
9 crc32_pclmul 644 4096 584