Match rows based on closest index-CodePudding

I have two different dataframes:

Dataframe1:

data = {'Server Name': ['PhysicalWindows1', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux2'],
        'Chips1': [1, 1, 2, 2], 
        'pCpu Cores': [8, 8, 32, 32],
        'Cpu Clock': [3400, 3400, 2600, 2600]}
  
# Create DataFrame
df = pd.DataFrame(data)

Dataframe 2:

data = {'Chips': [1, 1, 1, 2, 2],
        'Cores': [8, 8, 8, 11, 11],
        'Clock Speed': [3300, 3500, 2900, 900, 100], 
        'Avg Watts Idle': [58.5, 63, 25, 83.8, 65]
}

data = pd.DataFrame(data)

Now I am trying to match these two dataframes based on a direct match between two indices (Chips and Cores) and the closest match with the Clock Speed to get the column 'Avg Watts Idle'. Basically, the first row in dataframe1 has ['PhysicalWindows1', 1, 8, 3400] and it matches to three different rows in dataframe2: [1,8, 3300, 58.5], [1,8,3500, 63], and [1,8,2900, 25] and therefore I would like to do an average of only the first two and not the third. My dataframe would preferably look like:

'Server Name': ['PhysicalWindows1', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux2'],
'Chips1': [1, 1, 2, 2], 
'pCpu Cores': [8, 8, 32, 32],
'Cpu Clock': [3400, 3400, 2600, 2600]
'Avg Watts Idle' : [(58.5 63/2), (58.5 63/2), N/A, N/A]

CodePudding user response：

import pandas as pd
data = {'Chips': [1, 1, 1, 2, 2],
        'Cores': [8, 8, 8, 11, 11],
        'Clock Speed': [3300, 3500, 2900, 900, 100], 
        'Avg Watts Idle': [58.5, 63, 25, 83.8, 65]
}

df2 = pd.DataFrame(data)

data = {'Server Name': ['PhysicalWindows1', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux2'],
        'Chips1': [1, 1, 2, 2], 
        'pCpu Cores': [8, 8, 32, 32],
        'Cpu Clock': ['3400 ', '3400', '2600', '2600']}
  
# Create DataFrame
df1 = pd.DataFrame(data)


merged = df1.merge(df2, left_on=['Chips1','pCpu Cores'], right_on=['Chips','Cores'], how='left')


merged['Clock Speed'] = merged['Cpu Clock'].astype(int)
merged['diff'] = abs(merged['Clock Speed'] - merged['Clock Speed'].astype(int))
# Take average of closest 2
merged['Avg Watts Idle'] = merged.groupby(['Chips1', 'pCpu Cores'])['Avg Watts Idle'].transform(lambda x: (x.loc[merged.loc[x.index, 'diff'].nsmallest(2).index].sum())/2)

# Drop the extra columns and keep only required columns
merged = merged[['Server Name', 'Chips1', 'pCpu Cores', 'Cpu Clock', 'Avg Watts Idle']]
# Drop duplicates in rows to get your output in the Question
merged.drop_duplicates()

You can simply Merge the two dataframes with matched index you want then find the closest match for Clock Speed and drop extra column to get your result.