Home > Blockchain >  Match rows based on closest index
Match rows based on closest index

Time:02-03

I have two different dataframes:

Dataframe1:

data = {'Server Name': ['PhysicalWindows1', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux2'],
        'Chips1': [1, 1, 2, 2], 
        'pCpu Cores': [8, 8, 32, 32],
        'Cpu Clock': [3400, 3400, 2600, 2600]}
  
# Create DataFrame
df = pd.DataFrame(data)

Dataframe 2:

data = {'Chips': [1, 1, 1, 2, 2],
        'Cores': [8, 8, 8, 11, 11],
        'Clock Speed': [3300, 3500, 2900, 900, 100], 
        'Avg Watts Idle': [58.5, 63, 25, 83.8, 65]
}

data = pd.DataFrame(data)

Now I am trying to match these two dataframes based on a direct match between two indices (Chips and Cores) and the closest match with the Clock Speed to get the column 'Avg Watts Idle'. Basically, the first row in dataframe1 has ['PhysicalWindows1', 1, 8, 3400] and it matches to three different rows in dataframe2: [1,8, 3300, 58.5], [1,8,3500, 63], and [1,8,2900, 25] and therefore I would like to do an average of only the first two and not the third. My dataframe would preferably look like:

'Server Name': ['PhysicalWindows1', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux2'],
'Chips1': [1, 1, 2, 2], 
'pCpu Cores': [8, 8, 32, 32],
'Cpu Clock': [3400, 3400, 2600, 2600]
'Avg Watts Idle' : [(58.5 63/2), (58.5 63/2), N/A, N/A]

CodePudding user response:

import pandas as pd
data = {'Chips': [1, 1, 1, 2, 2],
        'Cores': [8, 8, 8, 11, 11],
        'Clock Speed': [3300, 3500, 2900, 900, 100], 
        'Avg Watts Idle': [58.5, 63, 25, 83.8, 65]
}

df2 = pd.DataFrame(data)

data = {'Server Name': ['PhysicalWindows1', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux2'],
        'Chips1': [1, 1, 2, 2], 
        'pCpu Cores': [8, 8, 32, 32],
        'Cpu Clock': ['3400 ', '3400', '2600', '2600']}
  
# Create DataFrame
df1 = pd.DataFrame(data)


merged = df1.merge(df2, left_on=['Chips1','pCpu Cores'], right_on=['Chips','Cores'], how='left')


merged['Clock Speed'] = merged['Cpu Clock'].astype(int)
merged['diff'] = abs(merged['Clock Speed'] - merged['Clock Speed'].astype(int))
# Take average of closest 2
merged['Avg Watts Idle'] = merged.groupby(['Chips1', 'pCpu Cores'])['Avg Watts Idle'].transform(lambda x: (x.loc[merged.loc[x.index, 'diff'].nsmallest(2).index].sum())/2)

# Drop the extra columns and keep only required columns
merged = merged[['Server Name', 'Chips1', 'pCpu Cores', 'Cpu Clock', 'Avg Watts Idle']]
# Drop duplicates in rows to get your output in the Question
merged.drop_duplicates()

You can simply Merge the two dataframes with matched index you want then find the closest match for Clock Speed and drop extra column to get your result.

enter image description here

  • Related