Combine rows and average column if another column is minimum-CodePudding

I have a pandas dataframe:

             Server  Clock 1  Clock 2  Power   diff
0  PhysicalWindows1     3400   3300.0   58.5  100.0
1  PhysicalWindows1     3400   3500.0   63.0  100.0
2  PhysicalWindows1     3400   2900.0   25.0  500.0
3  PhysicalWindows2     3600   3300.0   83.8  300.0
4  PhysicalWindows2     3600   3500.0   65.0  100.0
5  PhysicalWindows2     3600   2900.0   10.0  700.0
6    PhysicalLinux1     2600      NaN    NaN    NaN
7    PhysicalLinux1     2600      NaN    NaN    NaN
8              Test     2700   2700.0   30.0    0.0

Basically, I would like to average the Power for each server but only if the difference is minimum. For example, if you look at the 'PhysicalWindows1' server, I have 3 rows, two have a diff of 100, and one has a diff of 500. Since I have two rows with a diff of 100, I would like to average out my Power of 58.5 and 63.0. For 'PhysicalWindows2', since there is only one row which has the least diff, we return the power for that one row - 65. If NaN, return Nan, and if there is only one match, return the power for that one match.

My resultant dataframe would look like this:

             Server  Clock 1            Power  
0  PhysicalWindows1     3400    (58.5 63.0)/2
1  PhysicalWindows2     3600             65.0
2    PhysicalLinux1     2600              NaN
3              Test     2700             30.0

CodePudding user response：

Use groupby with dropna=False to avoid to remove PhysicalLinux1 and sort=True to sort index level (lowest diff on top) then drop_duplicates to keep only one instance of (Server, Clock 1):

out = (df.groupby(['Server', 'Clock 1', 'diff'], dropna=False, sort=True)['Power']
         .mean().droplevel('diff').reset_index().drop_duplicates(['Server', 'Clock 1']))

# Output
             Server  Clock 1  Power
0    PhysicalLinux1     2600    NaN
1  PhysicalWindows1     3400  60.75
3  PhysicalWindows2     3600  65.00
6              Test     2700  30.00

CodePudding user response：

Here is a possible solution using df.groupby() and pd.merge()

grp_df = df.groupby(['Server', 'diff'])['Power'].mean().reset_index()
grp_df = grp_df.groupby('Server').first().reset_index()
grp_df = grp_df.rename(columns={'diff': 'min_diff', 'Power': 'Power_avg'})

df_out = (pd.merge(df[['Server', 'Clock 1']].drop_duplicates(), grp_df, on='Server', how='left')
            .drop(['min_diff'], axis=1))
print(df_out)

             Server  Clock 1  Power_avg
0  PhysicalWindows1     3400      60.75
1  PhysicalWindows2     3600      65.00
2    PhysicalLinux1     2600        NaN
3              Test     2700      30.00

CodePudding user response：

Use a double groupby, first groupby.transform to mask the non-max Power, then groupby.agg to aggregate

m = df.groupby('Server')['diff'].transform('min').eq(df['diff'])

(df.assign(Power=df['Power'].where(m))
 .groupby('Server', sort=False, as_index=False)
 .agg({'Clock 1': 'first', 'Power': 'mean'})
)

Output:

             Server  Clock 1  Power
0  PhysicalWindows1     3400  60.75
1  PhysicalWindows2     3600  65.00
2    PhysicalLinux1     2600    NaN
3              Test     2700  30.00

CodePudding user response：

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Server": ['PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows2', 
                   'PhysicalWindows2', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux1', 'Test'],
        "Clock 1": [3400, 3400, 3400, 3600, 3600, 3600, 2600, 2600, 2700],
        "Clock 2": [3300.0, 3500.0, 2900.0, 3300.0, 3500.0, 2900.0, np.nan, np.nan, 2700.0],
        "Power": [58.5, 63.0, 25.0, 83.8, 65.0, 10.0, np.nan, np.nan, 30.0],
        "diff": [100.0, 100.0, 500.0, 300.0, 100.0, 700.0, np.nan, np.nan, 0.0]
    }
)

r = (df.groupby(['Server'])
       .apply(lambda d: d[d['diff']==d['diff'].min()])
       .reset_index(drop=True)
       .groupby(['Server'])
       .agg({"Clock 1":'mean', "Power":'mean', "diff":'first'})
       .reset_index()
     )

r = (r.append(df[df['diff'].isnull()]
                           .drop_duplicates()
                           ).drop(['Clock 2', 'diff'], axis=1)
                            .reset_index(drop=True)
                            )

print(r)

#              Server  Clock 1  Power
# 0  PhysicalWindows1   3400.0  60.75
# 1  PhysicalWindows2   3600.0  65.00
# 2              Test   2700.0  30.00
# 0    PhysicalLinux1   2600.0    NaN