Home > database >  Calculating mean for multiple columns based on conditions in multiple columns
Calculating mean for multiple columns based on conditions in multiple columns

Time:06-18

I have a dataframe with more approximately 200 rows. The dataframe looks like below

Rule A Rule C Rule N Rule A.1 Rule C.1 Rule N.1
4 4 4 1 1 1
5 5 5 1 1 1
6 6 6 0 1 1

I am trying to calculate averages for each rule(A,C...N) when values in Rule A.1/C.1...N.1 is 1. The output would look like:-

Rule Average
A 4.5
C 5
............. ......
N 5

CodePudding user response:

here is one way to accomplish it

The solution is based on the provided sample, i.e. 6 columns. the number 3 here would need to expand/shrink based on the number of columns you'll have for value and its assumed that there will same number of columns for the rule

pd.DataFrame(np.multiply(
                df.iloc[:,3:].to_numpy(),
                df.iloc[:,0:3].to_numpy()),
             columns=df.columns[:3]).replace(0,np.nan).mean()

       average
Rule A     4.5
Rule C     5.0
Rule N     5.0

CodePudding user response:

Here is another solution. Albeit, not the most elegant, but it works.

I am assuming that the dataframe "df" here is the sample dataframe that you provided.

import pandas as pd
import numpy as np

df = pd.DataFrame(data=[[4,1,4,1,4,1], [5,1,5,1,5,1], [6,0,6,1,6,1]], 
                  columns=['Rule A', 'Rule A.1', 'Rule C', 'Rule C.1', 'Rule N', 'Rule N.1'])

for col in df.columns:
    if ".1" in col:
        df[col] = df[col].replace(1, True)
        df[col] = df[col].replace(0, False)

df_transposed = df.transpose().rename_axis("Rule").reset_index()

num_rules = len([r for r in df.columns if "Rule" in r])

rules = []
averages = []
for r in range(num_rules):
    if r%2 == 0:
        rule = df_transposed.iloc[r, 0]
        mask = df_transposed.iloc[r 1, 1:]
        filtered_vals = df_transposed.iloc[r, 1:].loc[mask]
        avg = np.mean(filtered_vals.values)
        rules.append(rule)
        averages.append(avg)

df_new = pd.DataFrame(data=zip(rules, averages), columns=['Rules', 'Average'])

df_new

The other assumption in this solution is that the columns must be ordered in the way you suggested in your post, i.e. A, A.1, C, C.1 etc. (alternating between the rule and the boolean flag). This solution will work for any number of "Rules" as long as they follow this alternating column convention.

  • Related