Conditional mean while using iloc pandas

Time:09-28

Assume I have a dataframe with the columns shown below (the actual data has more columns).

Customer   Group1     jan_revenue feb_revenue mar_revenue
  Sam      Bank A          40           50          0
Wilson     Bank A          60           70         30
  Jay      Bank B          10           40         40
  Jim      Bank A           0           40         70
  Yan      Bank C           0           40         90
  Tim      Bank C          10            0         50

I want to calculate the mean for each customer, counting only the non-zero values.

For example, customer Sam has mean (40 + 50)/2 = 45 and Wilson has (60 + 70 + 30)/3 = 53.3333.

Since I have a large number of columns, I chose to use iloc, but my approach includes all the zeros.

df['avg_revenue21'] = df.iloc[:,27:39].mean(axis=1)

Is there a way to compute a conditional mean while using iloc?

Thank you

CodePudding user response:

You can use select_dtypes to get numeric columns, replace the zeros with NA, then get the mean as usual:

df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)

Output:

Sam       45.000000
Wilson    53.333333
Jay       30.000000
Jim       55.000000
Yan       65.000000
Tim       30.000000
dtype: float64

As new column:

df['avg_revenue21'] = df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)

          Group1  jan_revenue  feb_revenue  mar_revenue  avg_revenue21
Customer
Sam       Bank A           40           50            0      45.000000
Wilson    Bank A           60           70           30      53.333333
Jay       Bank B           10           40           40      30.000000
Jim       Bank A            0           40           70      55.000000
Yan       Bank C            0           40           90      65.000000
Tim       Bank C           10            0           50      30.000000
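
For anyone who wants to reproduce this, here is a minimal self-contained sketch built from the sample data in the question. It swaps np.nan in for pd.NA (an assumption on my part, so the columns stay plain float64), and relies on mean skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer": ["Sam", "Wilson", "Jay", "Jim", "Yan", "Tim"],
    "Group1": ["Bank A", "Bank A", "Bank B", "Bank A", "Bank C", "Bank C"],
    "jan_revenue": [40, 60, 10, 0, 0, 10],
    "feb_revenue": [50, 70, 40, 40, 40, 0],
    "mar_revenue": [0, 30, 40, 70, 90, 50],
})

# Keep only the numeric columns and turn zeros into NaN,
# so that mean() ignores them (skipna=True is the default).
revenue = df.select_dtypes("number").replace(0, np.nan)
df["avg_revenue21"] = revenue.mean(axis=1)

print(df[["Customer", "avg_revenue21"]])
```

Note that a row consisting only of zeros would end up with NaN as its mean, since there is nothing left to average.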

Variants:

If the inputs are strings:

df['avg_revenue21'] = df.apply(pd.to_numeric, errors='coerce').replace(0, pd.NA).mean(axis=1)

If you only want to consider a subset:

df['avg_revenue21'] = df.filter(regex='(feb|mar)_').replace(0, pd.NA).mean(axis=1)

or:

df['avg_revenue21'] = df[['feb_revenue', 'mar_revenue']].replace(0, pd.NA).mean(axis=1)
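
Since the question asks about iloc specifically: the same zero-to-NaN replacement should work on a positional slice as well. A minimal sketch, assuming the revenue columns sit at positions 2 to 4 as in the sample data (in the question's real data the slice would be 27:39). The names monthly and alt are only for illustration, and np.nan is used here instead of pd.NA.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer": ["Sam", "Wilson", "Jay", "Jim", "Yan", "Tim"],
    "Group1": ["Bank A", "Bank A", "Bank B", "Bank A", "Bank C", "Bank C"],
    "jan_revenue": [40, 60, 10, 0, 0, 10],
    "feb_revenue": [50, 70, 40, 40, 40, 0],
    "mar_revenue": [0, 30, 40, 70, 90, 50],
})

# Positional selection: all rows, columns 2, 3 and 4 (the slice end is exclusive).
monthly = df.iloc[:, 2:5]

# Replace zeros with NaN, then take the row-wise mean (NaN is skipped by default).
df["avg_revenue21"] = monthly.replace(0, np.nan).mean(axis=1)

# Alternative spelling of the same idea: mask() blanks out cells where the condition holds.
alt = monthly.mask(monthly == 0).mean(axis=1)
```

Both spellings leave the original columns untouched, because replace and mask return new objects.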

CodePudding user response:

Use DataFrame.replace with mean. On pandas versions before 2.0, mean silently skips the non-numeric columns:

df['new'] = df.replace(0, np.nan).mean(axis=1)
print(df)
  Customer  Group1  jan_revenue  feb_revenue  mar_revenue        new
0      Sam  Bank A           40           50            0  45.000000
1   Wilson  Bank A           60           70           30  53.333333
2      Jay  Bank B           10           40           40  30.000000
3      Jim  Bank A            0           40           70  55.000000
4      Yan  Bank C            0           40           90  65.000000
5      Tim  Bank C           10            0           50  30.000000

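Or, excluding the non-numeric columns explicitly with numeric_only=True (needed on pandas 2.0 and later, where mean otherwise raises a TypeError on the string columns):

df['new'] = df.replace(0, np.nan).mean(numeric_only=True, axis=1)
print(df)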
  Customer  Group1  jan_revenue  feb_revenue  mar_revenue        new
0      Sam  Bank A           40           50            0  45.000000
1   Wilson  Bank A           60           70           30  53.333333
2      Jay  Bank B           10           40           40  30.000000
3      Jim  Bank A            0           40           70  55.000000
4      Yan  Bank C            0           40           90  65.000000
5      Tim  Bank C           10            0           50  30.000000

EDIT: If some columns may not be numeric, use to_numeric with errors='coerce' to convert every non-numeric value to a missing value first:

df['new'] = df.apply(pd.to_numeric, errors='coerce').replace(0, np.nan).mean(axis=1)
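
To illustrate the coercion step, here is a small sketch on hypothetical data (not from the question) where the revenue is stored as text and one cell holds a placeholder string. The "n/a" value and the shortened frame are assumptions made purely for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the question's data with revenue stored as strings.
df = pd.DataFrame({
    "Customer": ["Sam", "Wilson", "Jay"],
    "jan_revenue": ["40", "60", "10"],
    "feb_revenue": ["50", "70", "40"],
    "mar_revenue": ["0", "30", "n/a"],
})

# errors="coerce" turns anything that cannot be parsed (names, "n/a") into NaN,
# so the Customer column becomes all NaN and drops out of the mean.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Zeros become NaN as well, then the row-wise mean skips every NaN.
df["new"] = numeric.replace(0, np.nan).mean(axis=1)
```

Be aware that coercion hides bad input: an unparseable cell is simply left out of the average rather than reported.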