Assume I have a dataframe with the columns shown below (the actual data has more columns).
Customer  Group1  jan_revenue  feb_revenue  mar_revenue
Sam       Bank A           40           50            0
Wilson    Bank A           60           70           30
Jay       Bank B           10           40           40
Jim       Bank A            0           40           70
Yan       Bank C            0           40           90
Tim       Bank C           10            0           50
I want to calculate the mean for each customer, counting only the non-zero values.
For example, customer Sam has mean (40 + 50)/2 = 45 and Wilson has (60 + 70 + 30)/3 = 53.3333.
Since I have a large number of columns, I chose to use iloc, but my approach includes all the zeros.
df['avg_revenue21'] = df.iloc[:,27:39].mean(axis=1)
Is there a way to compute a conditional mean while using iloc?
Thank you
CodePudding user response:
You can use select_dtypes to get the numeric columns, replace the zeros with NA, then get the mean as usual:
df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
output:
Sam 45.000000
Wilson 53.333333
Jay 30.000000
Jim 55.000000
Yan 65.000000
Tim 30.000000
dtype: float64
As new column:
df['avg_revenue21'] = df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
Customer  Group1  jan_revenue  feb_revenue  mar_revenue  avg_revenue21
Sam       Bank A           40           50            0      45.000000
Wilson    Bank A           60           70           30      53.333333
Jay       Bank B           10           40           40      30.000000
Jim       Bank A            0           40           70      55.000000
Yan       Bank C            0           40           90      65.000000
Tim       Bank C           10            0           50      30.000000
variants:
If the inputs are strings:
df['avg_revenue21'] = df.apply(pd.to_numeric, errors='coerce').replace(0, pd.NA).mean(axis=1)
If you only want to consider a subset:
df['avg_revenue21'] = df.filter(regex='(feb|mar)_').replace(0, pd.NA).mean(axis=1)
or:
df['avg_revenue21'] = df[['feb_revenue', 'mar_revenue']].replace(0, pd.NA).mean(axis=1)
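To stay with iloc as in the question, the same zero-masking idea should work on a positional slice. Below is a minimal sketch, assuming the sample frame from the question, where the revenue columns sit at positions 2 to 4 (the real data would use 27:39 instead). It uses np.nan rather than pd.NA so the sliced columns stay float.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer": ["Sam", "Wilson", "Jay", "Jim", "Yan", "Tim"],
    "Group1": ["Bank A", "Bank A", "Bank B", "Bank A", "Bank C", "Bank C"],
    "jan_revenue": [40, 60, 10, 0, 0, 10],
    "feb_revenue": [50, 70, 40, 40, 40, 0],
    "mar_revenue": [0, 30, 40, 70, 90, 50],
})

# Positions 2:5 are the three revenue columns in this sample
# (assumption: the question's real data would use 27:39 here).
revenue = df.iloc[:, 2:5]

# Turn zeros into NaN; mean() skips NaN by default (skipna=True).
df["avg_revenue21"] = revenue.replace(0, np.nan).mean(axis=1)
print(df)
```

Slicing first also keeps the text columns (Customer, Group1) out of the calculation, so no dtype filtering is needed.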
CodePudding user response:
Use DataFrame.replace with mean:
df['new'] = df.replace(0, np.nan).mean(axis=1)
print (df)
   Customer  Group1  jan_revenue  feb_revenue  mar_revenue        new
0  Sam       Bank A           40           50            0  45.000000
1  Wilson    Bank A           60           70           30  53.333333
2  Jay       Bank B           10           40           40  30.000000
3  Jim       Bank A            0           40           70  55.000000
4  Yan       Bank C            0           40           90  65.000000
5  Tim       Bank C           10            0           50  30.000000
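Or, restricting the mean to numeric columns explicitly (needed in pandas 2.0 and later, where non-numeric columns are no longer dropped silently):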
df['new'] = df.replace(0, np.nan).mean(numeric_only=True, axis=1)
print (df)
   Customer  Group1  jan_revenue  feb_revenue  mar_revenue        new
0  Sam       Bank A           40           50            0  45.000000
1  Wilson    Bank A           60           70           30  53.333333
2  Jay       Bank B           10           40           40  30.000000
3  Jim       Bank A            0           40           70  55.000000
4  Yan       Bank C            0           40           90  65.000000
5  Tim       Bank C           10            0           50  30.000000
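A different way to get the same numbers, not part of the answer above, is to divide each row's sum by its count of non-zero entries. This is only a sketch: it assumes the revenue columns share the "_revenue" suffix (used here just to select them) and that the data has no missing values, since NaN would be counted as non-zero.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer": ["Sam", "Wilson", "Jay", "Jim", "Yan", "Tim"],
    "Group1": ["Bank A", "Bank A", "Bank B", "Bank A", "Bank C", "Bank C"],
    "jan_revenue": [40, 60, 10, 0, 0, 10],
    "feb_revenue": [50, 70, 40, 40, 40, 0],
    "mar_revenue": [0, 30, 40, 70, 90, 50],
})

# Assumption: revenue columns can be picked out by their shared suffix.
revenue = df.filter(like="_revenue")

# Zeros add nothing to the sum, so only the divisor has to change:
# count how many entries in each row are non-zero.
nonzero_count = revenue.ne(0).sum(axis=1)
df["new"] = revenue.sum(axis=1) / nonzero_count
print(df)
```

A row made up entirely of zeros ends up as 0/0, which pandas reports as NaN, the same result the replace-based version gives.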
EDIT: If some columns may not be numeric, use to_numeric with errors='coerce', which converts non-numeric values to missing values:
df['new'] = df.apply(pd.to_numeric, errors='coerce').replace(0, np.nan).mean(axis=1)
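To see what the coercion does, here is a small sketch on a made-up frame (the rows, the "n/a" entry and the column list are hypothetical, not from the question). Revenue is stored as strings, one entry is not a number, and one customer has only zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical data: string-typed revenue, one bad entry, one all-zero row.
df = pd.DataFrame({
    "Customer": ["Sam", "Wilson", "Ann"],
    "jan_revenue": ["40", "60", "0"],
    "feb_revenue": ["50", "n/a", "0"],
    "mar_revenue": ["0", "30", "0"],
})

cols = ["jan_revenue", "feb_revenue", "mar_revenue"]

# errors="coerce" turns anything that cannot be parsed ("n/a") into NaN.
numeric = df[cols].apply(pd.to_numeric, errors="coerce")

# Zeros become NaN too, so both kinds of entries are skipped by mean().
df["new"] = numeric.replace(0, np.nan).mean(axis=1)
print(df)
```

Sam and Wilson both come out at 45.0 here, while Ann gets NaN because nothing is left to average. Coercing only the revenue columns, rather than the whole frame, leaves the Customer column untouched.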