I am creating a DataFrame with the code below:
import pandas as pd
import statistics

df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})

abc_seg = df1[df1['segment'] == 'abc'].copy()
xyz_seg = df1[df1['segment'] == 'xyz'].copy()

seg_prod = df1[(df1['segment'] == 'abc') & (df1['prod_a_clients'] > 0)]
abc_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])

seg_prod = df1[(df1['segment'] == 'abc') & (df1['prod_b_clients'] > 0)]
abc_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])

seg_prod = df1[(df1['segment'] == 'xyz') & (df1['prod_a_clients'] > 0)]
xyz_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])

seg_prod = df1[(df1['segment'] == 'xyz') & (df1['prod_b_clients'] > 0)]
xyz_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])

segs_combined = [abc_seg, xyz_seg]
df2 = pd.concat(segs_combined, ignore_index=True)
print(df2)
As you can see from the result, I need to calculate a mean for every product and segment combination. I'll be doing this for hundreds of products and segments, so I've tried many different ways of doing it with a loop or a function, and have gotten close with something like the following:
def prod_seg(sg, prd):
    seg_prod = df1[(df1['segment'] == sg) & (df1[prd + '_clients'] > 0)]
    prod_name = prd + '_clients'
    col_name = prd + '_average'
    df_name = sg + '_seg'
    df_name + "['" + prd + '_average' + "']" = statistics.mean(seg_prod[prod_name])
    return
The issue is that I need to create a unique column on every iteration, and the way I'm doing it above obviously isn't working.
Is there any way I can recreate what I did above in a loop or function?
CodePudding user response:
You could use groupby to calculate the mean per group. Also, replace the 0s with NaN so they get skipped by the mean calculation. The script then looks like:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz',
                                'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})

df1.set_index("segment", inplace=True, drop=True)

# 0 means "no clients", so turn it into NaN; mean() skips NaN by default.
df1[df1 == 0] = np.nan

# Collect the column means for each segment.
mean_values = dict()
for seg_key, seg_df in df1.groupby(level=0):
    mean_value = seg_df.mean(numeric_only=True)
    mean_values[seg_key] = mean_value

results = pd.DataFrame.from_dict(mean_values)
print(results)
The result is:
                  abc    xyz
prod_a_clients  14.00  10.00
prod_b_clients  10.25  14.25
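For completeness, the same table can be produced without the explicit loop. This is a minimal sketch that assumes df1 as it is originally constructed in the question (segment still an ordinary column, zeros not yet replaced):

import numpy as np

# Turn 0 into NaN so mean() skips it, then average each clients column per segment.
# Transposing puts products in the rows and segments in the columns, as above.
compact = df1.replace(0, np.nan).groupby('segment').mean().T
print(compact)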
CodePudding user response:
Instead of using a loop, you can derive the same result by first using where on the 0s in the clients columns (which replaces the 0s with NaN), then grouping by the "segment" column and transforming with the "mean" method.
The point of where is that the mean method skips NaN values by default, so by converting the 0s to NaN we make sure they are not considered in the mean.
transform('mean') broadcasts the mean (which is an aggregate value) back so it aligns with the original DataFrame, giving every row its matching mean value.
clients = ['prod_a_clients', 'prod_b_clients']

# Mask the 0s, compute per-segment means aligned to each row, and join them back.
out = (df1.join(df1[['segment']]
                .join(df1[clients].where(df1[clients] > 0))
                .groupby('segment').transform('mean')
                .add_suffix('_mean')))
Output:
   segment  prod_a_clients  prod_b_clients  prod_a_clients_mean  prod_b_clients_mean
0      abc               5              15                 14.0                10.25
1      abc               0               6                 14.0                10.25
2      abc              12               0                 14.0                10.25
3      abc              25              12                 14.0                10.25
4      abc               0               8                 14.0                10.25
5      xyz               2               0                 10.0                14.25
6      xyz               5              17                 10.0                14.25
7      xyz              24               0                 10.0                14.25
8      xyz               0               2                 10.0                14.25
9      xyz               1              23                 10.0                14.25
10     xyz              21              15                 10.0                14.25
11     xyz               7               0                 10.0                14.25
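If there are hundreds of products, the clients list doesn't have to be typed out by hand. As a small sketch (assuming every product column follows the '<product>_clients' naming pattern), it can be derived from the column names, and the result can also be collapsed to one row per segment if the row-level repetition isn't needed:

# Pick up every product column automatically instead of listing them one by one.
clients = [c for c in df1.columns if c.endswith('_clients')]

# One row per segment, keeping only the mean columns produced above.
per_segment = out[['segment'] + [c + '_mean' for c in clients]].drop_duplicates('segment')
print(per_segment)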