I am creating a DataFrame with the code below:
import pandas as pd
import statistics

df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})

abc_seg = df1[df1['segment'] == 'abc'].copy()
xyz_seg = df1[df1['segment'] == 'xyz'].copy()

seg_prod = df1[(df1['segment'] == 'abc') & (df1['prod_a_clients'] > 0)]
abc_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])

seg_prod = df1[(df1['segment'] == 'abc') & (df1['prod_b_clients'] > 0)]
abc_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])

seg_prod = df1[(df1['segment'] == 'xyz') & (df1['prod_a_clients'] > 0)]
xyz_seg['prod_a_mean'] = statistics.mean(seg_prod['prod_a_clients'])

seg_prod = df1[(df1['segment'] == 'xyz') & (df1['prod_b_clients'] > 0)]
xyz_seg['prod_b_mean'] = statistics.mean(seg_prod['prod_b_clients'])

segs_combined = [abc_seg, xyz_seg]
df2 = pd.concat(segs_combined, ignore_index=True)
print(df2)
As you can see from the result, I need to calculate a mean for every product and segment combination. I'll be doing this for hundreds of products and segments, so I've tried many different ways of doing it with a loop or a function, and have gotten close with something like the following:
def prod_seg(sg, prd):
    seg_prod = df1[(df1['segment'] == sg) & (df1[prd + '_clients'] > 0)]
    prod_name = prd + '_clients'
    col_name = prd + '_average'
    df_name = sg + '_seg'
    df_name + "['" + prd + '_average' + "']" = statistics.mean(seg_prod[prod_name])
    return
The issue is that I need to create a unique column on every iteration, and the way I'm doing it above obviously isn't working.
Is there any way I can recreate what I did above in a loop or function?
CodePudding user response:
You could use groupby to calculate the mean per group. Also, replace the 0s with NaN so they get skipped by the mean calculation. The script then looks like:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'segment': ['abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz',
                                'xyz', 'xyz', 'xyz'],
                    'prod_a_clients': [5, 0, 12, 25, 0, 2, 5, 24, 0, 1, 21, 7],
                    'prod_b_clients': [15, 6, 0, 12, 8, 0, 17, 0, 2, 23, 15, 0]})

df1.set_index("segment", inplace=True, drop=True)

# 0 means "no clients", so turn it into NaN; mean() skips NaN by default.
df1[df1 == 0] = np.nan

# Collect the column means for each segment.
mean_values = dict()
for seg_key, seg_df in df1.groupby(level=0):
    mean_value = seg_df.mean(numeric_only=True)
    mean_values[seg_key] = mean_value

results = pd.DataFrame.from_dict(mean_values)
print(results)
The result is:
                  abc    xyz
prod_a_clients  14.00  10.00
prod_b_clients  10.25  14.25
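For completeness, the same table can be produced without the explicit loop. This is a minimal sketch that assumes df1 as it is originally constructed in the question (segment still an ordinary column, zeros not yet replaced):

import numpy as np

# Turn 0 into NaN so mean() skips it, then average each clients column per segment.
# Transposing puts products in the rows and segments in the columns, as above.
compact = df1.replace(0, np.nan).groupby('segment').mean().T
print(compact)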
CodePudding user response:
Instead of using a loop, you can derive the same result by first using where on the 0s in the clients columns (which replaces the 0s with NaN), then grouping by the "segment" column and transforming with the "mean" method.
The point of where is that the mean method skips NaN values by default, so by converting the 0s to NaN we make sure they are not considered in the mean.
transform('mean') broadcasts the mean (which is an aggregate value) back so it aligns with the original DataFrame, giving every row its matching mean value.
clients = ['prod_a_clients', 'prod_b_clients']

# Mask the 0s, compute per-segment means aligned to each row, and join them back.
out = (df1.join(df1[['segment']]
                .join(df1[clients].where(df1[clients] > 0))
                .groupby('segment').transform('mean')
                .add_suffix('_mean')))
Output:
   segment  prod_a_clients  prod_b_clients  prod_a_clients_mean  prod_b_clients_mean
0      abc               5              15                 14.0                10.25
1      abc               0               6                 14.0                10.25
2      abc              12               0                 14.0                10.25
3      abc              25              12                 14.0                10.25
4      abc               0               8                 14.0                10.25
5      xyz               2               0                 10.0                14.25
6      xyz               5              17                 10.0                14.25
7      xyz              24               0                 10.0                14.25
8      xyz               0               2                 10.0                14.25
9      xyz               1              23                 10.0                14.25
10     xyz              21              15                 10.0                14.25
11     xyz               7               0                 10.0                14.25
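If there are hundreds of products, the clients list doesn't have to be typed out by hand. As a small sketch (assuming every product column follows the '<product>_clients' naming pattern), it can be derived from the column names, and the result can also be collapsed to one row per segment if the row-level repetition isn't needed:

# Pick up every product column automatically instead of listing them one by one.
clients = [c for c in df1.columns if c.endswith('_clients')]

# One row per segment, keeping only the mean columns produced above.
per_segment = out[['segment'] + [c + '_mean' for c in clients]].drop_duplicates('segment')
print(per_segment)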