Home > Back-end >  pandas average across dynamic number of columns
pandas average across dynamic number of columns

Time:11-18

I have a dataframe like as shown below

customer_id   revenue_m7   revenue_m8   revenue_m9  revenue_m10  
   1             1234         1231        1256         1239      
   2             5678         3425        3255         2345      

I would like to do the below

a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)

b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)

So, I tried the below

df['revenue_mean_2m'] = (df['revenue_m10'] df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10'] df['revenue_m9'] df['revenue_m8'] df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # i also tried this but how to do for only two columns (and not all columns)

But if I wish to compute average for past 12 months, then it may not be elegant to write this way. Is there any other better or efficient way to write this? I can just key in number of columns to look back and it can compute the average based on keyed in input

I expect my output to be like as below

customer_id   revenue_m7   revenue_m8   revenue_m9  revenue_m10   revenue_mean_2m   revenue_mean_4m 
           1             1234         1231        1256         1239    1867              1240  
           2             5678         3425        3255         2345    2800           3675.75       

CodePudding user response:

Use filter and slicing:

# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d ')

# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)

Output:

   customer_id  revenue_m7  revenue_m8  revenue_m9  revenue_m10  \
0            1        1234        1231        1256         1239   
1            2        5678        3425        3255         2345   

   revenue_mean_2m  revenue_mean_4m  
0           1247.5          1240.00  
1           2800.0          3675.75  

if column order it not guaranteed

Sort them with natural sorting

# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)

# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d ').sort_index(key=natsort_key, axis=1)

CodePudding user response:

you could try something like this in reference to this post:

n_months = 4  # you could also do this in a loop for all months range(1, 12)

df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:-1].mean(axis=1)
  • Related