Home > Software engineering >  Window function equivalent with filter clause in Python pandas
Window function equivalent with filter clause in Python pandas

Time:02-02

I am trying to figure out a problem and am running a test example on a dummy dataset - built here

import pandas as pd

data = [['tom', 30, 'sales', 5], ['nick', 35, 'sales', 8], ['juli', 24, 'marketing', 4], ['franz', 40, 'marketing', 6], ['jon', 50, 'marketing', 6], ['jeremie', 60, 'marketing', 6]]
  
df = pd.DataFrame(data, columns=['Name', 'Age', 'Department', 'Tenure'])

For each row, I want to find the mean age of everyone else in the department who is older than the row in question, for example Tom (30) in sales, should return the mean of his age and Nick, who is older, so 32.5 as the mean age, but for Nick it should return 35 as Tom in his department is younger than him. The code below achieves that - but I am looking for a quicker more efficient way?!

#Dynamically get mean, where age is greater than the line in question - almost definitely a better
#way of doing this though

def sumWindow(group):

    x = group['Age'].mean()
    group['Mean Dept Age'] = x
    return group

Name = []
Age = []
Department = []
Tenure = []
MeanDeptAge = []

for index, row in df.iterrows():
    
    
    n = row['Name']
    a = row['Age']
    df_temp = df[df['Age'] >= a]
    df_present = df_temp.groupby(df['Department']).apply(sumWindow)
    df_present['Relevant Name'] = n
    df_final = df_present[df_present['Name'] == df_present['Relevant Name']]
    Name.append(df_final.iloc[0,0])
    Age.append(df_final.iloc[0,1])
    Department.append(df_final.iloc[0,2])
    Tenure.append(df_final.iloc[0,3])
    MeanDeptAge.append(df_final.iloc[0,4])

del df_final    
    
df_final = pd.DataFrame({'Name': Name, 
                         'Age': Age, 
                         'Department': Department, 
                         'Tenure': Tenure, 
                         'Mean Department Age - Greater Than Emp Age': MeanDeptAge, 
                        })

df_final

Thanks!

I have tried lots of different solutions filtering within the groupby clause etc

CodePudding user response:

Use a grouped expanding.mean on the DataFrame sorted in descending order or Age:

df['Mean Department Age - Greater Than Emp Age'] = (df
   .sort_values(by='Age', ascending=False)
   .groupby('Department')['Age']
   .expanding().mean()
   .droplevel(0)
)

NB. this would handle potential duplicated ages based on order, you should define how you want to proceed if this happens in your real use case.

Output:

      Name  Age Department  Tenure  Mean Department Age - Greater Than Emp Age
0      tom   30      sales       5                                        32.5
1     nick   35      sales       8                                        35.0
2     juli   24  marketing       4                                        43.5
3    franz   40  marketing       6                                        50.0
4      jon   50  marketing       6                                        55.0
5  jeremie   60  marketing       6                                        60.0

CodePudding user response:

def function1(dd:pd.DataFrame):
    dd1=dd.sort_values("Age",ascending=False).Age.expanding().mean()
    return dd1.rename("Mean Department Age - Greater Than Emp Age")

df.join(df.groupby('Department').apply(function1).droplevel(0))

out

     Name  Age Department  Tenure  Mean Department Age - Greater Than Emp Age
0      tom   30      sales       5                                        32.5
1     nick   35      sales       8                                        35.0
2     juli   24  marketing       4                                        43.5
3    franz   40  marketing       6                                        50.0
4      jon   50  marketing       6                                        55.0
5  jeremie   60  marketing       6                                        60.0
  • Related