I am trying to figure out a problem and am running a test example on a dummy dataset - built here
import pandas as pd
data = [['tom', 30, 'sales', 5], ['nick', 35, 'sales', 8], ['juli', 24, 'marketing', 4], ['franz', 40, 'marketing', 6], ['jon', 50, 'marketing', 6], ['jeremie', 60, 'marketing', 6]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Department', 'Tenure'])
For each row, I want to find the mean age of everyone else in the department who is older than the row in question, for example Tom (30) in sales, should return the mean of his age and Nick, who is older, so 32.5 as the mean age, but for Nick it should return 35 as Tom in his department is younger than him. The code below achieves that - but I am looking for a quicker more efficient way?!
#Dynamically get mean, where age is greater than the line in question - almost definitely a better
#way of doing this though
def sumWindow(group):
x = group['Age'].mean()
group['Mean Dept Age'] = x
return group
Name = []
Age = []
Department = []
Tenure = []
MeanDeptAge = []
for index, row in df.iterrows():
n = row['Name']
a = row['Age']
df_temp = df[df['Age'] >= a]
df_present = df_temp.groupby(df['Department']).apply(sumWindow)
df_present['Relevant Name'] = n
df_final = df_present[df_present['Name'] == df_present['Relevant Name']]
Name.append(df_final.iloc[0,0])
Age.append(df_final.iloc[0,1])
Department.append(df_final.iloc[0,2])
Tenure.append(df_final.iloc[0,3])
MeanDeptAge.append(df_final.iloc[0,4])
del df_final
df_final = pd.DataFrame({'Name': Name,
'Age': Age,
'Department': Department,
'Tenure': Tenure,
'Mean Department Age - Greater Than Emp Age': MeanDeptAge,
})
df_final
Thanks!
I have tried lots of different solutions filtering within the groupby clause etc
CodePudding user response:
Use a grouped expanding.mean
on the DataFrame sorted in descending order or Age:
df['Mean Department Age - Greater Than Emp Age'] = (df
.sort_values(by='Age', ascending=False)
.groupby('Department')['Age']
.expanding().mean()
.droplevel(0)
)
NB. this would handle potential duplicated ages based on order, you should define how you want to proceed if this happens in your real use case.
Output:
Name Age Department Tenure Mean Department Age - Greater Than Emp Age
0 tom 30 sales 5 32.5
1 nick 35 sales 8 35.0
2 juli 24 marketing 4 43.5
3 franz 40 marketing 6 50.0
4 jon 50 marketing 6 55.0
5 jeremie 60 marketing 6 60.0
CodePudding user response:
def function1(dd:pd.DataFrame):
dd1=dd.sort_values("Age",ascending=False).Age.expanding().mean()
return dd1.rename("Mean Department Age - Greater Than Emp Age")
df.join(df.groupby('Department').apply(function1).droplevel(0))
out
Name Age Department Tenure Mean Department Age - Greater Than Emp Age
0 tom 30 sales 5 32.5
1 nick 35 sales 8 35.0
2 juli 24 marketing 4 43.5
3 franz 40 marketing 6 50.0
4 jon 50 marketing 6 55.0
5 jeremie 60 marketing 6 60.0