I have a larger dataset that is similarly structured to this dataframe (incl. the [ ]):
Day Worker_ID Skills Team_members
0 1 1 [1 3] [1 3]
1 1 2 [2 5] [4 2]
2 1 3 [4 2] [3 1]
3 1 4 [3 3] [2 4]
4 2 1 [2 4] [1 3]
5 2 2 [3 5] [4 2]
6 2 3 [4 3] [3 1]
7 2 4 [2 2] [2 4]
I would like to group my dataframe by the team of the workers so it looks like this (the [ ] are optional]:
Day Team_ID Team_Skills Team_members
0 1 1 [2.5 2.5] [1 3]
1 1 2 [2.5 4] [2 4]
2 2 1 [3 3.5] [1 3]
3 2 2 [2.5 3.5] [2 4]
I would assume the process looks like this:
- Create a .copy() of original dataframe
- Sort the vectors in the team_members-column
- Group by the team-members column & and the day-column
- Delete Worker_ID column
- Create a new Team_ID-column so that every time a new combination of team_members is introduced, a new team number is allocated
- Calculate the mean of skills for each team for that specific day and rename the column
Here is the code, if you want to try it out:
import pandas as pd
data = {'Day': [1, 1, 1, 1, 2, 2, 2, 2],
'Worker_ID': [1, 2, 3, 4, 1, 2, 3, 4],
'Skills': ['[1 3]', '[2 5]', '[4 2]', '[3 3]', '[2 4]', '[3 5]', '[4 3]', '[2 2]'],
'Team_members': ['[1 3]', '[4 2]', '[3 1]', '[2 4]', '[1 3]', '[4 2]', '[3 1]', '[2 4]']}
df = pd.DataFrame(data)
CodePudding user response:
Here is the general logic for your problem. I'm not giving you the exact format as I'm on my phone, but this should give you a good starting point.
import numpy as np
df2 = (df.assign(TS=df['Skills'].str[1:-1].str.split())
.explode('TS')
.assign(TS=lambda d: d['TS'].astype(float))
)
group = df2['Team_members'].apply(lambda r: tuple(set(r[1:-1].split())))
(df2.groupby(['Day', np.tile([0,1], len(df)), group])['TS'].mean()
.groupby(level=['Day', 'Team_members']).apply(list)
.reset_index()
)
Output:
Day Team_members TS
0 1 (1, 3) [2.5, 2.5]
1 1 (4, 2) [2.5, 4.0]
2 2 (1, 3) [3.0, 3.5]
3 2 (4, 2) [2.5, 3.5]
CodePudding user response:
Here is one way. First, convert the team members strings to lists, and then to "remove the order" convert them to frozensets (which are basically immutable sets).
>>> labels = df['Team_members'].str.findall("\d").map(frozenset)
>>> labels
0 (1, 3)
1 (2, 4)
2 (1, 3)
3 (2, 4)
4 (1, 3)
5 (2, 4)
6 (1, 3)
7 (2, 4)
Then group the data by day and then by the previous labels and compute the mean skills of each team
def skills_mean(group_skills):
# worker skills matrix
group_skills = group_skills.str.findall("\d").tolist()
# compute the mean along the columns
mean_skills = np.asarray(group_skills, dtype=int).mean(0)
return str(mean_skills)
>>> res = (
df.groupby(["Day", labels], as_index=False)
.agg({"Team_members": "min",
"Skills": skills_mean})
)
>>> res
Day Team_members Skills
0 1 [1 3] [2.5 2.5]
1 1 [2 4] [2.5 4. ]
2 2 [1 3] [3. 3.5]
3 2 [2 4] [2.5 3.5]
Finally, add the 'Team_ID' column
>>> res['Team_ID'] = res.groupby("Team_members").ngroup().add(1)
>>> res
Day Team_members Skills Team_ID
0 1 [1 3] [2.5 2.5] 1
1 1 [2 4] [2.5 4. ] 2
2 2 [1 3] [3. 3.5] 1
3 2 [2 4] [2.5 3.5] 2