How do I group my dataframe by an unsorted list?-CodePudding

I have a larger dataset that is similarly structured to this dataframe (incl. the [ ]):

   Day  Worker_ID Skills Team_members
0    1          1  [1 3]        [1 3]
1    1          2  [2 5]        [4 2]
2    1          3  [4 2]        [3 1]
3    1          4  [3 3]        [2 4]
4    2          1  [2 4]        [1 3]
5    2          2  [3 5]        [4 2]
6    2          3  [4 3]        [3 1]
7    2          4  [2 2]        [2 4]

I would like to group my dataframe by the team of the workers so it looks like this (the [ ] are optional]:

   Day  Team_ID Team_Skills Team_members
0    1       1  [2.5 2.5]        [1 3]
1    1       2    [2.5 4]        [2 4]
2    2       1    [3 3.5]        [1 3]
3    2       2  [2.5 3.5]        [2 4]

I would assume the process looks like this:

Create a .copy() of original dataframe
Sort the vectors in the team_members-column
Group by the team-members column & and the day-column
Delete Worker_ID column
Create a new Team_ID-column so that every time a new combination of team_members is introduced, a new team number is allocated
Calculate the mean of skills for each team for that specific day and rename the column

Here is the code, if you want to try it out:

import pandas as pd

data = {'Day': [1, 1, 1, 1, 2, 2, 2, 2],
        'Worker_ID': [1, 2, 3, 4, 1, 2, 3, 4],
        'Skills': ['[1 3]', '[2 5]', '[4 2]', '[3 3]', '[2 4]', '[3 5]', '[4 3]', '[2 2]'],
        'Team_members': ['[1 3]', '[4 2]', '[3 1]', '[2 4]', '[1 3]', '[4 2]', '[3 1]', '[2 4]']}

df = pd.DataFrame(data)

CodePudding user response：

Here is the general logic for your problem. I'm not giving you the exact format as I'm on my phone, but this should give you a good starting point.

import numpy as np

df2 = (df.assign(TS=df['Skills'].str[1:-1].str.split())
         .explode('TS')
         .assign(TS=lambda d: d['TS'].astype(float))
       )

group = df2['Team_members'].apply(lambda r: tuple(set(r[1:-1].split())))

(df2.groupby(['Day', np.tile([0,1], len(df)), group])['TS'].mean()         
    .groupby(level=['Day', 'Team_members']).apply(list)
    .reset_index()
 )

Output:

   Day Team_members          TS
0    1       (1, 3)  [2.5, 2.5]
1    1       (4, 2)  [2.5, 4.0]
2    2       (1, 3)  [3.0, 3.5]
3    2       (4, 2)  [2.5, 3.5]

CodePudding user response：

Here is one way. First, convert the team members strings to lists, and then to "remove the order" convert them to frozensets (which are basically immutable sets).

>>> labels = df['Team_members'].str.findall("\d").map(frozenset) 
>>> labels 

0    (1, 3)
1    (2, 4)
2    (1, 3)
3    (2, 4)
4    (1, 3)
5    (2, 4)
6    (1, 3)
7    (2, 4)

Then group the data by day and then by the previous labels and compute the mean skills of each team


def skills_mean(group_skills):
    # worker skills matrix 
    group_skills = group_skills.str.findall("\d").tolist()
    # compute the mean along the columns
    mean_skills = np.asarray(group_skills, dtype=int).mean(0)
    return str(mean_skills)

>>> res = (
      df.groupby(["Day", labels], as_index=False)
        .agg({"Team_members": "min", 
              "Skills": skills_mean})
) 
>>> res

   Day Team_members     Skills
0    1        [1 3]  [2.5 2.5]
1    1        [2 4]  [2.5 4. ]
2    2        [1 3]  [3.  3.5]
3    2        [2 4]  [2.5 3.5]

Finally, add the 'Team_ID' column

>>> res['Team_ID'] = res.groupby("Team_members").ngroup().add(1)
>>> res

   Day Team_members     Skills  Team_ID
0    1        [1 3]  [2.5 2.5]        1
1    1        [2 4]  [2.5 4. ]        2
2    2        [1 3]  [3.  3.5]        1
3    2        [2 4]  [2.5 3.5]        2