Home > Software engineering >  Sort values with multi-index/ groupby object 'by group' without breaking the index level
Sort values with multi-index/ groupby object 'by group' without breaking the index level

Time:05-06

Is it possible to sort values by the count values of each group's sum. without breaking the index level? Both attempts I commented out would sort but breaks the index level.

#DataFrame
ff = pd.DataFrame([('P1', 17, 'male'),
                   ('P2', 10, 'female'),
                   ('P3', 10, 'male'),
                   ('P4', 19, 'female'),
                   ('P5', 10, 'male'),
                   ('P6', 12, 'male'),
                   ('P7', 12, 'male'),
                   ('P8', 15, 'female'),
                   ('P9', 15, 'female'),
                   ('P10', 10, 'male')],
                  columns=['Name', 'Age', 'Sex'])

# Attempts
(
    ff
    .groupby(['Age', 'Sex'])
    .agg(**{
        'Count': pd.NamedAgg(column="Name", aggfunc='count'),
        'Who': pd.NamedAgg(column="Name", aggfunc=lambda x: ', '.join([i for i in x]))})
#     .sort_values('Count')           <- this breaks the index level
#     .sort_values(['Count', 'Age'])  <- this too breaks the index level
)

Original Data:

Count Who
Age Sex
10 Female 1 p2
male 3 p3,p5,p10
12 male 2 p6,p7
15 female 2 p8,p9
17 male 1 p1
19 female 1 p4

Desired Output: (sort values by the sum of 'Age' group, but keep the grouped index)

Count Who
Age Sex
17 male 1 p1
19 female 1 p4
12 male 2 p6,p7
15 female 2 p8,p9
10 Female 1 p2
male 3 p3,p5,p10

Edit: This is how I finally solve the problem, any more advices are appreciated.

# DataFrame -- I update a bit for testcases.
ff = pd.DataFrame([('P1', 19, 'male'),
                   ('P2', 10, 'female'),
                   ('P3', 10, 'male'),
                   ('P4', 19, 'female'),
                   ('P5', 10, 'male'),
                   ('P6', 12, 'male'),
                   ('P7', 12, 'male'),
                   ('P7', 12, 'male'),
                   ('P7', 12, 'male'),
                   ('P7', 12, 'male'),
                   ('P8', 15, 'female'),
                   ('P9', 15, 'female'),
                   ('P10', 10, 'male')],
                  columns=['Name', 'Age', 'Sex'])

# It works !
(
    ff.groupby(['Age', 'Sex']).agg(**{
    'Count': pd.NamedAgg(column="Name", aggfunc='count'),
    'Who': pd.NamedAgg(column="Name", aggfunc=lambda x: ', '.join([i for i in x]))})
    # Sort by 'Count' and keep the group adding 'tmp'
    .assign(
        tmp=lambda x: x.reset_index().groupby('Age')['Count'].transform('sum').to_numpy())
    .sort_values(['tmp','Age'])
     # drop tmp
    .drop('tmp', axis=1)
)

CodePudding user response:

You can reshape by DataFrame.unstack and sorting index by sum of both Sex values if exist, then reshape back by DataFrame.stack:

df1 = df.unstack()
df1 = df1.sort_index(key=df1.sum(axis=1, numeric_only=True).get).stack().astype(df.dtypes)
print (df1)
            Count          Who
Age Sex                       
17  male        1           P1
19  female      1           P4
12  male        2       P6, P7
15  female      2       P8, P9
10  female      1           P2
    male        3  P3, P5, P10

Another idea is sorting by sum both values with GroupBy.transform:

df['tmp'] = df.groupby('Age')['Count'].transform('sum')

df1 = df.sort_values(['tmp','Age']).drop('tmp', axis=1)
print (df1)
             Count          Who
Age Sex                       
17  male        1           P1
19  female      1           P4
12  male        2       P6, P7
15  female      2       P8, P9
10  female      1           P2
    male        3  P3, P5, P10
    

EDIT: One line solution is:

df = (
    ff
    .groupby(['Age', 'Sex'])
    .agg(**{
        'Count': pd.NamedAgg(column="Name", aggfunc='count'),
        'Who': pd.NamedAgg(column="Name", aggfunc=', '.join)})
    
    .assign(tmp = lambda x: x.groupby('Age')['Count'].transform('sum'))
    .sort_values(['tmp','Age'])
    .drop('tmp', axis=1))
print (df)
            Count          Who
Age Sex                       
17  male        1           P1
19  female      1           P4
12  male        2       P6, P7
15  female      2       P8, P9
10  female      1           P2
    male        3  P3, P5, P10

CodePudding user response:

There you go.

Let's keep the temp variable in data.

data = ff.groupby(['Age', 'Sex']).agg(**{
    'Count': pd.NamedAgg(column="Name", aggfunc='count'),
    'Who': pd.NamedAgg(column="Name", aggfunc=lambda x: ', '.join([i for i in x]))})

You can write a custom function to do what you want to do in each group by apply function.

For example.

data.groupby("Age", group_keys=False).apply(lambda x: x.sort_values("Count", ascending=False))
            Count          Who
Age Sex                       
10  male        3  P3, P5, P10
    female      1           P2
12  male        2       P6, P7
15  female      2       P8, P9
17  male        1           P1
19  female      1           P4

Or, change to ascending order

data.groupby("Age", group_keys=False).apply(lambda x: x.sort_values("Count", ascending=False))
            Count          Who
Age Sex                       
10  female      1           P2
    male        3  P3, P5, P10
12  male        2       P6, P7
15  female      2       P8, P9
17  male        1           P1
19  female      1           P4

Or if you wanna sort by each level of multi-index. you can do such way.

You can sort the index by adding level args in the sort_index function.

For example:

  1. data.sort_index(level=0, ascending=True)

Sort the first index by ascending order.

            Count          Who
Age Sex                       
19  female      1           P4
17  male        1           P1
15  female      2       P8, P9
12  male        2       P6, P7
10  male        3  P3, P5, P10
    female      1           P2
  1. data.sort_index(level=[0,1], ascending=[False, True])

Sort the first index in ascending order and the second index in descending order.

            Count          Who
Age Sex                       
19  female      1           P4
17  male        1           P1
15  female      2       P8, P9
12  male        2       P6, P7
10  female      1           P2
    male        3  P3, P5, P10

By the way.

The breaking index level is not a special result. It is just a display optimizations

For example.

You can create one by yourself like:

pd.DataFrame({"a":[1,2,3,4,5]}, index=pd.MultiIndex.from_arrays([[10,10,20,10,10],['F','M','F','M','F']],names=['A','B']))
      a
A  B   
10 F  1
   M  2
20 F  3
10 M  4
   F  5
  • Related