Is it possible to sort values by the count values of each group's sum. without breaking the index level? Both attempts I commented out would sort but breaks the index level.
#DataFrame
ff = pd.DataFrame([('P1', 17, 'male'),
('P2', 10, 'female'),
('P3', 10, 'male'),
('P4', 19, 'female'),
('P5', 10, 'male'),
('P6', 12, 'male'),
('P7', 12, 'male'),
('P8', 15, 'female'),
('P9', 15, 'female'),
('P10', 10, 'male')],
columns=['Name', 'Age', 'Sex'])
# Attempts
(
ff
.groupby(['Age', 'Sex'])
.agg(**{
'Count': pd.NamedAgg(column="Name", aggfunc='count'),
'Who': pd.NamedAgg(column="Name", aggfunc=lambda x: ', '.join([i for i in x]))})
# .sort_values('Count') <- this breaks the index level
# .sort_values(['Count', 'Age']) <- this too breaks the index level
)
Original Data:
Count | Who | ||
---|---|---|---|
Age | Sex | ||
10 | Female | 1 | p2 |
male | 3 | p3,p5,p10 | |
12 | male | 2 | p6,p7 |
15 | female | 2 | p8,p9 |
17 | male | 1 | p1 |
19 | female | 1 | p4 |
Desired Output: (sort values by the sum of 'Age' group, but keep the grouped index)
Count | Who | ||
---|---|---|---|
Age | Sex | ||
17 | male | 1 | p1 |
19 | female | 1 | p4 |
12 | male | 2 | p6,p7 |
15 | female | 2 | p8,p9 |
10 | Female | 1 | p2 |
male | 3 | p3,p5,p10 |
Edit: This is how I finally solve the problem, any more advices are appreciated.
# DataFrame -- I update a bit for testcases.
ff = pd.DataFrame([('P1', 19, 'male'),
('P2', 10, 'female'),
('P3', 10, 'male'),
('P4', 19, 'female'),
('P5', 10, 'male'),
('P6', 12, 'male'),
('P7', 12, 'male'),
('P7', 12, 'male'),
('P7', 12, 'male'),
('P7', 12, 'male'),
('P8', 15, 'female'),
('P9', 15, 'female'),
('P10', 10, 'male')],
columns=['Name', 'Age', 'Sex'])
# It works !
(
ff.groupby(['Age', 'Sex']).agg(**{
'Count': pd.NamedAgg(column="Name", aggfunc='count'),
'Who': pd.NamedAgg(column="Name", aggfunc=lambda x: ', '.join([i for i in x]))})
# Sort by 'Count' and keep the group adding 'tmp'
.assign(
tmp=lambda x: x.reset_index().groupby('Age')['Count'].transform('sum').to_numpy())
.sort_values(['tmp','Age'])
# drop tmp
.drop('tmp', axis=1)
)
CodePudding user response:
You can reshape by DataFrame.unstack
and sorting index by sum of both Sex
values if exist, then reshape back by DataFrame.stack
:
df1 = df.unstack()
df1 = df1.sort_index(key=df1.sum(axis=1, numeric_only=True).get).stack().astype(df.dtypes)
print (df1)
Count Who
Age Sex
17 male 1 P1
19 female 1 P4
12 male 2 P6, P7
15 female 2 P8, P9
10 female 1 P2
male 3 P3, P5, P10
Another idea is sorting by sum both values with GroupBy.transform
:
df['tmp'] = df.groupby('Age')['Count'].transform('sum')
df1 = df.sort_values(['tmp','Age']).drop('tmp', axis=1)
print (df1)
Count Who
Age Sex
17 male 1 P1
19 female 1 P4
12 male 2 P6, P7
15 female 2 P8, P9
10 female 1 P2
male 3 P3, P5, P10
EDIT: One line solution is:
df = (
ff
.groupby(['Age', 'Sex'])
.agg(**{
'Count': pd.NamedAgg(column="Name", aggfunc='count'),
'Who': pd.NamedAgg(column="Name", aggfunc=', '.join)})
.assign(tmp = lambda x: x.groupby('Age')['Count'].transform('sum'))
.sort_values(['tmp','Age'])
.drop('tmp', axis=1))
print (df)
Count Who
Age Sex
17 male 1 P1
19 female 1 P4
12 male 2 P6, P7
15 female 2 P8, P9
10 female 1 P2
male 3 P3, P5, P10
CodePudding user response:
There you go.
Let's keep the temp variable in data
.
data = ff.groupby(['Age', 'Sex']).agg(**{
'Count': pd.NamedAgg(column="Name", aggfunc='count'),
'Who': pd.NamedAgg(column="Name", aggfunc=lambda x: ', '.join([i for i in x]))})
You can write a custom function to do what you want to do in each group by apply
function.
For example.
data.groupby("Age", group_keys=False).apply(lambda x: x.sort_values("Count", ascending=False))
Count Who
Age Sex
10 male 3 P3, P5, P10
female 1 P2
12 male 2 P6, P7
15 female 2 P8, P9
17 male 1 P1
19 female 1 P4
Or, change to ascending order
data.groupby("Age", group_keys=False).apply(lambda x: x.sort_values("Count", ascending=False))
Count Who
Age Sex
10 female 1 P2
male 3 P3, P5, P10
12 male 2 P6, P7
15 female 2 P8, P9
17 male 1 P1
19 female 1 P4
Or if you wanna sort by each level of multi-index. you can do such way.
You can sort the index by adding level
args in the sort_index
function.
For example:
data.sort_index(level=0, ascending=True)
Sort the first index by ascending order.
Count Who
Age Sex
19 female 1 P4
17 male 1 P1
15 female 2 P8, P9
12 male 2 P6, P7
10 male 3 P3, P5, P10
female 1 P2
data.sort_index(level=[0,1], ascending=[False, True])
Sort the first index in ascending order and the second index in descending order.
Count Who
Age Sex
19 female 1 P4
17 male 1 P1
15 female 2 P8, P9
12 male 2 P6, P7
10 female 1 P2
male 3 P3, P5, P10
By the way.
The breaking index level
is not a special result. It is just a display optimizations
For example.
You can create one by yourself like:
pd.DataFrame({"a":[1,2,3,4,5]}, index=pd.MultiIndex.from_arrays([[10,10,20,10,10],['F','M','F','M','F']],names=['A','B']))
a
A B
10 F 1
M 2
20 F 3
10 M 4
F 5