Show top values in multi-indexed dataframe for each group-CodePudding

I have the following dataframe:

Country Sector
AU      Semiconductor
AU      Telecom
AU      Banking
AU      Telecom
AU      Semiconductor
AU      Telecom
SE      Telecom
SE      Semiconductor
SE      Semiconductor
SE      Semiconductor
SE      Banking
SE      Telecom

With following code I have created multi-index with the number of instances for each type of sector by country:

df.groupby(["Country","Sector"]).size()

The result is:

Country Sector
AU      Banking       1
        Semiconductor 2
        Telecom       3
SE      Banking       1
        Semiconductor 3
        Telecom       2

What I want to do is to sort the values by group and only show the top 2 sectors and add the rest of the sectors togheter in "Other":

AU      Telecom       3
        Semiconductor 2
        Other         1
SE      Semiconductor 3
        Telecom       2
        Other         1

CodePudding user response：

You can use rank to

rank = df.groupby('Country')['Sector'].rank(method='dense', ascending=False)
out = (df.assign(Sector=df['Sector'].where(rank < 3, other='Other'))
         .value_counts(['Country', 'Sector'])
         .rename('Count').reset_index()
         .sort_values(['Country', 'Count'], ascending=[True, False], ignore_index=True))

Output

>>> out
  Country         Sector  Count
0      AU        Telecom      3
1      AU  Semiconductor      2
2      AU          Other      1
3      SE  Semiconductor      3
4      SE        Telecom      2
5      SE          Other      1

CodePudding user response：

You can do it step by step

s = df.value_counts().sort_values(ascending=False)
s1 = s.groupby(level = 0).head(2)
s2 = s.drop(s1.index,axis=0)
s2 = s2.groupby(level=0).sum()
s2.index = pd.MultiIndex.from_product([s2.index,['Other']])
out = pd.concat([s1,s2]).sort_index(level=0)
Out[391]: 
Country  Sector       
AU       Other            1
         Semiconductor    2
         Telecom          3
SE       Other            1
         Semiconductor    3
         Telecom          2
dtype: int64