How to get the top-most values of each group's sizes?


I have a DataFrame similar to the following:

In [141]: dd = pd.DataFrame({'c1': ['A','A','B','A','B','B','A','C','B','A','D','D','D','C','B','A','C','A','D','B','A','C'],
     ...: 'c2': ['b', 'a', 'b', 'c', 'd', 'a', 'd', 'd', 'a', 'a', 'c', 'a', 'a', 'b', 'b', 'c', 'c', 'a', 'b', 'd', 'b', 'a'],
     ...: 'c3': range(22)})

I'm grouping it by the two columns and calculating each group's size:

In [155]: dd.groupby(['c1','c2']).size()
Out[155]: 
c1  c2
A   a     3
    b     2
    c     2
    d     1
B   a     2
    b     2
    d     2
C   a     1
    b     1
    c     1
    d     1
D   a     2
    b     1
    c     1
dtype: int64

Now I want to sort each group by size and extract the 3 largest values. The desired output would look like the following:

c1  c2
A   a     3
    b     2
    c     2
B   a     2
    b     2
    d     2
C   a     1
    b     1
    c     1
D   a     2
    b     1
    c     1
dtype: int64

But I'm failing to get the required output with the techniques I know:

In [158]: dd.groupby(['c1','c2']).size().head(3)
Out[158]: 
c1  c2
A   a     3
    b     2
    c     2
dtype: int64

In [159]: dd.groupby(['c1','c2']).size().to_frame().nlargest(3, 0)
Out[159]: 
       0
c1 c2   
A  a   3
   b   2
   c   2

Is there a way?

CodePudding user response:

You are initially grouping by c1 and c2, but when you take the "3 top most values", your examples suggest you mean the top 3 within each c1 group.

That would suggest you need a second groupby operation:

dd.groupby(['c1','c2']).size().groupby("c1", group_keys=False).nlargest(3) 

Result:

c1  c2
A   a     3
    b     2
    c     2
B   a     2
    b     2
    d     2
C   a     1
    b     1
    c     1
D   a     2
    b     1
    c     1
dtype: int64
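
The chained call can be unpacked into two steps, which may make the logic easier to follow (a sketch using the question's dd; sizes and top3 are just illustrative names):

sizes = dd.groupby(['c1', 'c2']).size()                         # one count per (c1, c2) pair, MultiIndexed by (c1, c2)
top3 = sizes.groupby(level='c1', group_keys=False).nlargest(3)  # keep the 3 largest counts within each c1
top3

group_keys=False keeps the result on the original (c1, c2) MultiIndex instead of prepending c1 a second time.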

CodePudding user response:

See below for a much faster alternative.

You need a double groupby:

dd.groupby(['c1','c2']).size().groupby(level='c1', group_keys=False).nlargest(3)

output:

c1  c2
A   a     3
    b     2
    c     2
B   a     2
    b     2
    d     2
C   a     1
    b     1
    c     1
D   a     2
    b     1
    c     1
dtype: int64

A much faster alternative uses value_counts followed by groupby.head, though the output comes in a different order:

dd.value_counts(['c1','c2']).groupby(level='c1', group_keys=False).head(3)

Output:

c1  c2
A   a     3
    b     2
    c     2
B   a     2
    b     2
    d     2
D   a     2
C   a     1
    b     1
    c     1
D   b     1
    c     1
dtype: int64
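
If the c1-grouped presentation of the first snippet is preferred, the result can be re-sorted on the c1 index level only; this is a sketch worth verifying on your pandas version, but in practice it keeps the descending counts within each group:

out = dd.value_counts(['c1', 'c2']).groupby(level='c1', group_keys=False).head(3)
out.sort_index(level='c1', sort_remaining=False)  # sort by the c1 level only, leaving the within-group order alone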

The alternative is ~4-5x faster on the provided example, and ~100x faster on a random 100k-row dataframe (4.4 s vs 48 ms).
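
To sanity-check such a comparison in IPython, both expressions can be timed on a synthetic frame (the 100k-row frame and its letter alphabets below are made up for benchmarking, not taken from the question):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({'c1': rng.choice(list('ABCDEFGHIJ'), 100_000),
                    'c2': rng.choice(list('abcdefghij'), 100_000)})

# double-groupby version
%timeit big.groupby(['c1', 'c2']).size().groupby(level='c1', group_keys=False).nlargest(3)
# value_counts version
%timeit big.value_counts(['c1', 'c2']).groupby(level='c1', group_keys=False).head(3)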
