I have DataFrame similar to the following:
In [141]: dd = pd.DataFrame({'c1': ['A','A','B','A','B','B','A','C','B','A','D','D','D','C','B','A','C','A','D','B','A','C'],
...: 'c2': ['b', 'a', 'b', 'c', 'd', 'a', 'd', 'd', 'a', 'a', 'c', 'a', 'a', 'b', 'b', 'c', 'c', 'a', 'b', 'd', 'b', 'a'],
...: 'c3': range(22)})
I'm grouping it by the two columns and calculating each group sizes:
In [155]: dd.groupby(['c1','c2']).size()
Out[155]:
c1 c2
A a 3
b 2
c 2
d 1
B a 2
b 2
d 2
C a 1
b 1
c 1
d 1
D a 2
b 1
c 1
dtype: int64
Now I want to sort each group by the size and extract 3 top most values. The desired output would look like the following:
c1 c2
A a 3
b 2
c 2
B a 2
b 2
d 2
C a 1
b 1
c 1
D a 2
b 1
c 1
dtype: int64
But I'm failing to get the required output by using some known techniques:
In [158]: dd.groupby(['c1','c2']).size().head(3)
Out[158]:
c1 c2
A a 3
b 2
c 2
dtype: int64
In [159]: dd.groupby(['c1','c2']).size().to_frame().nlargest(3, 0)
Out[159]:
0
c1 c2
A a 3
b 2
c 2
Is there a way?
CodePudding user response:
Initially, you are grouping by c1
and c2
, but when taking the "3 top most values", based on your examples, I think you mean top 3 within each group of c1
.
That would suggest you need a second groupby operation:
dd.groupby(['c1','c2']).size().groupby("c1", group_keys=False).nlargest(3)
Result:
c1 c2
A a 3
b 2
c 2
B a 2
b 2
d 2
C a 1
b 1
c 1
D a 2
b 1
c 1
dtype: int64
CodePudding user response:
See below for a much faster alternative.
You need a double groupby:
dd.groupby(['c1','c2']).size().groupby(level='c1', group_keys=False).nlargest(3)
output:
c1 c2
A a 3
b 2
c 2
B a 2
b 2
d 2
C a 1
b 1
c 1
D a 2
b 1
c 1
dtype: int64
Much faster alternative with value_counts
groupby.head
, with a different order of the output:
dd.value_counts(['c1','c2']).groupby(level='c1', group_keys=False).head(3)
Output:
c1 c2
A a 3
b 2
c 2
B a 2
b 2
d 2
D a 2
C a 1
b 1
c 1
D b 1
c 1
dtype: int64
The alternative is ~4-5x faster on the provided example, and ~100x faster on a random 100k rows dataframe (4.4s vs 48ms).