How to get the top n values in a pandas DataFrame if it has repeated values

I have a pandas DataFrame, say:

x	y	z
1	a	x
1	b	y
1	c	z
2	a	x
2	b	x
3	a	y
4	a	z

If I want the top 2 values by x, I mean every row whose x is one of the top 2 distinct values in the x column, which gives:

x	y	z
1	a	x
1	b	y
1	c	z
2	a	x
2	b	x

If I want the top 2 values by y, I mean every row whose y is one of the top 2 distinct values in the y column, which gives:

x	y	z
1	a	x
1	b	y
2	a	x
2	b	x
3	a	y
4	a	z

How can I achieve this?

CodePudding user response：

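You can use value_counts to find the 2 most frequent values in the column and isin to keep the rows that contain them: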

>>> df[df['x'].isin(df['x'].value_counts().head(2).index)]
   x  y  z
0  1  a  x
1  1  b  y
2  1  c  z
3  2  a  x
4  2  b  x

>>> df[df['y'].isin(df['y'].value_counts().head(2).index)]
   x  y  z
0  1  a  x
1  1  b  y
3  2  a  x
4  2  b  x
5  3  a  y
6  4  a  z
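
If you need this for more than one column, the same idea can be wrapped in a small helper. This is only a sketch: the name top_n_by_frequency is made up for illustration, and it assumes "top" means "most frequent". value_counts sorts by count in descending order, so if several values are tied at the cut-off, which of them is kept depends on the order value_counts returns.

```python
import pandas as pd


def top_n_by_frequency(df, col, n):
    # Hypothetical helper (not part of pandas): keep every row whose value
    # in `col` is one of the n most frequent values of that column.
    top_values = df[col].value_counts().head(n).index
    return df[df[col].isin(top_values)]


df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 3, 4],
                   'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
                   'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z']})

by_x = top_n_by_frequency(df, 'x', 2)  # rows where x is 1 or 2
by_y = top_n_by_frequency(df, 'y', 2)  # rows where y is 'a' or 'b'
print(by_x)
print(by_y)
```

The original row index is preserved, which matches the output shown above.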

CodePudding user response：

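import pandas as pd

def select_top_k(df, col, top_k):
    # Group the rows by the column of interest
    grouping_df = df.groupby(col)
    # groupby sorts the keys by default, so this takes the first top_k
    # distinct values in sorted order (not the most frequent ones)
    gr_list = list(grouping_df.groups)[:top_k]

    # Keep only the groups whose key is one of the selected values
    temp = grouping_df.filter(lambda x: x[col].iloc[0] in gr_list)
    return temp

data = {'x': [1, 1, 1, 2, 2, 3, 4],
        'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
        'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z']}
df = pd.DataFrame(data)

col = 'x'
top_k = 2

print(select_top_k(df, col, top_k))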
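
The same selection can probably be done without groupby.filter, which calls the lambda once per group. A minimal sketch, assuming "top k" here means the first k distinct values in sorted order (which is what the groupby version picks); the name select_first_k_values is made up for illustration:

```python
import pandas as pd


def select_first_k_values(df, col, k):
    # Hypothetical helper: take the first k distinct values of `col`
    # in sorted order, then keep every row that holds one of them.
    first_k = sorted(df[col].unique())[:k]
    return df[df[col].isin(first_k)]


df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 3, 4],
                   'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
                   'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z']})

result_x = select_first_k_values(df, 'x', 2)  # rows where x is 1 or 2
result_y = select_first_k_values(df, 'y', 2)  # rows where y is 'a' or 'b'
print(result_x)
print(result_y)
```

On this data the sorted-order and most-frequent interpretations happen to give the same rows, so check which one you actually need before picking an approach.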