I have a pandas DataFrame, say:
x | y | z |
---|---|---|
1 | a | x |
1 | b | y |
1 | c | z |
2 | a | x |
2 | b | x |
3 | a | y |
4 | a | z |
If I want the top 2 values by the x column, I expect:
x | y | z |
---|---|---|
1 | a | x |
1 | b | y |
1 | c | z |
2 | a | x |
2 | b | x |
If I want the top 2 values by the y column, I expect:
x | y | z |
---|---|---|
1 | a | x |
1 | b | y |
2 | a | x |
2 | b | x |
3 | a | y |
4 | a | z |
How can I achieve this?
CodePudding user response:
You can use value_counts to find the 2 most frequent values in the column and isin to keep the matching rows:
>>> df[df['x'].isin(df['x'].value_counts().head(2).index)]
   x  y  z
0  1  a  x
1  1  b  y
2  1  c  z
3  2  a  x
4  2  b  x
>>> df[df['y'].isin(df['y'].value_counts().head(2).index)]
   x  y  z
0  1  a  x
1  1  b  y
3  2  a  x
4  2  b  x
5  3  a  y
6  4  a  z
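The same idea can be wrapped in a small helper. This is only a minimal sketch: the function name top_k_by_frequency and its parameters are hypothetical, not part of pandas or of the answer above, and how tied counts are ordered is left to value_counts.

```python
import pandas as pd

def top_k_by_frequency(df, col, k):
    """Keep the rows whose value in `col` is among the k most frequent values."""
    # value_counts() sorts values by descending count; head(k) keeps the top k.
    top_values = df[col].value_counts().head(k).index
    # isin() builds a boolean mask, so the original row order is preserved.
    return df[df[col].isin(top_values)]

df = pd.DataFrame({
    'x': [1, 1, 1, 2, 2, 3, 4],
    'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
    'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z'],
})

by_x = top_k_by_frequency(df, 'x', 2)  # rows where x is 1 or 2
by_y = top_k_by_frequency(df, 'y', 2)  # rows where y is 'a' or 'b'
```

Because the mask is applied to the original frame, the result keeps the original index labels, as in the outputs shown above.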
CodePudding user response:
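import pandas as pd

def select_top_k(df, col, top_k):
    # Group rows by the chosen column.
    grouping_df = df.groupby(col)
    # Take the first top_k group keys. groupby sorts keys by default,
    # so this selects by key order, not by frequency.
    gr_list = list(grouping_df.groups)[:top_k]
    # Keep only the groups whose key is in that list.
    temp = grouping_df.filter(lambda g: g[col].iloc[0] in gr_list)
    return temp

data = {'x': [1, 1, 1, 2, 2, 3, 4],
        'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
        'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z']}
df = pd.DataFrame(data)
col = 'x'
top_k = 2
select_top_k(df, col, top_k)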
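Note that select_top_k picks the first top_k group keys, which matches the expected output here only because the most frequent values also happen to come first. A rough sketch of where the two readings of "top 2" diverge, using made-up data (not from the question) in which the most frequent value is not among the first keys:

```python
import pandas as pd

# Hypothetical data: 3 is the most frequent value, but 1 and 2 are the first keys.
df = pd.DataFrame({'x': [1, 2, 3, 3, 3, 4, 4]})

# First 2 group keys, the same selection select_top_k makes.
first_keys = list(df.groupby('x').groups)[:2]

# 2 most frequent values, via value_counts().
most_frequent = df['x'].value_counts().head(2).index.tolist()

by_key_order = df[df['x'].isin(first_keys)]     # rows where x is 1 or 2
by_frequency = df[df['x'].isin(most_frequent)]  # rows where x is 3 or 4
```

Which one is right depends on what "top" is meant to be: smallest keys or most frequent values.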