Pandas Groupby return n smallest - return whole row and not just the indices?-CodePudding

I have a data set like this:

query  hit  score1  score2
group1  hit1  0.9  0.8
group1  hit2  0.2  0.9
group1  hit3  0.2  0.9
group2  hit1  0.1  0.3
group2  hit2  0.6  0.8
group2  hit3  0.3  0.3
group2  hit4  0.7  0.5

I want the output to return the lowest n (e.g. 2) rows, per group, based on a columnn (e.g. score 1), so this would turn into:

query  hit  score1  score2
group1  hit2  0.2  0.9
group1  hit3  0.2  0.9
group2  hit1  0.1  0.3
group2  hit3  0.3  0.3

I wrote:

df = pd.read_csv('file_name.txt',sep='\t',header=0)
df = df.groupby('query')['score1'].nsmallest(2).join(df,['score1'])
print(df)

based on seeing this answer.

But I get the error series has no attribute join.

I did this because when I run:

df = pd.read_csv('file_name.txt',sep='\t',header=0)
df = df.groupby('query')['score1'].nsmallest(2)
print(df)

but it only returns the column that the grouping is done by, I wasn't sure how to extend to return the whole row, hence why I tried based on a SO answer that gave me the error.

CodePudding user response：

Sort first, and then get the first N rows from each group.

>>> df.sort_values(["query", "score1"]).groupby("query").head(2) 
    query   hit  score1  score2
1  group1  hit2     0.2     0.9
2  group1  hit3     0.2     0.9
3  group2  hit1     0.1     0.3
5  group2  hit3     0.3     0.3

Alternatively, you can use nsmallest in a groupby.apply:

>>> df.groupby("query").apply(lambda s: s.nsmallest(2, "score1"))
           query   hit  score1  score2
query                                 
group1 1  group1  hit2     0.2     0.9
       2  group1  hit3     0.2     0.9
group2 3  group2  hit1     0.1     0.3
       5  group2  hit3     0.3     0.3

CodePudding user response：

Try using rank:

>>> df[df.groupby("query")["score1"].transform("rank").le(2)]

    query   hit  score1  score2
1  group1  hit2     0.2     0.9
2  group1  hit3     0.2     0.9
3  group2  hit1     0.1     0.3
5  group2  hit3     0.3     0.3

Alternatively with nsmallest and loc:

>>> df.loc[df.groupby('query')['score1'].nsmallest(2).reset_index(level=0, drop=True).index]
    query   hit  score1  score2
1  group1  hit2     0.2     0.9
2  group1  hit3     0.2     0.9
3  group2  hit1     0.1     0.3
5  group2  hit3     0.3     0.3