I have a data set like this:
query hit score1 score2
group1 hit1 0.9 0.8
group1 hit2 0.2 0.9
group1 hit3 0.2 0.9
group2 hit1 0.1 0.3
group2 hit2 0.6 0.8
group2 hit3 0.3 0.3
group2 hit4 0.7 0.5
I want the output to return the lowest n (e.g. 2) rows, per group, based on a columnn (e.g. score 1), so this would turn into:
query hit score1 score2
group1 hit2 0.2 0.9
group1 hit3 0.2 0.9
group2 hit1 0.1 0.3
group2 hit3 0.3 0.3
I wrote:
df = pd.read_csv('file_name.txt',sep='\t',header=0)
df = df.groupby('query')['score1'].nsmallest(2).join(df,['score1'])
print(df)
based on seeing this answer.
But I get the error series has no attribute join
.
I did this because when I run:
df = pd.read_csv('file_name.txt',sep='\t',header=0)
df = df.groupby('query')['score1'].nsmallest(2)
print(df)
but it only returns the column that the grouping is done by, I wasn't sure how to extend to return the whole row, hence why I tried based on a SO answer that gave me the error.
CodePudding user response:
Sort first, and then get the first N rows from each group.
>>> df.sort_values(["query", "score1"]).groupby("query").head(2)
query hit score1 score2
1 group1 hit2 0.2 0.9
2 group1 hit3 0.2 0.9
3 group2 hit1 0.1 0.3
5 group2 hit3 0.3 0.3
Alternatively, you can use nsmallest
in a groupby.apply
:
>>> df.groupby("query").apply(lambda s: s.nsmallest(2, "score1"))
query hit score1 score2
query
group1 1 group1 hit2 0.2 0.9
2 group1 hit3 0.2 0.9
group2 3 group2 hit1 0.1 0.3
5 group2 hit3 0.3 0.3
CodePudding user response:
Try using rank
:
>>> df[df.groupby("query")["score1"].transform("rank").le(2)]
query hit score1 score2
1 group1 hit2 0.2 0.9
2 group1 hit3 0.2 0.9
3 group2 hit1 0.1 0.3
5 group2 hit3 0.3 0.3
Alternatively with nsmallest
and loc
:
>>> df.loc[df.groupby('query')['score1'].nsmallest(2).reset_index(level=0, drop=True).index]
query hit score1 score2
1 group1 hit2 0.2 0.9
2 group1 hit3 0.2 0.9
3 group2 hit1 0.1 0.3
5 group2 hit3 0.3 0.3