Dataframing when the numbers are repeated-CodePudding

a) How to find the largest five SNRs when the ID and SNRs are repeated? And also I want all these three columns as the output. b) I also want the eliminated lines as the output.

         FIT             ID                   SNR
    1011563.fit,  J16142485-3141000 ,       36   
    1011729.fit,  J17210134-3757437 ,       18   
    1011730.fit,  J17210134-3757437 ,       20   
    1011731.fit,  J17210134-3757437 ,       20   
    1011732.fit,  J17210134-3757437 ,       13   
    1011914.fit,  J17210134-3757437 ,       38   
    1011915.fit,  J17210134-3757437 ,       26   
    1011916.fit,  J17210134-3757437 ,       19   
    1011917.fit,  J17210134-3757437 ,       47   
    1011918.fit,  J17210134-3757437 ,       25 ´´´   
The result should look somewhat like this.

  Expected output for a.

          FITS                    ID  SNR
```8  1011917.fit    J17210134-3757437    47
   5  1011914.fit    J17210134-3757437    38
   0  1011563.fit    J16142485-3141000    36
   6  1011915.fit    J17210134-3757437    26
   9  1011918.fit    J17210134-3757437    25
   3  1011731.fit    J17210134-3757437   20
   2  1011730.fit    J17210134-3757437    20 ´´´

Expected output for b) 
 
```          FITS                    ID  SNR
     1  1011729.fit    J17210134-3757437    18
     6  1011915.fit    J17210134-3757437    26
     7  1011916.fit    J17210134-3757437    19
     4  1011732.fit    J17210134-3757437   13´´´

CodePudding user response：

You can use groupby_rank:

rank = df.groupby('ID')['SNR'].rank(method='dense', ascending=False)

a = df[rank <= 5]
b = df[rank > 5]

Output:

>>> a
           FIT                 ID  SNR
0  1011563.fit  J16142485-3141000   36
2  1011730.fit  J17210134-3757437   20
3  1011731.fit  J17210134-3757437   20
5  1011914.fit  J17210134-3757437   38
6  1011915.fit  J17210134-3757437   26
8  1011917.fit  J17210134-3757437   47
9  1011918.fit  J17210134-3757437   25

>>> b
           FIT                 ID  SNR
1  1011729.fit  J17210134-3757437   18
4  1011732.fit  J17210134-3757437   13
7  1011916.fit  J17210134-3757437   19

CodePudding user response：

You can get the min of the largest values per group, then slice:

thresh = df.groupby('ID')['SNR'].nlargest(5).groupby(level=0).min()

m = df['ID'].map(thresh).le(df['SNR'])

a = df[m]

b = df[~m]

output:

# tresh
ID
J16142485-3141000    36
J17210134-3757437    20
Name: SNR, dtype: int64

# a
           FIT                 ID  SNR
0  1011563.fit  J16142485-3141000   36
2  1011730.fit  J17210134-3757437   20
3  1011731.fit  J17210134-3757437   20
5  1011914.fit  J17210134-3757437   38
6  1011915.fit  J17210134-3757437   26
8  1011917.fit  J17210134-3757437   47
9  1011918.fit  J17210134-3757437   25

# b
           FIT                 ID  SNR
1  1011729.fit  J17210134-3757437   18
4  1011732.fit  J17210134-3757437   13
7  1011916.fit  J17210134-3757437   19