Extract subset of dataframe by value of column - column datatypes are mixed-CodePudding

I have a dataframe like this:

Seq              Value
20-35-ABCDE     14268142.986651151
21-33-ABEDFD    4204281194.109206
61-72-ASDASD    172970.7123134008912
61-76-ASLDKAS   869238.232460215262
63-72-ASDASD    string1
63-76-OIASD     20823821.49471747433
64-76-ASDAS(s)D string1
65-72-AS*AS     8762472.99003354316
65-76-ASYAD*S*  32512348.3285536161
66-76-A(AD(AD))  3843230.72933184169

I want to rank the rows based on the Value, highest to lowest, and return the top 50% of the rows (where the row number could change over time).

I wrote this to do the ranking:

df = pd.read_csv(sys.argv[1],sep='\t')
df.columns =['Seq', 'Value']

#cant do because there are some strings
#pd.to_numeric(df['Value'])

df2 = df.sort_values(['Value'], ascending=True).head(10)
print(df2)

The output is like this:

Seq            Value
17210  ASK1                     0.0
15061  ASD**ASDHA               0.0
41110  ASD(£)DA                 1.4355078174305618
50638  EILMH                    1000.7985554926368
62019  VSEFMTRLF                10000.89805735126
41473  LEDSAGES                 10002.182707004016
41473  LEDSASDES                10000886.012834921

So I guess it sorted them by string instead of floats, but I'm struggling to understand how to sort by float because some of the entries in that column say string1 (and I want all the string1 to go to the end of the list, i.e. I want to sort by all the floats, and then just put all the string1s at the end), and then I want to be able to return the Seq values in the top 50% of the sorted rows.

Can someone help me with this, even just the sorting part?

CodePudding user response：

The problem is that your column is storing the values as strings, so they will sort according to string sorting, not numeric sorting. You can sort numerically using the key of DataFrame.sort_values, which also allows you to preserve the string values in that column.

Another option would be to turn that column into a numeric column before the sort, but then non-numeric values must be replaced with NaN

Sample data

import pandas as pd
df = pd.DataFrame({'Seq': [1,2,3,4,5],
                   'Value': ['11', '2', '1.411', 'string1', '91']})

# String sorting
df.sort_values('Value')
#Seq    Value
#2    3    1.411
#0    1       11
#1    2        2
#4    5       91
#3    4  string1

Code

# Numeric sorting
df.sort_values('Value', key=lambda x: pd.to_numeric(x, errors='coerce'))

   Seq    Value
2    3    1.411
1    2        2
0    1       11
4    5       91
3    4  string1

CodePudding user response：

Honestly, I think it would make more sense to use one column for the float values and one for the strings. That said, you can convert to numeric only for the sorting using the key parameter of sort_values. The NaN/strings will be pushed to the end.

df.sort_values(by='Value', key=lambda x: pd.to_numeric(x, errors='coerce'))

output:

               Seq                 Value
2     61-72-ASDASD  172970.7123134008912
3    61-76-ASLDKAS   869238.232460215262
9  66-76-A(AD(AD))   3843230.72933184169
7      65-72-AS*AS   8762472.99003354316
0      20-35-ABCDE    14268142.986651151
5      63-76-OIASD  20823821.49471747433
8   65-76-ASYAD*S*   32512348.3285536161
1     21-33-ABEDFD     4204281194.109206
4     63-72-ASDASD               string1
6  64-76-ASDAS(s)D               string1

alternative splitting the floats and strings apart:

s = df['Value']
(df.assign(Value=pd.to_numeric(s, errors='coerce'),
           Strings=lambda d: s.where(d['Value'].isna())
          )
   .sort_values(by=['Value', 'Strings'])
)

output:

               Seq         Value  Strings
2     61-72-ASDASD  1.729707e 05      NaN
3    61-76-ASLDKAS  8.692382e 05      NaN
9  66-76-A(AD(AD))  3.843231e 06      NaN
7      65-72-AS*AS  8.762473e 06      NaN
0      20-35-ABCDE  1.426814e 07      NaN
5      63-76-OIASD  2.082382e 07      NaN
8   65-76-ASYAD*S*  3.251235e 07      NaN
1     21-33-ABEDFD  4.204281e 09      NaN
4     63-72-ASDASD           NaN  string1
6  64-76-ASDAS(s)D           NaN  string1