I have a dataframe like this:
Seq Value
20-35-ABCDE 14268142.986651151
21-33-ABEDFD 4204281194.109206
61-72-ASDASD 172970.7123134008912
61-76-ASLDKAS 869238.232460215262
63-72-ASDASD string1
63-76-OIASD 20823821.49471747433
64-76-ASDAS(s)D string1
65-72-AS*AS 8762472.99003354316
65-76-ASYAD*S* 32512348.3285536161
66-76-A(AD(AD)) 3843230.72933184169
I want to rank the rows based on the Value, highest to lowest, and return the top 50% of the rows (where the row number could change over time).
I wrote this to do the ranking:
df = pd.read_csv(sys.argv[1],sep='\t')
df.columns =['Seq', 'Value']
#cant do because there are some strings
#pd.to_numeric(df['Value'])
df2 = df.sort_values(['Value'], ascending=True).head(10)
print(df2)
The output is like this:
Seq Value
17210 ASK1 0.0
15061 ASD**ASDHA 0.0
41110 ASD(£)DA 1.4355078174305618
50638 EILMH 1000.7985554926368
62019 VSEFMTRLF 10000.89805735126
41473 LEDSAGES 10002.182707004016
41473 LEDSASDES 10000886.012834921
So I guess it sorted them by string instead of floats, but I'm struggling to understand how to sort by float because some of the entries in that column say string1 (and I want all the string1 to go to the end of the list, i.e. I want to sort by all the floats, and then just put all the string1s at the end), and then I want to be able to return the Seq values in the top 50% of the sorted rows.
Can someone help me with this, even just the sorting part?
CodePudding user response:
The problem is that your column is storing the values as strings, so they will sort according to string sorting, not numeric sorting. You can sort numerically using the key
of DataFrame.sort_values
, which also allows you to preserve the string values in that column.
Another option would be to turn that column into a numeric column before the sort, but then non-numeric values must be replaced with NaN
Sample data
import pandas as pd
df = pd.DataFrame({'Seq': [1,2,3,4,5],
'Value': ['11', '2', '1.411', 'string1', '91']})
# String sorting
df.sort_values('Value')
#Seq Value
#2 3 1.411
#0 1 11
#1 2 2
#4 5 91
#3 4 string1
Code
# Numeric sorting
df.sort_values('Value', key=lambda x: pd.to_numeric(x, errors='coerce'))
Seq Value
2 3 1.411
1 2 2
0 1 11
4 5 91
3 4 string1
CodePudding user response:
Honestly, I think it would make more sense to use one column for the float values and one for the strings. That said, you can convert to numeric only for the sorting using the key
parameter of sort_values
. The NaN/strings will be pushed to the end.
df.sort_values(by='Value', key=lambda x: pd.to_numeric(x, errors='coerce'))
output:
Seq Value
2 61-72-ASDASD 172970.7123134008912
3 61-76-ASLDKAS 869238.232460215262
9 66-76-A(AD(AD)) 3843230.72933184169
7 65-72-AS*AS 8762472.99003354316
0 20-35-ABCDE 14268142.986651151
5 63-76-OIASD 20823821.49471747433
8 65-76-ASYAD*S* 32512348.3285536161
1 21-33-ABEDFD 4204281194.109206
4 63-72-ASDASD string1
6 64-76-ASDAS(s)D string1
alternative splitting the floats and strings apart:
s = df['Value']
(df.assign(Value=pd.to_numeric(s, errors='coerce'),
Strings=lambda d: s.where(d['Value'].isna())
)
.sort_values(by=['Value', 'Strings'])
)
output:
Seq Value Strings
2 61-72-ASDASD 1.729707e 05 NaN
3 61-76-ASLDKAS 8.692382e 05 NaN
9 66-76-A(AD(AD)) 3.843231e 06 NaN
7 65-72-AS*AS 8.762473e 06 NaN
0 20-35-ABCDE 1.426814e 07 NaN
5 63-76-OIASD 2.082382e 07 NaN
8 65-76-ASYAD*S* 3.251235e 07 NaN
1 21-33-ABEDFD 4.204281e 09 NaN
4 63-72-ASDASD NaN string1
6 64-76-ASDAS(s)D NaN string1