Sorting column of numerical string data in pandas-CodePudding

I have a column of about 100 values; mixed integers and decimals (eg 27, 27.2, 28) but they're stored as datatype string (eg '27', '27.2', '28'). The data was compiled from multiple sources and some of those compiling the data did not have the precision necessary for the decimal values and so entered data with '>' or "<' characters. So add a '>27' to the example column:

col_1
27
>27
27.2
28

The values are out of sort and I would like to sort them from lowest to highest and convert them back to datatype string. My solution is to convert everything to a float, sort on the numerical values and then convert everything back but the values without precision are getting in the way. My thinking was to add characters to the end of those values, say '.001', and removing the ">" and "<" characters before converting, sorting, and then converting everything back. So, when doing the operation to add '0.001' to the string value I do this:

df['col_1'].loc[df['col_1].str.contains('>')] = df['col_1'].loc[df['col_1].str.contains('>')]   '.001'

Is there a better, more acceptable, or maybe more efficient way to do this?

CodePudding user response：

Here is one way to approach the problem

s = df['col_1'].str.extract(r'(<|>)?(.*)')
s[0], s[1] = s[0].fillna('='), s[1].astype(float)

df_new = df.loc[s.sort_values([1, 0]).index]

Step by step details

Extract symbols and numbers

>>> s

     0     1
0  NaN    27
1    >    27
2  NaN  27.2
3  NaN    28
4    <    27

Fill the NaN values in column 0 with =, since < < = < >. Then change the dtype of column 1 to float

Sort the above dataframe and get the sorted index, then use the sorted index to sort the given dataframe

>>> df_new

  col_1
4   <27
0    27
1   >27
2  27.2
3    28

CodePudding user response：

use an apply and a regular expression to transform the string data into a float for sorting.

col_1=['27','>27','27.2','28']
df=pd.DataFrame({'col_1':col_1})

df['sort']=df['col_1'].apply(lambda x: float(re.sub('>','',x)) 0.001 if '>' in x else float(x))

print(df)

output

  col_1    sort
0    27  27.000
1   >27  27.001
2  27.2  27.200
3    28  28.000

CodePudding user response：

diff = df['col_1'].str.contains(r'<').map({True:-0.001})
diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
diff = diff.fillna(0)

df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float)   diff
df = df.sort_values('col_1', ascending=False)

Output:

>>> df
  col_1
0   <26
1    27
2   >27
3  27.2
4    28

>>> diff = df['col_1'].str.contains(r'<').map({True:-0.001})
>>> diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
>>> diff = diff.fillna(0)
>>> diff
0   -0.001
1    0.000
2    0.001
3    0.000
4    0.000
Name: col_1, dtype: float64

>>> df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float)   diff
>>> df = df.sort_values('col_1', ascending=False)
>>> df
    col_1
0  25.999
1  27.000
2  27.001
3  27.200
4  28.000

CodePudding user response：

Use pd.eval. Replace '>' and '<' by '0.001 ' and '-0.001 ' then evaluate the expression to convert as float numbers. Finally, use sort_values and reindex to sort your original dataframe.

Setup an example:

df = pd.DataFrame({'col_1': ['>27', '28', '27', '<27', '27.2'], 
                   'col_2': range(11, 16)})
print(df)

# Output:
  col_1  col_2
0   >27     11
1    28     12
2    27     13
3   <27     14
4  27.2     15

out = df.reindex(df['col_1'].replace({'>': '0.001 ', '<': '-0.001 '}, regex=True) \
                            .apply(pd.eval).sort_values().index)
print(out)

# Output:
  col_1  col_2
3   <27     14
2    27     13
0   >27     11
4  27.2     15
1    28     12