I have a column of about 100 values; mixed integers and decimals (eg 27
, 27.2
, 28
) but they're stored as datatype string (eg '27'
, '27.2'
, '28'
). The data was compiled from multiple sources and some of those compiling the data did not have the precision necessary for the decimal values and so entered data with '>' or "<' characters. So add a '>27'
to the example column:
col_1
27
>27
27.2
28
The values are out of sort and I would like to sort them from lowest to highest and convert them back to datatype string. My solution is to convert everything to a float, sort on the numerical values and then convert everything back but the values without precision are getting in the way.
My thinking was to add characters to the end of those values, say '.001'
, and removing the ">" and "<" characters before converting, sorting, and then converting everything back. So, when doing the operation to add '0.001'
to the string value I do this:
df['col_1'].loc[df['col_1].str.contains('>')] = df['col_1'].loc[df['col_1].str.contains('>')] '.001'
Is there a better, more acceptable, or maybe more efficient way to do this?
CodePudding user response:
Here is one way to approach the problem
s = df['col_1'].str.extract(r'(<|>)?(.*)')
s[0], s[1] = s[0].fillna('='), s[1].astype(float)
df_new = df.loc[s.sort_values([1, 0]).index]
Step by step details
- Extract symbols and numbers
>>> s
0 1
0 NaN 27
1 > 27
2 NaN 27.2
3 NaN 28
4 < 27
- Fill the
NaN
values in column0
with=
, since<
<=
<>
. Then change thedtype
of column1
tofloat
>>> s
0 1
0 = 27.0
1 > 27.0
2 = 27.2
3 = 28.0
4 < 27.0
- Sort the above dataframe and get the sorted index, then use the sorted index to sort the given dataframe
>>> df_new
col_1
4 <27
0 27
1 >27
2 27.2
3 28
CodePudding user response:
use an apply and a regular expression to transform the string data into a float for sorting.
col_1=['27','>27','27.2','28']
df=pd.DataFrame({'col_1':col_1})
df['sort']=df['col_1'].apply(lambda x: float(re.sub('>','',x)) 0.001 if '>' in x else float(x))
print(df)
output
col_1 sort
0 27 27.000
1 >27 27.001
2 27.2 27.200
3 28 28.000
CodePudding user response:
diff = df['col_1'].str.contains(r'<').map({True:-0.001})
diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
diff = diff.fillna(0)
df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float) diff
df = df.sort_values('col_1', ascending=False)
Output:
>>> df
col_1
0 <26
1 27
2 >27
3 27.2
4 28
>>> diff = df['col_1'].str.contains(r'<').map({True:-0.001})
>>> diff.update(df['col_1'].str.contains(r'>').map({True:0.001}))
>>> diff = diff.fillna(0)
>>> diff
0 -0.001
1 0.000
2 0.001
3 0.000
4 0.000
Name: col_1, dtype: float64
>>> df['col_1'] = df['col_1'].str.replace(r'[^0-9.eE-]', '', regex=True).astype(float) diff
>>> df = df.sort_values('col_1', ascending=False)
>>> df
col_1
0 25.999
1 27.000
2 27.001
3 27.200
4 28.000
CodePudding user response:
Use pd.eval
. Replace '>' and '<' by '0.001 ' and '-0.001 ' then evaluate the expression to convert as float numbers. Finally, use sort_values
and reindex
to sort your original dataframe.
Setup an example:
df = pd.DataFrame({'col_1': ['>27', '28', '27', '<27', '27.2'],
'col_2': range(11, 16)})
print(df)
# Output:
col_1 col_2
0 >27 11
1 28 12
2 27 13
3 <27 14
4 27.2 15
out = df.reindex(df['col_1'].replace({'>': '0.001 ', '<': '-0.001 '}, regex=True) \
.apply(pd.eval).sort_values().index)
print(out)
# Output:
col_1 col_2
3 <27 14
2 27 13
0 >27 11
4 27.2 15
1 28 12