Convert the string into a float value-CodePudding

I have copied a table with three columns from a pdf file. I am attaching the screenshot from the PDF here:

The values in the column padj are exponential values, however, when you copy from the pdf to an excel and then open it with pandas, these are strings or object data types. Hence, these values cannot be parsed as floats or numeric values. I need these values as floats, not as strings. Can someone help me with some suggestions? So far this is what I have tried.

The excel or the csv file is then opened in python using the escape_unicode encoding in order to circumvent the UnicodeDecodeError

## open the file
df       = pd.read_csv("S2_GSE184956.csv",header=0,sep=',',encoding='unicode_escape')[["DEGs","LFC","padj"]]

   df.head()
        DEGs   padj          LFC
    0   JUNB    1.5 ×10-8   -1.273329
    1   HOOK2   2.39×10-7   -1.109320
    2   EGR1    3.17×10-6   -4.187828
    3   DUSP1   3.95×10-6   -3.251030
    4   IL6     3.95×10-6   -3.415500
    5   ARL4C   5.06×10-6   -2.147519
    6   NR4A2   2.94×10-4   -3.001167
    7   CCL3L1  4.026×10-4  -5.293694


# Convert the string to float by replacing the x10- with exponential sign 
    df['padj'] = df['padj'].apply(lambda x: (unidecode(x).replace('x10-','x10-e'))).astype(float)
That threw an error,

    ValueError: could not convert string to float: '1.5 x10-e8'

Any suggestions would be appreciated. Thanks

CodePudding user response：

Considering your sample (column padj), the code below should work:

f_value = eval(str_float.replace('x10', 'e').replace(' ', ''))

CodePudding user response：

With the dataframe shared in the question on

CodePudding user response：

If you want a numerical vectorial solution, you can use:

df['float'] = (df['padj'].str.extract(r'(\d (?:\.\d ))\s*×10(.?\d )')
 .apply(pd.to_numeric).pipe(lambda d: d[0].mul(10.**d[1]))
)

output:

     DEGs        padj       LFC         float
0    JUNB   1.5 ×10-8 -1.273329  1.500000e-08
1   HOOK2   2.39×10-7 -1.109320  2.390000e-07
2    EGR1   3.17×10-6 -4.187828  3.170000e-06
3   DUSP1   3.95×10-6 -3.251030  3.950000e-06
4     IL6   3.95×10-6 -3.415500  3.950000e-06
5   ARL4C   5.06×10-6 -2.147519  5.060000e-06
6   NR4A2   2.94×10-4 -3.001167  2.940000e-04
7  CCL3L1  4.026×10-4 -5.293694  4.026000e-04

Intermediate:

df['padj'].str.extract('(\d (?:\.\d ))\s*×10(.?\d )')

       0   1
0    1.5  -8
1   2.39  -7
2   3.17  -6
3   3.95  -6
4   3.95  -6
5   5.06  -6
6   2.94  -4
7  4.026  -4