I have copied a table with three columns from a pdf file. I am attaching the screenshot from the PDF here:
The values in the column padj are exponential values, however, when you copy from the pdf to an excel and then open it with pandas, these are strings or object data types. Hence, these values cannot be parsed as floats or numeric values. I need these values as floats, not as strings. Can someone help me with some suggestions? So far this is what I have tried.
The excel or the csv
file is then opened in python using the escape_unicode
encoding in order to circumvent the UnicodeDecodeError
## open the file
df = pd.read_csv("S2_GSE184956.csv",header=0,sep=',',encoding='unicode_escape')[["DEGs","LFC","padj"]]
df.head()
DEGs padj LFC
0 JUNB 1.5 ×10-8 -1.273329
1 HOOK2 2.39×10-7 -1.109320
2 EGR1 3.17×10-6 -4.187828
3 DUSP1 3.95×10-6 -3.251030
4 IL6 3.95×10-6 -3.415500
5 ARL4C 5.06×10-6 -2.147519
6 NR4A2 2.94×10-4 -3.001167
7 CCL3L1 4.026×10-4 -5.293694
# Convert the string to float by replacing the x10- with exponential sign
df['padj'] = df['padj'].apply(lambda x: (unidecode(x).replace('x10-','x10-e'))).astype(float)
That threw an error,
ValueError: could not convert string to float: '1.5 x10-e8'
Any suggestions would be appreciated. Thanks
CodePudding user response:
Considering your sample (column padj), the code below should work:
f_value = eval(str_float.replace('x10', 'e').replace(' ', ''))
CodePudding user response:
With the dataframe shared in the question on
CodePudding user response:
If you want a numerical vectorial solution, you can use:
df['float'] = (df['padj'].str.extract(r'(\d (?:\.\d ))\s*×10(.?\d )')
.apply(pd.to_numeric).pipe(lambda d: d[0].mul(10.**d[1]))
)
output:
DEGs padj LFC float
0 JUNB 1.5 ×10-8 -1.273329 1.500000e-08
1 HOOK2 2.39×10-7 -1.109320 2.390000e-07
2 EGR1 3.17×10-6 -4.187828 3.170000e-06
3 DUSP1 3.95×10-6 -3.251030 3.950000e-06
4 IL6 3.95×10-6 -3.415500 3.950000e-06
5 ARL4C 5.06×10-6 -2.147519 5.060000e-06
6 NR4A2 2.94×10-4 -3.001167 2.940000e-04
7 CCL3L1 4.026×10-4 -5.293694 4.026000e-04
Intermediate:
df['padj'].str.extract('(\d (?:\.\d ))\s*×10(.?\d )')
0 1
0 1.5 -8
1 2.39 -7
2 3.17 -6
3 3.95 -6
4 3.95 -6
5 5.06 -6
6 2.94 -4
7 4.026 -4