I have a .tsv file like this:
sequences | label |
---|---|
[[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],[0.0, 0.0, 1.0, 0.0],[0.0, 0.0, 1.0, 0.0],[0.0, 0.0, 1.0, 0.0],[0.0, 0.0, 1.0, 0.0]] | 1 |
I want to import the column sequences into a pd.DataFrame as np.float64.
But I get this error:
df = pd.read_csv('AARS.tsv', sep='\t', dtype = np.float64)
ValueError: could not convert string to float
I would be grateful if you can give me any suggestions!
Many thanks!
CodePudding user response:
Your first column does not look like it is a float64.
You could leave out the dtype=... and check the type of the data:
import pandas as pd
import numpy as np
df = pd.read_csv('AARS.tsv', sep='\t', usecols=['label', 'sequences'])
# Print the Python type of every cell, one line per row
for item in df.values:
    for i in range(item.size):
        print(type(item[i]), end=" ")
    print()
This will output something like the following (when recreating your input, I added a header line with the column titles):
<class 'str'> <class 'int'>
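Since the sequences cell is read as a str, one possible way to get float64 values is to parse each cell after loading. This is only a minimal sketch, assuming every cell is a valid JSON-style nested list as in the sample row. The in-memory `tsv_text` stands in for the real file, and `parse_cell` is a hypothetical helper name:

```python
import io
import json

import numpy as np
import pandas as pd

# Stand-in for the real file (assumption); pass the path to the .tsv instead
tsv_text = (
    "sequences\tlabel\n"
    "[[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]\t1\n"
)

def parse_cell(cell: str) -> np.ndarray:
    """Parse the text of one cell into a float64 array (hypothetical helper)."""
    return np.array(json.loads(cell), dtype=np.float64)

# Read without dtype=..., so the sequences column arrives as plain strings
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# One array per row of the file
arrays = [parse_cell(cell) for cell in df["sequences"]]
first = arrays[0]
print(first.dtype, first.shape)
```

The arrays are kept in a plain list here because a DataFrame column cannot hold a nested list as a single np.float64 value; if all rows have the same shape, np.stack(arrays) would combine them into one 3D array.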
CodePudding user response:
Here is a proposition using some of the pandas StringMethods and pandas.Series.explode:
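import numpy as np
import pandas as pd
out = (
    pd.read_csv("AARS.tsv", sep="\t", usecols=["sequences"])
    # Strip the outer brackets, merge the inner lists, then split into single values
    .assign(temp=lambda x: x["sequences"].str.strip("[]")
                                         .str.replace(r"\]\s*,\s*\[", ", ", regex=True)
                                         .str.split(","))
    .explode("temp")
    # Keep only the exploded column; the raw "sequences" strings cannot be cast to float
    .loc[:, ["temp"]]
    .astype(float)
    .values
)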
# Output:
print(out)
[[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]]
print(type(out))
<class 'numpy.ndarray'>
If you need a different 2D shape, for example two columns, use numpy.reshape:
print(np.reshape(out, (-1, 2)))
[[0. 1.]
[0. 0.]
[0. 0.]
[1. 0.]
[0. 0.]
[1. 0.]
[0. 0.]
[1. 0.]
[0. 0.]
[1. 0.]
[0. 0.]
[1. 0.]]