Error while reading csv file: converting a column from string to float

I am trying to read a CSV file that contains a column, SpType, holding string values. The resulting array ends up with dtype object, but I need it to be float. Here's the snippet:

import pandas as pd
import torch

data = pd.read_csv("/content/Star3642_balanced.csv")

X_orig = data[["Vmag", "Plx", "e_Plx", "B-V", "SpType", "Amag"]].to_numpy()

Here's what's giving me the error:

X = torch.tensor(X_orig, dtype=torch.float32)

The error reads "can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool."

I tried doing this after reading the csv file, but it didn't help:

data["SpType"] = data.SpType.astype(float)

Can someone please tell me what can be done about this?

Answer:

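astype(float) only works on strings that spell out numbers; SpType presumably holds category labels, so the strings have to be encoded into numeric values first. The easiest way is pandas one-hot encoding (this creates one extra column per distinct SpType value, which can be a lot here, but a neural network should handle those without much trouble):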

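# dtype=float: recent pandas versions return bool indicator columns by default,
# and a frame mixing bool and float columns converts to an object array again
ohe = pd.get_dummies(data["SpType"], drop_first=True, dtype=float)
data[ohe.columns] = ohe
data = data.drop(["SpType"], axis=1)
# build X_orig from the remaining columns instead of selecting "SpType"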
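
Put together, the steps above might look roughly like the following. This is only a sketch: the small DataFrame is a made-up stand-in for Star3642_balanced.csv (the values are not from the real file), and the prefix="SpType" argument is my own addition to keep the new column names readable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Star3642_balanced.csv; all values are made up.
data = pd.DataFrame({
    "Vmag": [5.99, 8.70, 7.53, 6.10],
    "Plx": [13.73, 2.31, 5.12, 9.80],
    "e_Plx": [0.58, 1.29, 0.95, 0.70],
    "B-V": [1.318, -0.045, 0.855, 0.420],
    "SpType": ["K5III", "B1II", "G5III", "K5III"],
    "Amag": [16.68, 15.52, 16.08, 16.06],
})

# One indicator column per spectral type (minus the first), stored as floats.
ohe = pd.get_dummies(data["SpType"], prefix="SpType", drop_first=True, dtype=float)

# Swap the string column for its encoded columns.
features = pd.concat([data.drop(columns="SpType"), ohe], axis=1)

# Every column is numeric now, so the array can be float32 instead of object.
X_orig = features.to_numpy(dtype=np.float32)
print(X_orig.dtype, X_orig.shape)
```

X_orig can then be passed to torch.tensor(X_orig, dtype=torch.float32) as in the question.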

Alternatively, you can use the sklearn encoders or the category_encoders library. More complex encodings (target encoding, for example) should be fit on the training set only and then applied to the test set, to avoid target leakage.
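
As a rough illustration of the sklearn route, here is a sketch that fits a OneHotEncoder on the training rows only and reuses it on the test rows. The train/test frames and the spectral types in them are hypothetical; handle_unknown="ignore" is one possible choice for categories that show up only in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test split; the spectral types are made up.
train = pd.DataFrame({"SpType": ["K5III", "B1II", "G5III"]})
test = pd.DataFrame({"SpType": ["G5III", "M0V"]})  # "M0V" never appears in train

# handle_unknown="ignore" encodes categories unseen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["SpType"]])  # learn the categories from the training set only

# transform() returns a sparse matrix by default; toarray() makes it dense.
X_train = encoder.transform(train[["SpType"]]).toarray().astype(np.float32)
X_test = encoder.transform(test[["SpType"]]).toarray().astype(np.float32)

print(encoder.categories_)
print(X_test)
```

Because the encoder is fit once and then only applied, the test set gets exactly the same columns as the training set, which get_dummies does not guarantee when it is called on each split separately.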