Convert a string-typed pandas column to an array or list

I have a pandas DataFrame as below:

        id                                   emb    
0   529581720   [-0.06815625727176666, 0.054927315562963486, 0...   
1   663817504   [-0.05805483087897301, 0.031277190893888474, 0...   
2   507084910   [-0.07410381734371185, -0.03922194242477417, 0...   
3   1774950548  [-0.09088297933340073, -0.04383128136396408, -...   
4   725573369   [-0.06329705566167831, 0.01242107804864645, 0....

The dtype of the emb column is object. I want to convert its values into a NumPy array, so I tried the following:

embd = df['emb'].values

But because the values are stored as strings, I get the following output:

embd[0]

out:
array('[-0.06815625727176666, 0.054927315562963486, 0.056555990129709244, -0.04559280723333359, -0.025042753666639328, -0.06674829870462418, -0.027613995596766472, 
0.05307046324014664, 0.020159300416707993, 0.012015435844659805, 0.07048438489437103, 
-0.020022081211209297, -0.03899797052145004, -0.03358669579029083, -0.06369364261627197, 
-0.045727960765361786, -0.05619484931230545, -0.07043793052434921, -0.07021039724349976, 
2.8020248282700777E-4, -0.04271571710705757, -0.04004468396306038, 0.01802503503859043, -0.0553901381790638, 0.0068290019407868385, -0.021117383614182472, -0.06583991646766663]',
      dtype='<U11190')

Can someone tell me how to convert this into an array of float32 values?

CodePudding user response：

You can use numpy.array() with dtype=np.float32 to convert an array of strings to float32 values, provided each string holds a single number. Here is an example:

import numpy as np

string_array = ["1.0", "2.5", "3.14"]

float_array = np.array(string_array, dtype=np.float32)

Alternatively, you can use pandas.to_numeric() to convert a DataFrame column from strings to floats. With downcast='float', the result is downcast to the smallest float dtype that can hold the values (float32 at the smallest). Here is an example:

import pandas as pd

df = pd.DataFrame({"A": ["1.0", "2.5", "3.14"]})
df["A"] = pd.to_numeric(df["A"], downcast='float')

You can also pass errors='coerce' to pd.to_numeric() so that strings that cannot be parsed as numbers are replaced with NaN instead of raising an error.

df['A'] = pd.to_numeric(df['A'], errors='coerce')
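In the question, each cell is one string holding a whole bracketed list rather than a single number, so the string has to be split into its numbers first. A minimal sketch, assuming every cell looks like "[a, b, c]" and all rows have the same length; the frame and the helper parse_emb are hypothetical and only mimic the question's layout:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame mimicking the question's layout (values shortened).
df = pd.DataFrame({
    "id": [529581720, 663817504],
    "emb": ["[-0.068, 0.054, 2.8E-4]", "[-0.058, 0.031, 0.056]"],
})

def parse_emb(text):
    # Drop the surrounding brackets, split on commas, and convert each
    # piece with float(), which tolerates spaces and E-notation.
    values = [float(part) for part in text.strip("[]").split(",")]
    return np.array(values, dtype=np.float32)

# Stack the per-row arrays into one 2-D float32 array.
emb = np.stack([parse_emb(s) for s in df["emb"]])
print(emb.shape, emb.dtype)  # (2, 3) float32
```

This only works while every row parses to the same number of values; np.stack raises an error on rows of different lengths.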

CodePudding user response：

Use ast.literal_eval to parse each string into a Python list:

import ast
import numpy as np

df['emb'] = df['emb'].apply(ast.literal_eval)

Output:

>>> df['emb'].values
array([list([-0.06815625727176666, 0.054927315562963486]),
       list([-0.05805483087897301, 0.031277190893888474]),
       list([-0.07410381734371185, -0.03922194242477417]),
       list([-0.09088297933340073, -0.04383128136396408]),
       list([-0.06329705566167831, 0.01242107804864645])], dtype=object)

>>> np.stack(df['emb'].values)
array([[-0.06815626,  0.05492732],
       [-0.05805483,  0.03127719],
       [-0.07410382, -0.03922194],
       [-0.09088298, -0.04383128],
       [-0.06329706,  0.01242108]])
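The stacked array above is float64 by default, while the question asks for float32. One way to get there, sketched here on a hypothetical two-row frame with values taken from the question, is to cast the stacked result with astype:

```python
import ast
import numpy as np
import pandas as pd

# Hypothetical frame: each cell is a string holding a bracketed list.
df = pd.DataFrame({
    "emb": [
        "[-0.06815625727176666, 0.054927315562963486]",
        "[-0.05805483087897301, 0.031277190893888474]",
    ]
})

# Parse each string into a Python list of floats.
df["emb"] = df["emb"].apply(ast.literal_eval)

# Stack the lists into a 2-D array, then cast from float64 to float32.
emb32 = np.stack(df["emb"].values).astype(np.float32)
print(emb32.dtype)  # float32
print(emb32.shape)  # (2, 2)
```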

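Alternatively, to store each list as a NumPy array (apply this to the original string column, not to a column already converted with ast.literal_eval):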

df['emb'] = df['emb'].apply(lambda x: np.array(ast.literal_eval(x)))
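If each cell should hold float32 values directly, np.array accepts a dtype argument. A small sketch on a hypothetical frame, assuming the column still contains the raw strings:

```python
import ast
import numpy as np
import pandas as pd

# Hypothetical frame; the strings stand in for the question's emb column.
df = pd.DataFrame({"emb": ["[0.5, -0.25]", "[1.5, 2.0]"]})

# Parse each string and store the result as a float32 array per cell.
df["emb"] = df["emb"].apply(
    lambda x: np.array(ast.literal_eval(x), dtype=np.float32)
)

first = df["emb"].iloc[0]
print(type(first).__name__, first.dtype)  # ndarray float32
```

The column itself keeps dtype object, because each cell now holds a separate array.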