String type to array or list pandas column

Time:01-17

I have a pandas DataFrame like the one below:

        id                                   emb    
0   529581720   [-0.06815625727176666, 0.054927315562963486, 0...   
1   663817504   [-0.05805483087897301, 0.031277190893888474, 0...   
2   507084910   [-0.07410381734371185, -0.03922194242477417, 0...   
3   1774950548  [-0.09088297933340073, -0.04383128136396408, -...   
4   725573369   [-0.06329705566167831, 0.01242107804864645, 0....

The data type of the emb column is object. I want to convert its values into a numpy array, so I tried the following:

embd = df['emb'].values

But because each value is stored as a string, I get the following output:

embd[0]

out:
array('[-0.06815625727176666, 0.054927315562963486, 0.056555990129709244, -0.04559280723333359, -0.025042753666639328, -0.06674829870462418, -0.027613995596766472, 
0.05307046324014664, 0.020159300416707993, 0.012015435844659805, 0.07048438489437103, 
-0.020022081211209297, -0.03899797052145004, -0.03358669579029083, -0.06369364261627197, 
-0.045727960765361786, -0.05619484931230545, -0.07043793052434921, -0.07021039724349976, 
2.8020248282700777E-4, -0.04271571710705757, -0.04004468396306038, 0.01802503503859043, -0.0553901381790638, 0.0068290019407868385, -0.021117383614182472, -0.06583991646766663]',
      dtype='<U11190')

Can someone tell me how I can convert this into an array of float32 values?

CodePudding user response:

You can use numpy.array() with dtype=np.float32 to convert an array of numeric strings (one number per string) to an array of float32 values. Here is an example:

import numpy as np

string_array = ["1.0", "2.5", "3.14"]

float_array = np.array(string_array, dtype=np.float32)

Alternatively, you can use the pandas function pandas.to_numeric() to convert the values of a column of a dataframe from string to float32. Here is an example:

import pandas as pd

df = pd.DataFrame({"A": ["1.0", "2.5", "3.14"]})
df["A"] = pd.to_numeric(df["A"], downcast='float')

You can also pass errors='coerce' to pd.to_numeric() so that strings which cannot be parsed as numbers are replaced with NaN instead of raising an error.

df['A'] = pd.to_numeric(df['A'], errors='coerce')
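As a rough check of how these two functions behave on a column shaped like the emb column in the question, the sketch below uses a small Series whose values are shortened, illustrative stand-ins for the real data. Each cell is a string holding a whole list rather than a single number, so pd.to_numeric() with errors='coerce' turns every cell into NaN and np.array() raises a ValueError. The list inside each string has to be parsed first.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the question's column: each cell is a
# string that contains a whole list, not a single number.
emb = pd.Series(["[-0.068, 0.054]", "[-0.058, 0.031]"])

# errors="coerce" replaces every unparseable string with NaN.
coerced = pd.to_numeric(emb, errors="coerce")
print(coerced.isna().all())

# np.array cannot interpret a bracketed list string as one float.
try:
    np.array(emb.tolist(), dtype=np.float32)
    raised = False
except ValueError:
    raised = True
print(raised)
```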

CodePudding user response:

Use ast.literal_eval:

import ast

df['emb'] = df['emb'].apply(ast.literal_eval)

Output:

>>> df['emb'].values
array([list([-0.06815625727176666, 0.054927315562963486]),
       list([-0.05805483087897301, 0.031277190893888474]),
       list([-0.07410381734371185, -0.03922194242477417]),
       list([-0.09088297933340073, -0.04383128136396408]),
       list([-0.06329705566167831, 0.01242107804864645])], dtype=object)

>>> np.stack(df['emb'].values)
array([[-0.06815626,  0.05492732],
       [-0.05805483,  0.03127719],
       [-0.07410382, -0.03922194],
       [-0.09088298, -0.04383128],
       [-0.06329706,  0.01242108]])
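The stacked array above is float64, while the question asks for float32. A minimal sketch of the extra cast, assuming every row holds a list of the same length (the two-row frame here is illustrative, with values shortened from the question):

```python
import ast

import numpy as np
import pandas as pd

# Illustrative frame: "emb" holds list-like strings of equal length.
df = pd.DataFrame({
    "id": [529581720, 663817504],
    "emb": [
        "[-0.06815625727176666, 0.054927315562963486]",
        "[-0.05805483087897301, 2.8020248282700777E-4]",
    ],
})

# Parse each string into a Python list of floats.
parsed = df["emb"].apply(ast.literal_eval)

# Stack the lists into one 2-D array, then cast from float64 to float32.
matrix = np.stack(parsed.tolist()).astype(np.float32)
print(matrix.dtype, matrix.shape)
```

If the rows have different lengths, np.stack raises an error, so the lists would need padding or truncating first.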

Alternatively, to store each list as a numpy array in the column:

df['emb'] = df['emb'].apply(lambda x: np.array(ast.literal_eval(x)))
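One practical caveat: ast.literal_eval raises a ValueError when it is given something that is already a list or array, so running the conversion twice on the same column fails. The sketch below wraps the parsing in a helper, parse_embedding, which is a hypothetical name introduced here for illustration. It only parses string cells and always returns a float32 array.

```python
import ast

import numpy as np
import pandas as pd


def parse_embedding(cell):
    """Hypothetical helper: turn a list-like string into a float32 array."""
    # Only strings need parsing; lists and arrays pass straight through.
    if isinstance(cell, str):
        cell = ast.literal_eval(cell)
    return np.asarray(cell, dtype=np.float32)


# Illustrative data, not the values from the question.
df = pd.DataFrame({"emb": ["[0.5, -0.25]", "[1.0, 2.0]"]})

df["emb"] = df["emb"].apply(parse_embedding)
# A second pass is harmless because the cells are no longer strings.
df["emb"] = df["emb"].apply(parse_embedding)

print(df["emb"].iloc[0].dtype)
```

The column itself keeps dtype object, since each cell holds a separate array.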