How to convert a list of vectors into a NumPy array to train a classifier in Python?

I have a pandas data frame that looks like this:

                          corpus             tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  60
1   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  73
2   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  61

My desired output is this:

                           corpus            tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    60
1   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    73
2   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    61

I want to unlist the column tfidf in order to create a numpy array to train a decision tree classifier.

x = df['tfidf'].values
y = df['labels'].values

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

When I tried the code above I got an error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-103-8aa769130bba> in <module>()
  1 from sklearn.tree import DecisionTreeClassifier
  2 classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
----> 3 classifier.fit(x_train, y_train)

What can I do to get the data frame ready for training?

CodePudding user response：

You can explode the lists from the tfidf column into multiple rows and then cast these values to a NumPy array, reshaping it appropriately:

import numpy as np

n_rows = df.shape[0]
n_cols = len(df.loc[0, 'tfidf'])

X = np.array(df['tfidf'].explode().values,
             dtype='float').reshape(n_rows, n_cols)
X

array([[0.  , 0.  , 0.  , 0.01, 0.8 ],
       [0.  , 0.  , 0.  , 0.01, 0.8 ],
       [0.  , 0.  , 0.  , 0.01, 0.8 ]])
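
A possible alternative that skips the explode and reshape steps, sketched under the assumption that every list in tfidf has the same length: Series.tolist() returns a plain list of lists, which np.array can convert to a 2-D float array directly. The frame below is a made-up reconstruction of the question's data, not the asker's real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the one in the question.
df = pd.DataFrame({
    "corpus": ["dfnkdfnkf asdfhedfh ajdladja"] * 3,
    "tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8]] * 3,
    "labels": [60, 73, 61],
})

# A list of equal-length lists becomes one row per list.
X = np.array(df["tfidf"].tolist(), dtype=float)
y = df["labels"].to_numpy()

print(X.shape)  # (n_rows, n_cols)
print(X.dtype)
```

If the lists differ in length, this conversion fails, so it is worth checking df['tfidf'].str.len().nunique() == 1 first.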

CodePudding user response：

In the first display

                          corpus             tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  60
 ...

the values in the tfidf column are Python lists (though strings and NumPy arrays would display the same way).

df[col].values will produce a 1d object dtype array containing these lists.
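
A small check of that claim, assuming a column built from Python lists as in the question (the frame here is hypothetical): the array has one dimension, and each element is still a list, which is why scikit-learn cannot convert it to floats.

```python
import pandas as pd

# Hypothetical single-column frame holding Python lists.
df = pd.DataFrame({"tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8]] * 3})

x = df["tfidf"].values

print(x.shape)     # one entry per row, no second dimension
print(x.dtype)     # object
print(type(x[0]))  # each element is a Python list
```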

x = np.stack(df[col].values) will turn that into a 2d float dtype array, provided every list has the same length; if the lengths differ it raises a ValueError.
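
A minimal sketch of how np.stack could slot into the question's training code. The data values are made up for illustration, and the frame has more rows than the question shows so that the train/test split has something to work with.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data in the shape of the question's frame (values made up).
df = pd.DataFrame({
    "tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8],
              [0.1, 0.0, 0.3, 0.00, 0.2],
              [0.0, 0.5, 0.0, 0.02, 0.1],
              [0.4, 0.0, 0.0, 0.00, 0.6]] * 2,
    "labels": [60, 73, 61, 60] * 2,
})

# Stack the per-row lists into one (n_rows, n_features) float array.
X = np.stack(df["tfidf"].values)
y = df["labels"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(X_train, y_train)

print(classifier.predict(X_test))
```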

The second "unlisted" display is not something a column of lists can show; you would only see it if the cells were strings with the [] stripped off.

                           corpus            tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    60
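
If the cells do turn out to be strings rather than lists (for example after a round trip through CSV), they need parsing before stacking. A sketch under that assumption, with a hypothetical frame covering both the bracketed and the stripped form:

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical frame whose cells are strings, not lists.
df = pd.DataFrame({
    "with_brackets": ["[0.0, 0.0, 0.0, 0.01, 0.8]"] * 3,
    "stripped": ["0.0, 0.0, 0.0, 0.01, 0.8"] * 3,
})

# Bracketed strings are valid Python literals, so literal_eval can parse them.
X_a = np.stack(df["with_brackets"].apply(ast.literal_eval).values)

# Stripped strings need a manual split on commas.
rows = [[float(tok) for tok in cell.split(",")] for cell in df["stripped"]]
X_b = np.array(rows, dtype=float)

print(X_a.shape, X_b.shape)
```

Checking type(df.loc[df.index[0], 'tfidf']) is a quick way to tell which case applies.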

Dataframes with list or array elements are something of an anomaly, and many beginner users aren't prepared to deal with them. Frames are easiest to work with when the cell values are strings or numbers. But even strings are stored as Python objects.