I have a pandas data frame that looks like this:
corpus tfidf labels
0 dfnkdfnkf asdfhedfh ajdladja [0.0, 0.0, 0.0, 0.01, 0.8] 60
1 dfnkdfnkf asdfhedfh ajdladja [0.0, 0.0, 0.0, 0.01, 0.8] 73
2 dfnkdfnkf asdfhedfh ajdladja [0.0, 0.0, 0.0, 0.01, 0.8] 61
My desired output is this:
corpus tfidf labels
0 dfnkdfnkf asdfhedfh ajdladja 0.0, 0.0, 0.0, 0.01, 0.8 60
1 dfnkdfnkf asdfhedfh ajdladja 0.0, 0.0, 0.0, 0.01, 0.8 73
2 dfnkdfnkf asdfhedfh ajdladja 0.0, 0.0, 0.0, 0.01, 0.8 61
I want to unlist the column tfidf in order to create a numpy array to train a decision tree classifier.
x = df['tfidf'].values
y = df['labels'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
When I tried the code above I got an error:
TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-103-8aa769130bba> in <module>()
      1 from sklearn.tree import DecisionTreeClassifier
      2 classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
----> 3 classifier.fit(x_train, y_train)
What can I do to get the data frame ready for training?
CodePudding user response:
You can explode the lists from the tfidf column into multiple rows and then cast these values to a NumPy array, reshaping it appropriately:
import numpy as np
n_rows = df.shape[0]
n_cols = len(df.loc[0, 'tfidf'])
X = np.array(df['tfidf'].explode().values, dtype='float').reshape(n_rows, n_cols)
X
array([[0. , 0. , 0. , 0.01, 0.8 ],
[0. , 0. , 0. , 0.01, 0.8 ],
[0. , 0. , 0. , 0.01, 0.8 ]])
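The reshape above only works if every list in the column has the same length, and df.loc[0, 'tfidf'] assumes the index still contains the label 0. A minimal sketch of the same explode approach with both assumptions checked first; the toy frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); each tfidf cell is a Python list.
df = pd.DataFrame({
    "tfidf": [[0.0, 0.01, 0.8], [0.1, 0.0, 0.5], [0.0, 0.2, 0.3]],
    "labels": [60, 73, 61],
})

# Every row must hold a list of the same length, otherwise the
# exploded values cannot be reshaped into a rectangular array.
lengths = df["tfidf"].map(len)
if lengths.nunique() != 1:
    raise ValueError("tfidf lists have different lengths")

# iloc[0] is positional, so it works even if the index label 0 is gone
# (for example after filtering or shuffling the frame).
n_cols = int(lengths.iloc[0])
X = np.array(df["tfidf"].explode().values, dtype="float").reshape(len(df), n_cols)
print(X.shape)
```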
CodePudding user response:
In the first display
corpus tfidf labels
0 dfnkdfnkf asdfhedfh ajdladja [0.0, 0.0, 0.0, 0.01, 0.8] 60
...
the values in the tfidf column are Python lists (though strings and NumPy arrays display the same).
df[col].values
produces a 1-D object-dtype array containing these lists.
x = np.stack(df[col].values)
turns that into a 2-D float-dtype array, provided every list has the same length.
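To make the difference between the two arrays concrete, here is a small sketch that rebuilds a frame like the one in the question (the frame is reconstructed from the displayed rows, so treat it as an assumption about the real data) and compares dtype and shape before and after np.stack:

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the question's display; values are illustrative.
df = pd.DataFrame({
    "corpus": ["dfnkdfnkf asdfhedfh ajdladja"] * 3,
    "tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8]] * 3,
    "labels": [60, 73, 61],
})

raw = df["tfidf"].values   # 1-D object array, each element is a list
x = np.stack(raw)          # 2-D float array, one row per sample
y = df["labels"].values

print(raw.dtype, raw.shape)   # object (3,)
print(x.dtype, x.shape)       # float64 (3, 5)
```

The stacked x, together with y, is the kind of input that train_test_split and DecisionTreeClassifier.fit from the question expect: one row of numeric features per sample.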
The second, "unlisted" display is not valid for list elements; you would only see it if the cells were strings and you stripped the [] off them.
corpus tfidf labels
0 dfnkdfnkf asdfhedfh ajdladja 0.0, 0.0, 0.0, 0.01, 0.8 60
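If the cells do turn out to be strings (which can happen when the frame is read back from a CSV file, for instance), stripping the brackets is not needed for training. A possible sketch, assuming each string is a valid Python list literal, parses them with ast.literal_eval and then stacks as before; the frame here is hypothetical:

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical case: each tfidf cell is a string that looks like a list.
df = pd.DataFrame({
    "tfidf": ["[0.0, 0.0, 0.0, 0.01, 0.8]"] * 3,
    "labels": [60, 73, 61],
})

print(type(df.loc[0, "tfidf"]).__name__)   # str

# literal_eval safely parses the string into a real Python list.
parsed = df["tfidf"].apply(ast.literal_eval)
x = np.stack(parsed.values)

print(x.shape)   # (3, 5)
```

Checking type(df.loc[0, 'tfidf']) first tells you which case you are in, since lists, strings and arrays all print the same way in the frame display.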
Dataframes with list or array elements are something of an anomaly, and many beginner users aren't prepared to deal with them. Frames are easiest to work with when the cell values are strings or numbers. But even strings are stored as Python objects.