Currently I am trying to pass a 2D matrix into the sklearn OneHotEncoder. Whenever I try to pass the matrix I get this error:
Encoders require their input to be uniformly strings or numbers. Got ['list']
After a bit of investigation, I see that the array I am passing prints as:
[list(['e2', 'e4', 'e5']) list(['e1', 'e2', 'e3', 'e4'])
list(['e1', 'e2']) list(['e1', 'e2', 'e3', 'e4', 'e5'])
list(['e1', 'e2', 'e3', 'e4', 'e5'])
list(['e1', 'e2', 'e3', 'e4', 'e5', 'e6'])]
As you can see, this is not a plain 2D matrix: the outer array looks correct, but each inner element is wrapped in list(). How can I fix this?
Below is the code I use to get the IDS column from the pandas dataframe:
arr = np.asarray(result['IDS'], dtype=object)
CodePudding user response:
Using a copy-n-paste from your question:
In [239]: [list(['e2', 'e4', 'e5']), list(['e1', 'e2', 'e3', 'e4']),
...: list(['e1', 'e2']), list(['e1', 'e2', 'e3', 'e4', 'e5']),
...: list(['e1', 'e2', 'e3', 'e4', 'e5']),
...: list(['e1', 'e2', 'e3', 'e4', 'e5', 'e6'])]
Out[239]:
[['e2', 'e4', 'e5'],
['e1', 'e2', 'e3', 'e4'],
['e1', 'e2'],
['e1', 'e2', 'e3', 'e4', 'e5'],
['e1', 'e2', 'e3', 'e4', 'e5'],
['e1', 'e2', 'e3', 'e4', 'e5', 'e6']]
In [240]: np.array(_)
<ipython-input-240-7a2cd91c32ca>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
np.array(_)
Out[240]:
array([list(['e2', 'e4', 'e5']), list(['e1', 'e2', 'e3', 'e4']),
list(['e1', 'e2']), list(['e1', 'e2', 'e3', 'e4', 'e5']),
list(['e1', 'e2', 'e3', 'e4', 'e5']),
list(['e1', 'e2', 'e3', 'e4', 'e5', 'e6'])], dtype=object)
I assume you used the object dtype because you got this 'ragged' warning:
np.asarray(result['IDS'], dtype=object)
And I assume result['IDS'] looks a lot like Out[239], a list of lists that vary in length. Or rather result is a dataframe, and this is a Series, a column of the dataframe.
You might want to show result or result['IDS']. I can guess what it looks like.
What kind of 2d array were you expecting? With component lists that vary from 2 to 6 elements, there's no way you can make a 2d array!
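To make that concrete, here is a small sketch using the same lists as in the question. It shows that the object array is 1-D (each element is a Python list), and that a real 2D array only appears once every row has the same length. Padding with an empty string is purely an assumption for illustration; it may well not be what you want to feed to an encoder.

```python
import numpy as np

rows = [['e2', 'e4', 'e5'],
        ['e1', 'e2', 'e3', 'e4'],
        ['e1', 'e2'],
        ['e1', 'e2', 'e3', 'e4', 'e5'],
        ['e1', 'e2', 'e3', 'e4', 'e5'],
        ['e1', 'e2', 'e3', 'e4', 'e5', 'e6']]

# Ragged input with dtype=object: a 1-D array whose elements are lists
ragged = np.array(rows, dtype=object)
print(ragged.shape, type(ragged[0]))

# Hypothetical fix-up: pad every row to the longest length with ''
# (the fill value is an assumption, not something from the question)
width = max(len(r) for r in rows)
padded = np.array([r + [''] * (width - len(r)) for r in rows])
print(padded.shape, padded.dtype)
```

The first array has one dimension of length 6; only the padded version has two dimensions.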
Making a Series:
In [243]: S = pd.Series(Out[239])
In [244]: S
Out[244]:
0 [e2, e4, e5]
1 [e1, e2, e3, e4]
2 [e1, e2]
3 [e1, e2, e3, e4, e5]
4 [e1, e2, e3, e4, e5]
5 [e1, e2, e3, e4, e5, e6]
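If what you are really after is one indicator column per label (a "multi-hot" matrix), a Series like S can be turned into a rectangular 0/1 table with pandas alone. This is only a sketch under that assumption about your goal; the names exploded and indicator are mine. scikit-learn's MultiLabelBinarizer is aimed at the same kind of list-of-labels input and may be a better fit than OneHotEncoder here.

```python
import pandas as pd

S = pd.Series([['e2', 'e4', 'e5'],
               ['e1', 'e2', 'e3', 'e4'],
               ['e1', 'e2'],
               ['e1', 'e2', 'e3', 'e4', 'e5'],
               ['e1', 'e2', 'e3', 'e4', 'e5'],
               ['e1', 'e2', 'e3', 'e4', 'e5', 'e6']])

# One row per label; the original row number is repeated in the index
exploded = S.explode()

# Dummy-encode each label, then collapse back to one row per original list
indicator = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(indicator)

# A genuine 2D numeric array, one column per distinct label
arr = indicator.to_numpy()
print(arr.shape)
```

Each row of indicator has a 1 in the column of every label that appears in the corresponding list, so the result is 2D no matter how long the individual lists are.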