I want to one-hot encode the variables of my dataset. My code is raising TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0].
Dataframe
print(df.head())
   country  year     sex          age  suicides_no  population  \
0  Albania  1987    male  15-24 years           21      312900
1  Albania  1987    male  35-54 years           16      308000
2  Albania  1987  female  15-24 years           14      289700
3  Albania  1987    male    75+ years            1       21800
4  Albania  1987    male  25-34 years            9      274300

   suicides/100k pop country-year  HDI for year  gdp_for_year ($)  \
0               6.71  Albania1987           NaN      2.156625e+09
1               5.19  Albania1987           NaN      2.156625e+09
2               4.83  Albania1987           NaN      2.156625e+09
3               4.59  Albania1987           NaN      2.156625e+09
4               3.28  Albania1987           NaN      2.156625e+09

   gdp_per_capita ($)       generation
0                 796     Generation X
1                 796           Silent
2                 796     Generation X
3                 796  G.I. Generation
4                 796          Boomers
Code
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
df['year_label'] = ohe.fit_transform(df['year'].to_numpy().reshape(-1, 1))
df['year_label'].unique()
Traceback
> --------------------------------------------------------------------------- TypeError Traceback (most recent call
> last) /tmp/ipykernel_6768/3587352959.py in <module>
> 1 # One-hot encoding
> 2 ohe = OneHotEncoder()
> ----> 3 df['year_label'] = ohe.fit_transform(df['year'].to_numpy().reshape(-1, 1))
> 4 df['year_label'].unique()
> 5 df['sex_label'] = ohe.fit_transform(df['sex'])
>
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py
> in __setitem__(self, key, value) 3610 else: 3611
> # set column
> -> 3612 self._set_item(key, value) 3613 3614 def _setitem_slice(self, key: slice, value):
>
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py
> in _set_item(self, key, value) 3782 ensure homogeneity.
> 3783 """
> -> 3784 value = self._sanitize_column(value) 3785 3786 if (
>
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py
> in _sanitize_column(self, value) 4507 4508 if
> is_list_like(value):
> -> 4509 com.require_length_match(value, self.index) 4510 return sanitize_array(value, self.index, copy=True,
> allow_2d=True) 4511
>
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/common.py
> in require_length_match(data, index)
> 528 Check the length of data matches the length of the index.
> 529 """
> --> 530 if len(data) != len(index):
> 531 raise ValueError(
> 532 "Length of values "
>
> ~/anaconda3/envs/tf/lib/python3.9/site-packages/scipy/sparse/base.py
> in __len__(self)
> 289 # non-zeros is more important. For now, raise an exception!
> 290 def __len__(self):
> --> 291 raise TypeError("sparse matrix length is ambiguous; use getnnz()"
> 292 " or shape[0]")
> 293
>
> TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
CodePudding user response:
There is a simple way to one-hot encode variables in pandas: pandas.get_dummies. It works as follows:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])
Output:
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1
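Called on a whole DataFrame, get_dummies keeps the columns it does not encode (C above). If you encode only a subset of columns, you can join the result back onto your DataFrame, for example with pd.concat(..., axis=1).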
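Applied to data shaped like the question's, that could look like the sketch below. The three-row DataFrame is a made-up miniature of the asker's data, and the choice to encode year and sex is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the question's DataFrame (three columns only).
df = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania"],
    "year": [1987, 1987, 1988],
    "sex": ["male", "female", "male"],
})

# Encode only the selected columns; dtype=int yields 0/1 rather than booleans.
dummies = pd.get_dummies(df[["year", "sex"]], columns=["year", "sex"], dtype=int)

# Both frames share the same index, so concatenating along axis=1 aligns the rows.
result = pd.concat([df, dummies], axis=1)
print(result)
```

Passing columns= explicitly matters for year: it is numeric, so get_dummies would otherwise leave it unencoded.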
CodePudding user response:
Make a simple dataframe:
In [20]: x = np.array([1987,1987, 1986, 1985])
In [21]: df = pd.DataFrame(x[:,None], columns=['x'])
In [22]: df
Out[22]:
x
0 1987
1 1987
2 1986
3 1985
In [23]: one=OneHotEncoder()
In [24]: one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Out[24]:
<4x3 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
The encoder, one, has returned a scipy.sparse matrix, as documented.
Trying to assign that result to a dataframe column produces your error:
In [25]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Traceback (most recent call last):
File "<ipython-input-25-b30a637ba61b>", line 1, in <module>
df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 3612, in __setitem__
self._set_item(key, value)
...
See the line with pandas _set_item. That's the assignment operation.
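The question's traceback ends in scipy's __len__, which pandas calls while checking that the new column matches the index length. A minimal sketch of that failure, independent of pandas and scikit-learn (the 4x3 matrix here is an arbitrary stand-in, not the encoder's actual output):

```python
import numpy as np
from scipy import sparse

# Arbitrary 4x3 sparse matrix standing in for the encoder's result.
m = sparse.csr_matrix(np.eye(4, 3))

# len() is deliberately unsupported on sparse matrices.
try:
    len(m)
    len_error = None
except TypeError as exc:
    len_error = str(exc)

n_rows = m.shape[0]   # the unambiguous way to ask for the row count
dense = m.toarray()   # a regular NumPy array, which does support len()

print(len_error)
print(n_rows, len(dense))
```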
We can tell OneHotEncoder to return a dense NumPy array (in scikit-learn 1.2 and later this parameter is named sparse_output):
In [27]: one=OneHotEncoder(sparse=False)
In [28]: one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Out[28]:
array([[0., 0., 1.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.]])
However, trying to assign that to one column of the dataframe still produces an error. The array has 3 columns, one for each unique value.
In [29]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'new'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 3751, in _set_item_mgr
loc = self._info_axis.get_loc(key)
...
ValueError: Wrong number of items passed 3, placement implies 1
But it does work if I convert the array to a list of lists. It now puts one list in each cell of the new column:
In [41]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1)).tolist()
In [42]: df
Out[42]:
x new
0 1987 [0.0, 0.0, 1.0]
1 1987 [0.0, 0.0, 1.0]
2 1986 [0.0, 1.0, 0.0]
3 1985 [1.0, 0.0, 0.0]
There's probably a pandas method for splitting those lists into separate columns, but I'm not enough of a pandas expert.
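One possible way to get separate columns, sketched here as a suggestion rather than as the answerer's method, is to skip the list column entirely: densify the encoder output, wrap it in its own DataFrame, and concatenate. The x_ column prefix is an arbitrary naming choice for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"x": [1987, 1987, 1986, 1985]})

one = OneHotEncoder()
encoded = one.fit_transform(df[["x"]])  # scipy sparse matrix by default

# toarray() densifies the result without relying on the sparse/sparse_output flag.
dense = encoded.toarray()

# categories_ holds the sorted unique values seen for each input column.
names = [f"x_{value}" for value in one.categories_[0]]

# Reuse df's index so the rows line up when concatenating.
encoded_df = pd.DataFrame(dense, columns=names, index=df.index).astype(int)
result = pd.concat([df, encoded_df], axis=1)
print(result)
```

If the lists are already stored in a column, as in Out[42], pd.DataFrame(df['new'].tolist(), index=df.index) expands them into columns in the same way.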