Python TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

I want to one-hot encode the variables of my dataset. My code is raising TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0].

Dataframe

print(df.head())
   country  year     sex          age  suicides_no  population  \
0  Albania  1987    male  15-24 years           21      312900   
1  Albania  1987    male  35-54 years           16      308000   
2  Albania  1987  female  15-24 years           14      289700   
3  Albania  1987    male    75+ years            1       21800   
4  Albania  1987    male  25-34 years            9      274300   

   suicides/100k pop country-year  HDI for year   gdp_for_year ($)   \
0               6.71  Albania1987           NaN        2.156625e+09   
1               5.19  Albania1987           NaN        2.156625e+09   
2               4.83  Albania1987           NaN        2.156625e+09   
3               4.59  Albania1987           NaN        2.156625e+09   
4               3.28  Albania1987           NaN        2.156625e+09   

   gdp_per_capita ($)       generation  
0                 796     Generation X  
1                 796           Silent  
2                 796     Generation X  
3                 796  G.I. Generation  
4                 796          Boomers

Code

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
df['year_label'] = ohe.fit_transform(df['year'].to_numpy().reshape(-1, 1))
df['year_label'].unique()

Traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_6768/3587352959.py in <module>
      1 # One-hot encoding
      2 ohe = OneHotEncoder()
----> 3 df['year_label'] = ohe.fit_transform(df['year'].to_numpy().reshape(-1, 1))
      4 df['year_label'].unique()
      5 df['sex_label'] = ohe.fit_transform(df['sex'])

~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3610         else:
   3611             # set column
-> 3612             self._set_item(key, value)
   3613
   3614     def _setitem_slice(self, key: slice, value):

~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3782         ensure homogeneity.
   3783         """
-> 3784         value = self._sanitize_column(value)
   3785
   3786         if (

~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py in _sanitize_column(self, value)
   4507
   4508         if is_list_like(value):
-> 4509             com.require_length_match(value, self.index)
   4510         return sanitize_array(value, self.index, copy=True, allow_2d=True)
   4511

~/anaconda3/envs/tf/lib/python3.9/site-packages/pandas/core/common.py in require_length_match(data, index)
    528     Check the length of data matches the length of the index.
    529     """
--> 530     if len(data) != len(index):
    531         raise ValueError(
    532             "Length of values "

~/anaconda3/envs/tf/lib/python3.9/site-packages/scipy/sparse/base.py in __len__(self)
    289     # non-zeros is more important.  For now, raise an exception!
    290     def __len__(self):
--> 291         raise TypeError("sparse matrix length is ambiguous; use getnnz()"
    292                         " or shape[0]")
    293

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

CodePudding user response：

A simpler way to one-hot encode columns in pandas is pandas.get_dummies, which returns a DataFrame rather than a sparse matrix.

For example:

import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])

Output:

   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1

Then you can simply merge the result with your DataFrame.
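As a rough sketch of that merge step: the small frame below is only a stand-in for the question's data (its values are made up for illustration). It shows two options, letting get_dummies encode selected columns in place via its columns argument, or encoding one column separately and attaching it with pd.concat.

```python
import pandas as pd

# Illustrative stand-in for the question's DataFrame (values are made up).
df = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania"],
    "year": [1987, 1987, 1988],
    "sex": ["male", "female", "male"],
})

# Option 1: encode only the listed columns; other columns pass through.
encoded = pd.get_dummies(df, columns=["year", "sex"])

# Option 2: encode one column on its own, then attach it to the original.
year_dummies = pd.get_dummies(df["year"], prefix="year")
merged = pd.concat([df, year_dummies], axis=1)

print(encoded.columns.tolist())
print(merged.columns.tolist())
```

Option 1 drops the original year and sex columns; option 2 keeps them alongside the new indicator columns.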

CodePudding user response：

Make a simple dataframe (with numpy imported as np, pandas as pd, and OneHotEncoder imported from sklearn.preprocessing):

In [20]: x = np.array([1987,1987, 1986, 1985])
In [21]: df = pd.DataFrame(x[:,None], columns=['x'])
In [22]: df
Out[22]: 
      x
0  1987
1  1987
2  1986
3  1985
In [23]: one=OneHotEncoder()
In [24]: one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Out[24]: 
<4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>

The encoder has returned a scipy.sparse matrix, as documented.

Trying to assign that result to a dataframe column produces your error:

In [25]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Traceback (most recent call last):
  File "<ipython-input-25-b30a637ba61b>", line 1, in <module>
    df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 3612, in __setitem__
    self._set_item(key, value)
...

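See the line with pandas _set_item. That's the assignment operation. As the traceback in the question shows, pandas then calls len() on the assigned value, and a sparse matrix refuses to report a length.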
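The failing step can be reproduced without pandas. This is a minimal sketch, assuming scikit-learn's default sparse output: it calls len() on the encoder's result, then uses the alternatives the error message suggests (shape[0] and getnnz()).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

years = np.array([1987, 1987, 1986, 1985]).reshape(-1, 1)
encoded = OneHotEncoder().fit_transform(years)  # sparse by default

# This is the call pandas makes internally when checking the column length.
try:
    len(encoded)
except TypeError as exc:
    print("len() failed:", exc)

# Unambiguous alternatives named in the error message.
print("rows:", encoded.shape[0])
print("stored non-zeros:", encoded.getnnz())

# Densify explicitly when an ordinary NumPy array is needed.
dense = encoded.toarray()
print(dense.shape)
```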

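We can tell OneHotEncoder to return a dense NumPy array (in scikit-learn 1.2 and later this parameter is named sparse_output; the sparse spelling used here was removed in 1.4):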

In [27]: one=OneHotEncoder(sparse=False)
In [28]: one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Out[28]: 
array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

However, trying to assign that to one column of the dataframe still produces an error. The array has 3 columns, one for each unique value.

In [29]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'new'

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 3751, in _set_item_mgr
    loc = self._info_axis.get_loc(key)
  ...
ValueError: Wrong number of items passed 3, placement implies 1

But it does work if I convert the array to a list of lists. It now puts one list in each cell of the new column:

In [41]: df['new'] = one.fit_transform(df['x'].to_numpy().reshape(-1,1)).tolist()
In [42]: df
Out[42]: 
      x              new
0  1987  [0.0, 0.0, 1.0]
1  1987  [0.0, 0.0, 1.0]
2  1986  [0.0, 1.0, 0.0]
3  1985  [1.0, 0.0, 0.0]

There's probably a pandas method for splitting those lists into separate columns, but I'm not enough of a pandas expert.
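One way to get separate columns, sketched here rather than offered as the definitive approach, is to skip the list-of-lists step. Wrap the dense array in its own DataFrame and concatenate it with the original. The column names (x_1985 and so on) are an illustrative choice built from the encoder's categories_ attribute.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"x": [1987, 1987, 1986, 1985]})

one = OneHotEncoder()
# .toarray() densifies the sparse result, whichever keyword
# (sparse or sparse_output) the installed scikit-learn version expects.
dense = one.fit_transform(df[["x"]]).toarray()

# Name each new column after the category it represents (naming is illustrative).
names = [f"x_{c}" for c in one.categories_[0]]
onehot = pd.DataFrame(dense, columns=names, index=df.index)

# Reusing df.index keeps the rows aligned when concatenating.
result = pd.concat([df, onehot], axis=1)
print(result)
```

If the lists are already stored in a column, as in Out[42], pd.DataFrame(df['new'].tolist(), index=df.index) should expand them into columns in much the same way.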