How do I pass in a 1d array to sklearn's LabelEncoder?

Time:12-04

I'm following along with an Uber-Lyft price prediction notebook on Kaggle, but I'm trying to use Polars instead of pandas.

In cell 43, where they use sklearn's LabelEncoder, they have the following loop, which appears to iterate over every feature except price and encode it:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

df_cat_encode = df_cat.copy()
for col in df_cat_encode.select_dtypes(include='O').columns:
    df_cat_encode[col] = le.fit_transform(df_cat_encode[col])

The data being passed through looks like this:

source            destination    cab_type  name          short_summary  icon                 price
Haymarket Square  North Station  Lyft      Shared        Mostly Cloudy  partly-cloudy-night  5.0
Haymarket Square  North Station  Lyft      Lux           Rain           rain                 11.0
Haymarket Square  North Station  Lyft      Lyft          Clear          clear-night          7.0
Haymarket Square  North Station  Lyft      Lux Black XL  Clear          clear-night          26.0

and the label encoded result looks like this:

637975 rows x 7 columns

source  destination  cab_type  name  short_summary  icon  price
5       7            0         7     4              5     5.0
5       7            0         2     8              6     11.0
5       7            0         5     0              1     7.0
5       7            0         4     6              1     26.0
...     ...          ...       ...   ...            ...   ...

The problem arises when I try to build the same loop with Polars syntax, like this:

for col in df_cat_encode.select(["source","destination","cab_type","name","short_summary","icon"]).columns:
    df_cat_encode.with_column(le.fit_transform(col))

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/preprocessing/_label.py", line 115, in fit_transform
    y = column_or_1d(y, warn=True)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1038, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape () instead.

What am I doing wrong, and how can I fix this?

Answer:

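The error occurs because col in your loop is only the column name (a string), not the column's data, so LabelEncoder receives a 0-dimensional value instead of a 1d array. It needs the values themselves, e.g. le.fit_transform(df_cat_encode[col].to_numpy()).

That said, this encoding looks like the equivalent of a "dense" ranking.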

>>> df_cat_encode
   source  destination  cab_type  name  short_summary  icon  price
0       0            0         0     3              1     1      5.0
1       0            0         0     0              2     2      11.0
2       0            0         0     2              0     0      7.0
3       0            0         0     1              0     0      26.0

You can do this in Polars using .rank():

>>> df.with_columns(pl.all().exclude("price").rank(method="dense") - 1)
shape: (4, 7)
┌────────┬─────────────┬──────────┬──────┬───────────────┬──────┬───────┐
│ source ┆ destination ┆ cab_type ┆ name ┆ short_summary ┆ icon ┆ price │
│ ---    ┆ ---         ┆ ---      ┆ ---  ┆ ---           ┆ ---  ┆ ---   │
│ u32    ┆ u32         ┆ u32      ┆ u32  ┆ u32           ┆ u32  ┆ f64   │
╞════════╪═════════════╪══════════╪══════╪═══════════════╪══════╪═══════╡
│ 0      ┆ 0           ┆ 0        ┆ 3    ┆ 1             ┆ 1    ┆ 5.0   │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 0    ┆ 2             ┆ 2    ┆ 11.0  │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 2    ┆ 0             ┆ 0    ┆ 7.0   │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 1    ┆ 0             ┆ 0    ┆ 26.0  │
└────────┴─────────────┴──────────┴──────┴───────────────┴──────┴───────┘
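
To see why a dense rank minus one should line up with label encoding, here is a rough pure-Python sketch (no sklearn or Polars involved). The helper names label_encode and dense_rank_minus_one are made up for illustration, and it assumes the encoder assigns codes by sorted order of the unique values, and that plain string comparison orders these ASCII labels the same way the libraries do.

```python
def label_encode(values):
    """Map each value to its index among the sorted unique values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values]


def dense_rank_minus_one(values):
    """Dense rank (ties share a rank, no gaps), shifted to start at 0."""
    distinct = set(values)
    # The 0-based dense rank of v is the number of distinct values smaller than v.
    return [sum(1 for u in distinct if u < v) for v in values]


# The four sample rows from the question (hypothetical variable names).
names = ["Shared", "Lux", "Lyft", "Lux Black XL"]
summaries = ["Mostly Cloudy", "Rain", "Clear", "Clear"]

print(label_encode(names), dense_rank_minus_one(names))
print(label_encode(summaries), dense_rank_minus_one(summaries))
```

On these four rows both helpers give the same codes as the name and short_summary columns in the output above. On the full dataset the codes differ from this four-row sample simply because there are more distinct values to rank.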