How do I pass in a 1d array to sklearn's LabelEncoder?

I'm following along with an Uber-Lyft price prediction notebook on Kaggle, but I'm trying to use Polars instead of pandas.

In cell 43 where they use sklearn's LabelEncoder, they have the following loop that appears to loop through each feature, except for price, and encodes it:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

df_cat_encode= df_cat.copy()
for col in df_cat_encode.select_dtypes(include='O').columns:
    df_cat_encode[col]=le.fit_transform(df_cat_encode[col])

The data being passed through looks like this:

source	destination	cab_type	name	short_summary	icon	price
Haymarket Square	North Station	Lyft	Shared	Mostly Cloudy	partly-cloudy-night	5.0
Haymarket Square	North Station	Lyft	Lux	Rain	rain	11.0
Haymarket Square	North Station	Lyft	Lyft	Clear	clear-night	7.0
Haymarket Square	North Station	Lyft	Lux Black XL	Clear	clear-night	26.0

and the label encoded result looks like this:

637975 rows x 7 columns

source	destination	cab_type	name	short_summary	icon	price
5	7	0	7	4	5	5.0
5	7	0	2	8	6	11.0
5	7	0	5	0	1	7.0
5	7	0	4	6	1	26.0
...	...	...	...	...	...	...

The problem I'm having is that when I try to build the same loop with Polars syntax, like this:

for col in df_cat_encode.select(["source","destination","cab_type","name","short_summary","icon"]).columns:
    df_cat_encode.with_column(le.fit_transform(col))

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/preprocessing/_label.py", line 115, in fit_transform
    y = column_or_1d(y, warn=True)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1038, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape () instead.

What am I doing wrong, and how can I fix this?

Answer:

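It looks like this encoding is the equivalent of a "dense" ranking that starts at 0: LabelEncoder numbers the sorted distinct values of each column. On the four sample rows from the question, the pandas loop gives: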

>>> df_cat_encode
   source  destination  cab_type  name  short_summary  icon  price
0       0            0         0     3              1     1      5.0
1       0            0         0     0              2     2      11.0
2       0            0         0     2              0     0      7.0
3       0            0         0     1              0     0      26.0
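
To check that equivalence by hand, here is a small dependency-free sketch. The helper name `dense_rank_codes` is made up for illustration and is not part of sklearn or Polars; the two lists are the `name` and `short_summary` values from the sample rows in the question.

```python
def dense_rank_codes(values):
    """Return each value's 0-based position among the sorted distinct values."""
    ordered = sorted(set(values))
    code_of = {value: code for code, value in enumerate(ordered)}
    return [code_of[value] for value in values]


# Sample values copied from the question's first four rows.
names = ["Shared", "Lux", "Lyft", "Lux Black XL"]
summaries = ["Mostly Cloudy", "Rain", "Clear", "Clear"]

print(dense_rank_codes(names))      # [3, 0, 2, 1]
print(dense_rank_codes(summaries))  # [1, 2, 0, 0]
```

These match the `name` and `short_summary` columns above, and they are exactly a 1-based dense rank minus 1.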

You can compute a dense ranking in Polars using .rank():

>>> df.with_columns(pl.all().exclude("price").rank(method="dense") - 1)
shape: (4, 7)
┌────────┬─────────────┬──────────┬──────┬───────────────┬──────┬───────┐
│ source ┆ destination ┆ cab_type ┆ name ┆ short_summary ┆ icon ┆ price │
│ ---    ┆ ---         ┆ ---      ┆ ---  ┆ ---           ┆ ---  ┆ ---   │
│ u32    ┆ u32         ┆ u32      ┆ u32  ┆ u32           ┆ u32  ┆ f64   │
╞════════╪═════════════╪══════════╪══════╪═══════════════╪══════╪═══════╡
│ 0      ┆ 0           ┆ 0        ┆ 3    ┆ 1             ┆ 1    ┆ 5.0   │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 0    ┆ 2             ┆ 2    ┆ 11.0  │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 2    ┆ 0             ┆ 0    ┆ 7.0   │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 1    ┆ 0             ┆ 0    ┆ 26.0  │
└────────┴─────────────┴──────────┴──────┴───────────────┴──────┴───────┘