I'm following along with an Uber-Lyft price prediction notebook on Kaggle, but I'm trying to use Polars instead of pandas.
In cell 43, where they use sklearn's LabelEncoder, there is a loop that appears to go through each feature except price and encode it:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df_cat_encode = df_cat.copy()
for col in df_cat_encode.select_dtypes(include='O').columns:
    df_cat_encode[col] = le.fit_transform(df_cat_encode[col])
The data being passed through looks like this:
source | destination | cab_type | name | short_summary | icon | price |
---|---|---|---|---|---|---|
Haymarket Square | North Station | Lyft | Shared | Mostly Cloudy | partly-cloudy-night | 5.0 |
Haymarket Square | North Station | Lyft | Lux | Rain | rain | 11.0 |
Haymarket Square | North Station | Lyft | Lyft | Clear | clear-night | 7.0 |
Haymarket Square | North Station | Lyft | Lux Black XL | Clear | clear-night | 26.0 |
and the label encoded result looks like this:
637975 rows x 7 columns
source | destination | cab_type | name | short_summary | icon | price |
---|---|---|---|---|---|---|
5 | 7 | 0 | 7 | 4 | 5 | 5.0 |
5 | 7 | 0 | 2 | 8 | 6 | 11.0 |
5 | 7 | 0 | 5 | 0 | 1 | 7.0 |
5 | 7 | 0 | 4 | 6 | 1 | 26.0 |
... | ... | ... | ... | ... | ... | ... |
The problem I'm having is when I try to build the same loop with Polars syntax, like this:
for col in df_cat_encode.select(["source","destination","cab_type","name","short_summary","icon"]).columns:
    df_cat_encode.with_column(le.fit_transform(col))
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/preprocessing/_label.py", line 115, in fit_transform
    y = column_or_1d(y, warn=True)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1038, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape () instead.
What am I doing wrong, and how can I fix this?
Answer:
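The error occurs because le.fit_transform(col) receives the column name (a string), not the column's data.
As for the encoding itself, it looks like the equivalent of a "dense" ranking: each value is replaced by its position among the sorted unique values of its column. On the four sample rows, the pandas loop gives: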
>>> df_cat_encode
source destination cab_type name short_summary icon price
0 0 0 0 3 1 1 5.0
1 0 0 0 0 2 2 11.0
2 0 0 0 2 0 0 7.0
3 0 0 0 1 0 0 26.0
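To make the "dense ranking" claim concrete, here is a minimal pure-Python sketch of the mapping. The helper name label_encode is hypothetical and is not part of sklearn or Polars; it only mimics the behaviour described above for a single column of strings.

```python
def label_encode(values):
    """Replace each value by its index among the sorted unique values.

    Hypothetical helper for illustration: this is the same as a dense
    rank (ties share a rank, no gaps) shifted to start at 0.
    """
    classes = sorted(set(values))                 # sorted unique values
    index = {v: i for i, v in enumerate(classes)}  # value -> code
    return [index[v] for v in values]


# The "name" and "short_summary" columns from the four sample rows.
names = ["Shared", "Lux", "Lyft", "Lux Black XL"]
summaries = ["Mostly Cloudy", "Rain", "Clear", "Clear"]

name_codes = label_encode(names)
summary_codes = label_encode(summaries)
print(name_codes)     # [3, 0, 2, 1]
print(summary_codes)  # [1, 2, 0, 0]
```

These codes match the name and short_summary columns in the output above.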
You can compute the same dense ranking in Polars using .rank():
>>> df.with_columns(pl.all().exclude("price").rank(method="dense") - 1)
shape: (4, 7)
┌────────┬─────────────┬──────────┬──────┬───────────────┬──────┬───────┐
│ source ┆ destination ┆ cab_type ┆ name ┆ short_summary ┆ icon ┆ price │
│ ---    ┆ ---         ┆ ---      ┆ ---  ┆ ---           ┆ ---  ┆ ---   │
│ u32    ┆ u32         ┆ u32      ┆ u32  ┆ u32           ┆ u32  ┆ f64   │
╞════════╪═════════════╪══════════╪══════╪═══════════════╪══════╪═══════╡
│ 0      ┆ 0           ┆ 0        ┆ 3    ┆ 1             ┆ 1    ┆ 5.0   │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 0    ┆ 2             ┆ 2    ┆ 11.0  │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 2    ┆ 0             ┆ 0    ┆ 7.0   │
├────────┼─────────────┼──────────┼──────┼───────────────┼──────┼───────┤
│ 0      ┆ 0           ┆ 0        ┆ 1    ┆ 0             ┆ 0    ┆ 26.0  │
└────────┴─────────────┴──────────┴──────┴───────────────┴──────┴───────┘