Transform string data into numbers (where the same string should always get the same number), with the ability

I would like to build a machine learning solution that predicts upcoming sales per product.

The dataset contains thousands of products, each represented as a string (e.g., ‘Product_1_12345’).

Since the product information is essential for the modelling (I would like to forecast at product level), I tried different approaches, among others creating dummy variables.

However, since this produced too many columns, I am exploring an alternative. What I would like to have:

Original_Product_ID      New_Product_ID
Product1_ABC                1
Product4_ABC                2
Product1_ABC                1
Another_Product             3
Product4_ABC                2

The goal is to assign a number to each unique string. If the same product appears again later, it should get the same number.

Later on, I would like to convert the numbers back to the original product IDs.

Does anyone know how to do this? A hand-written dictionary doesn’t look like a solution, since I need to fill it in automatically (and I have thousands of products).

CodePudding user response：

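To do the conversion you can use pandas.factorize:

This function returns both the integer codes (the "factors") as a NumPy array and the unique IDs, in order of first appearance, so that the code of each ID is its position in that list.

You can save both and use the list of unique IDs to map the codes back to the original IDs later on: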

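import pandas as pd

# factors: one integer code per row; ids: unique product IDs in order of first appearance
factors, ids = pd.factorize(df['Original_Product_ID'])

df['New_Product_ID'] = factors

# map the codes back to the original IDs
df['Original_ID_from_factor'] = df['New_Product_ID'].map(dict(enumerate(ids)))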

output:

  Original_Product_ID  New_Product_ID Original_ID_from_factor
0        Product1_ABC               0            Product1_ABC
1        Product4_ABC               1            Product4_ABC
2        Product1_ABC               0            Product1_ABC
3     Another_Product               2         Another_Product
4        Product4_ABC               1            Product4_ABC
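
Note that the codes above start at 0, while the table in the question starts at 1. Below is a minimal, self-contained sketch of the same idea with the codes shifted by one, plus one possible way to reuse the saved IDs on data that arrives later. The `new_batch` series and the `Brand_New_Product` name are hypothetical, made up for illustration. Using `pd.Categorical` for the later batch is one option among several and is not part of the answer above.

```python
import pandas as pd

# Sample data taken from the example table in the question.
df = pd.DataFrame({
    "Original_Product_ID": [
        "Product1_ABC", "Product4_ABC", "Product1_ABC",
        "Another_Product", "Product4_ABC",
    ]
})

# codes: one integer per row; ids: Index of unique values in order of
# first appearance, so ids[k] is the product with code k.
codes, ids = pd.factorize(df["Original_Product_ID"])
df["New_Product_ID"] = codes + 1  # shift so numbering starts at 1
print(df["New_Product_ID"].tolist())  # [1, 2, 1, 3, 2]

# Decode: undo the shift and index into the saved unique IDs.
decoded = ids[df["New_Product_ID"].to_numpy() - 1]

# Hypothetical later batch: reusing the saved ids as categories keeps the
# numbers of known products stable. Products not in ids get code -1 from
# pd.Categorical, which becomes 0 after the shift.
new_batch = pd.Series(["Product4_ABC", "Brand_New_Product", "Product1_ABC"])
new_codes = pd.Categorical(new_batch, categories=ids).codes + 1
print(new_codes.tolist())  # [2, 0, 1]
```

In this sketch a result of 0 marks a product that was not in the original data. To give such products their own numbers, the saved IDs would have to be extended first.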