I want to label-encode subgroups in a pandas DataFrame, i.e. number the names separately within each category. Something like this:
| Category   | Name   |
| ---------- | ------ |
| FRUITS     | Apple  |
| FRUITS     | Orange |
| FRUITS     | Apple  |
| Vegetables | Onion  |
| Vegetables | Garlic |
| Vegetables | Garlic |
to
| Category   | Name   | Label |
| ---------- | ------ | ----- |
| FRUITS     | Apple  | 1     |
| FRUITS     | Orange | 2     |
| FRUITS     | Apple  | 1     |
| Vegetables | Onion  | 1     |
| Vegetables | Garlic | 2     |
| Vegetables | Garlic | 2     |
CodePudding user response:
Try grouping by "Category", then grouping each of those groups by "Name" and calling .ngroup():
# ngroup() numbers the groups from 0, so add 1 to start the labels at 1
df["Label"] = (
    df.groupby("Category")
    .apply(lambda x: x.groupby("Name", sort=False).ngroup() + 1)
    .values
)
print(df)
Prints:
     Category    Name  Label
0      FRUITS   Apple      1
1      FRUITS  Orange      2
2      FRUITS   Apple      1
3  Vegetables   Onion      1
4  Vegetables  Garlic      2
5  Vegetables  Garlic      2
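One thing to watch with the snippet above: .values assigns the labels by position, so it relies on the rows already sitting in the same order that groupby uses for its groups (sorted by Category). If your frame might be interleaved, a minimal sketch that assigns by index instead could look like this. The shuffled sample data and the `labels` helper Series are made up for illustration:

```python
import pandas as pd

# Same data as in the question, but with the rows interleaved (assumption for the demo)
df = pd.DataFrame(
    {
        "Category": ["FRUITS", "Vegetables", "FRUITS", "Vegetables", "FRUITS", "Vegetables"],
        "Name": ["Apple", "Onion", "Orange", "Garlic", "Apple", "Garlic"],
    }
)

# Hypothetical helper: collects the labels, keyed by the original row index
labels = pd.Series(0, index=df.index)

for _, grp in df.groupby("Category"):
    # ngroup() numbers each Name in order of first appearance, starting at 0
    # .loc aligns on the index, so row order no longer matters
    labels.loc[grp.index] = grp.groupby("Name", sort=False).ngroup() + 1

df["Label"] = labels
print(df["Label"].tolist())  # [1, 1, 2, 2, 1, 2]
```

The explicit loop is slower than a vectorised call on large frames, but it makes the index alignment easy to see.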
CodePudding user response:
You can use pd.factorize per group:
# factorize codes start at 0, so add 1 to start the labels at 1
df['Label'] = (df.groupby('Category')['Name']
                 .transform(lambda x: pd.factorize(x)[0])
                 .add(1)
              )
Output:
     Category    Name  Label
0      FRUITS   Apple      1
1      FRUITS  Orange      2
2      FRUITS   Apple      1
3  Vegetables   Onion      1
4  Vegetables  Garlic      2
5  Vegetables  Garlic      2
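If you want something you can run end to end, here is a small sketch of the factorize approach. The DataFrame is rebuilt from the question's data, with the rows shuffled on purpose (my assumption, not part of the question) to show that transform returns a result aligned with the original index:

```python
import pandas as pd

# Data from the question, with the two categories interleaved for the demo
df = pd.DataFrame(
    {
        "Category": ["FRUITS", "Vegetables", "FRUITS", "Vegetables", "FRUITS", "Vegetables"],
        "Name": ["Apple", "Onion", "Orange", "Garlic", "Apple", "Garlic"],
    }
)

# pd.factorize returns (codes, uniques); the codes number each distinct value
# in order of first appearance, starting at 0
df["Label"] = (
    df.groupby("Category")["Name"]
    .transform(lambda x: pd.factorize(x)[0])
    .add(1)
)
print(df["Label"].tolist())  # [1, 1, 2, 2, 1, 2]
```

Since the numbering follows first appearance within each category, the same name always gets the same label inside its category, but the labels depend on row order rather than on alphabetical order.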