Let's take an example.
I have a list of known categories:
L_known_categories = ["Orange","Green","Red","Black & White"]
No string in that list is a substring of another string in the list.
And a dataframe:
df = pd.DataFrame({"Items":["green apple","blue bottle","RED APPLE","Green paper","Black & White glasses",
"An orange fruit"]})
Items
0 green apple
1 blue bottle
2 RED APPLE
3 Green paper
4 Black & White glasses
5 An orange fruit
I would like to add a column Category to this dataframe. If the string in the column Items starts with a string from L_known_categories, regardless of case, the category is that string. If no such string is found, the category is the string from the column Items.
I could use a for loop, but that is not efficient on my real, much larger dataframe. How could I do this?
Expected output:
Items Category
0 green apple Green
1 blue bottle blue bottle
2 RED APPLE Red
3 Green paper Green
4 Black & White glasses Black & White
5 An orange fruit An orange fruit
CodePudding user response:
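You can use a regex with pandas.Series.str.extract. Title-case the items so they match the spelling of the categories, extract a known category anchored at the start of the string, and fall back to the original item where nothing matches: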
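>>> import re
>>> # Anchor the whole alternation at the start and escape regex metacharacters
>>> pattern = '^(' + '|'.join(map(re.escape, L_known_categories)) + ')'
>>> df['Category'] = df['Items'].str.title().str.extract(pattern)[0].fillna(df['Items'])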
>>> df
Items Category
0 green apple Green
1 blue bottle blue bottle
2 RED APPLE Red
3 Green paper Green
4 Black & White glasses Black & White
5 An orange fruit An orange fruit
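Note that str.title() only works here because every category happens to be written in title case. If that is not guaranteed, one possible variant is to match case-insensitively and then map the matched text back to the spelling used in L_known_categories. The following is a minimal sketch under that assumption; the names canonical, pattern and matched are illustrative and not part of the original answer:

```python
import re
import pandas as pd

L_known_categories = ["Orange", "Green", "Red", "Black & White"]
df = pd.DataFrame({"Items": ["green apple", "blue bottle", "RED APPLE", "Green paper",
                             "Black & White glasses", "An orange fruit"]})

# Lookup from the lowercased category to the spelling used in the list.
canonical = {c.lower(): c for c in L_known_categories}

# Anchor the whole alternation at the start; escape regex metacharacters.
pattern = "^(" + "|".join(map(re.escape, L_known_categories)) + ")"

# Case-insensitive extraction returns the text as written in Items (e.g. "RED"),
# so lowercase it and map it back to the canonical category.
matched = df["Items"].str.extract(pattern, flags=re.IGNORECASE)[0]
df["Category"] = matched.str.lower().map(canonical).fillna(df["Items"])
print(df)
```

Rows with no match stay NaN through the lowercase and map steps, so fillna still falls back to the original item.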