So, I have a Pandas dataframe with food and cuisine some people like. I have to break those into columns, so each food or cuisine should become a column. Each food/cuisine comes afer a comma, but if i break my string only by commas, I'll lost the content inside the parenthesis, which should be there, close to the dish. I think I should use '),' as a separator, right? But I don't know how to do that. This is my DF:
>>> PD_FOODS
USER_ID | FOODS_I_LIKE |
_______________________________________________________________________________
0 100 | Pizza(without garlic, tomatos and onion),pasta |
1 101 | Seafood,veggies |
2 102 | Indian food (no pepper, no curry),mexican food(no pepper) |
3 103 | Texmex, african food, japanese food,italian food |
4 104 | Seafood(no shrimps, no lobster),italian food(no gluten, no milk)|
Is it possible to get a result like this bellow?
>>> PD_FOODS
USER_ID | FOODS_I_LIKE_1 | FOODS_I_LIKE_2 |
_______________________________________________________________________________
0 100 | Pizza(without garlic, tomatos and onion)| pasta |
Thank you!
CodePudding user response:
You could use a regex with a negative lookahead:
(df['FOODS_I_LIKE'].str.split(',\s*(?![^()]*\))', expand=True)
.rename(columns=lambda x: int(x) 1)
.add_prefix('FOODS_I_LIKE_')
)
output:
FOODS_I_LIKE_1 FOODS_I_LIKE_2 FOODS_I_LIKE_3 FOODS_I_LIKE_4
0 Pizza(without garlic, tomatos and onion) pasta None None
1 Seafood veggies None None
2 Indian food (no pepper, no curry) mexican food(no pepper) None None
3 Texmex african food japanese food italian food
4 Seafood(no shrimps, no lobster) italian food(no gluten, no milk) None None
You can test the regex here
NB. this won't work on nested parenthesis, you would need to use a parser
CodePudding user response:
Try this:
df=pd.DataFrame({"User_ID":[1000,1001,1002,1003,1004],
"FOODS_I_LIKE":['Pizza(without garlic, tomatos and onion),pasta',
'Seafood,veggies',
'Indian food (no pepper, no curry),mexican food(no pepper)',
'Texmex, african food, japanese food,italian food',
'Seafood(no shrimps, no lobster),italian food(no gluten, no milk)']})
def my_func(my_string, item_num):
try:
if ')' in my_string:
if item_num == 0:
return my_string.split('),')[item_num] ')'
else:
return my_string.split('),')[item_num]
else:
return my_string.split(',')[item_num]
except IndexError:
return np.nan
for k in range(0,4):
K=str(k 1)
df[f'FOODS_I_LIKE_{K}']=df.FOODS_I_LIKE.apply(lambda x: my_func(x, k))
df.drop(columns='FOODS_I_LIKE')
Output:
User_ID | FOODS_I_LIKE_1 | FOODS_I_LIKE_2 | FOODS_I_LIKE_3 | FOODS_I_LIKE_4 |
---|---|---|---|---|
1000 | Pizza(without garlic, tomatos and onion) | pasta | NaN | NaN |
1001 | Seafood | veggies | NaN | NaN |
1002 | Indian food (no pepper, no curry) | mexican food(no pepper) | NaN | NaN |
1003 | Texmex | african food | japanese food | italian food |
1004 | Seafood(no shrimps, no lobster) | italian food(no gluten, no milk) | NaN | NaN |