Home > Mobile >  How can I split a text between commas and make each new part a new column?
How can I split a text between commas and make each new part a new column?

Time:10-05

So, I have a Pandas dataframe with food and cuisine some people like. I have to break those into columns, so each food or cuisine should become a column. Each food/cuisine comes afer a comma, but if i break my string only by commas, I'll lost the content inside the parenthesis, which should be there, close to the dish. I think I should use '),' as a separator, right? But I don't know how to do that. This is my DF:

>>> PD_FOODS
  USER_ID  | FOODS_I_LIKE                                                      |
_______________________________________________________________________________
0   100    |   Pizza(without garlic, tomatos and onion),pasta                  |
1   101    |   Seafood,veggies                                                 |
2   102    |   Indian food (no pepper, no curry),mexican food(no pepper)       |
3   103    |   Texmex, african food, japanese food,italian food                |
4   104    |   Seafood(no shrimps, no lobster),italian food(no gluten, no milk)|

Is it possible to get a result like this bellow?

>>> PD_FOODS
  USER_ID  |   FOODS_I_LIKE_1                          |  FOODS_I_LIKE_2       |
_______________________________________________________________________________
0   100    |   Pizza(without garlic, tomatos and onion)|  pasta                |

Thank you!

CodePudding user response:

You could use a regex with a negative lookahead:

(df['FOODS_I_LIKE'].str.split(',\s*(?![^()]*\))', expand=True)
   .rename(columns=lambda x: int(x) 1)
   .add_prefix('FOODS_I_LIKE_')
)

output:

                             FOODS_I_LIKE_1                    FOODS_I_LIKE_2 FOODS_I_LIKE_3 FOODS_I_LIKE_4
0  Pizza(without garlic, tomatos and onion)                             pasta           None           None
1                                   Seafood                           veggies           None           None
2         Indian food (no pepper, no curry)           mexican food(no pepper)           None           None
3                                    Texmex                      african food  japanese food   italian food
4           Seafood(no shrimps, no lobster)  italian food(no gluten, no milk)           None           None

You can test the regex here

NB. this won't work on nested parenthesis, you would need to use a parser

CodePudding user response:

Try this:

df=pd.DataFrame({"User_ID":[1000,1001,1002,1003,1004],
               "FOODS_I_LIKE":['Pizza(without garlic, tomatos and onion),pasta',
                              'Seafood,veggies',
                              'Indian food (no pepper, no curry),mexican food(no pepper)',
                              'Texmex, african food, japanese food,italian food',
                              'Seafood(no shrimps, no lobster),italian food(no gluten, no milk)']})

def my_func(my_string, item_num):
    try:
        if ')' in my_string:
            if item_num == 0:
                return my_string.split('),')[item_num] ')'
            else:
                return my_string.split('),')[item_num]
        else:
            return my_string.split(',')[item_num]
    except IndexError:
        return np.nan
    
for k in range(0,4):
    K=str(k 1)
    df[f'FOODS_I_LIKE_{K}']=df.FOODS_I_LIKE.apply(lambda x: my_func(x, k))

df.drop(columns='FOODS_I_LIKE')

Output:

User_ID FOODS_I_LIKE_1 FOODS_I_LIKE_2 FOODS_I_LIKE_3 FOODS_I_LIKE_4
1000 Pizza(without garlic, tomatos and onion) pasta NaN NaN
1001 Seafood veggies NaN NaN
1002 Indian food (no pepper, no curry) mexican food(no pepper) NaN NaN
1003 Texmex african food japanese food italian food
1004 Seafood(no shrimps, no lobster) italian food(no gluten, no milk) NaN NaN
  • Related