Keep part of string based on certain characters in a DataFrame column

I know there have been a lot of questions around this topic but I didn't find any that described my problem. I have a df, with a specific column that looks like this:

colA   
['drinks/coke/diet', 'food/spaghetti']
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']
['drinks/coke/diet', 'drinks/coke']
...

The values of colA are strings, NOT lists. What I want to achieve is a new column where I keep only the parts of each value that contain 'coke'. 'coke' can appear any number of times in the string, and in any position. The quoted items don't always contain an equal number of segments separated by /.

So the result should look like this:

colA                                                               colB
['drinks/coke/diet', 'food/spaghetti']                           'drinks/coke/diet'
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']      'drinks/coke'
['drinks/coke/diet', 'drinks/coke']                              'drinks/coke/diet', 'drinks/coke'
...

I've tried calling a function:

import json
df['coke'] = df['colA'].apply(lambda secties: [s for s in json.loads(colA) if 'coke' in s], meta=str)

But this one keeps throwing errors that I don't know how to solve.

CodePudding user response：

You could split on commas and explode to create a Series with one item per row. Then use str.contains to build a boolean mask and keep only the items that contain the word "coke". Finally, join the remaining strings back together per original row index:

s = df['colA'].str.split(',').explode()
df['colB'] = s[s.str.contains('coke')].groupby(level=0).apply(','.join).str.strip('[]')

Output:

                                                colA                                  colB  
0             ['drinks/coke/diet', 'food/spaghetti']                    'drinks/coke/diet'  
1  ['drinks/water', 'drinks/tea', 'drinks/coke', ...                         'drinks/coke'  
2                ['drinks/coke/diet', 'drinks/coke']     'drinks/coke/diet', 'drinks/coke'
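
The split/explode approach above can be reproduced end to end with a small sample frame. This is a minimal sketch, assuming plain pandas and the three rows from the question; the fourth row is a hypothetical addition to show that rows without 'coke' end up as NaN. It also strips brackets and spaces from each piece before joining, which is a small variation on the snippet above.

```python
import pandas as pd

# Sample data assumed from the question; each colA value is a string, not a list.
df = pd.DataFrame({
    "colA": [
        "['drinks/coke/diet', 'food/spaghetti']",
        "['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']",
        "['drinks/coke/diet', 'drinks/coke']",
        "['food/pizza']",  # hypothetical extra row with no 'coke' entry
    ]
})

# One row per comma-separated piece, keeping the original row index.
s = df["colA"].str.split(",").explode()

# Drop the list brackets and surrounding spaces from every piece.
s = s.str.strip("[] ")

# Keep the pieces containing 'coke' and join them back per original row.
kept = s[s.str.contains("coke", regex=False)]
df["colB"] = kept.groupby(level=0).agg(", ".join)
```

If an empty string is preferred over NaN for rows without a match, df["colB"].fillna("") can follow.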

CodePudding user response：

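Try parsing the string into a list and then checking each item for 'coke'. json.loads needs double-quoted strings, so swap the quotes first, and pass the lambda's own argument rather than colA. The meta=str argument belongs to Dask's apply; drop it if df is a plain pandas DataFrame. Something like this:

import json
df['coke'] = df['colA'].apply(lambda secties: [s for s in json.loads(secties.replace("'", '"')) if 'coke' in s])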
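
As an alternative to json, here is a minimal sketch using ast.literal_eval from the standard library, which can parse the single-quoted list strings shown in the question directly. It assumes a plain pandas DataFrame and that every colA value is a valid Python list literal; the sample frame and the helper name keep_coke are assumptions for illustration.

```python
import ast

import pandas as pd

# Sample data assumed from the question; each colA value is a string, not a list.
df = pd.DataFrame({
    "colA": [
        "['drinks/coke/diet', 'food/spaghetti']",
        "['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']",
        "['drinks/coke/diet', 'drinks/coke']",
    ]
})


def keep_coke(text, word="coke"):
    """Parse a list-like string and keep only the items containing `word`."""
    items = ast.literal_eval(text)  # evaluates Python literals only, not arbitrary code
    return [item for item in items if word in item]


# Each cell of the new column holds a real Python list of the matching items.
df["coke"] = df["colA"].apply(keep_coke)
```

To get a single string per row instead of a list, df["coke"].str.join(", ") can be applied afterwards.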