I know there have been a lot of questions around this topic, but I didn't find any that described my problem. I have a df with a specific column that looks like this:
colA
['drinks/coke/diet', 'food/spaghetti']
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']
['drinks/coke/diet', 'drinks/coke']
...
The values of colA are strings, NOT lists. What I want to achieve is a new column where I keep only the values that contain 'coke'. 'coke' can appear any number of times in the string, and in any position. The values between the quotes ('') don't always contain an equal number of parts separated by /.
So the result should look like this:
colA colB
['drinks/coke/diet', 'food/spaghetti'] 'drinks/coke/diet'
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza'] 'drinks/coke'
['drinks/coke/diet', 'drinks/coke'] 'drinks/coke/diet', 'drinks/coke'
...
I've tried calling a function:
import json
df['coke'] = df['colA'].apply(lambda secties: [s for s in json.loads(colA) if 'coke' in s], meta=str)
But this one keeps throwing errors that I don't know how to solve.
CodePudding user response:
You could split on commas and explode to create a Series. Then use str.contains to create a boolean mask that you could use to filter the items that contain the word "coke". Finally, join the strings back across indices:
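# One row per comma-separated piece; the original row index is kept
s = df['colA'].str.split(',').explode()
# Filter, rejoin per row, then strip the brackets and any leading space
df['colB'] = s[s.str.contains('coke')].groupby(level=0).apply(','.join).str.strip('[] ')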
Output:
colA colB
0 ['drinks/coke/diet', 'food/spaghetti'] 'drinks/coke/diet'
1 ['drinks/water', 'drinks/tea', 'drinks/coke', ... 'drinks/coke'
2 ['drinks/coke/diet', 'drinks/coke'] 'drinks/coke/diet', 'drinks/coke'
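For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split / explode / filter / join steps. The sample frame is rebuilt from the rows shown in the question, and the intermediate names (pieces, coke_pieces, cleaned) are just illustrative. It strips brackets and spaces from each piece before joining, which is a small variation on the one-liner above, not a different method.

```python
import pandas as pd

# Sample frame rebuilt from the question; each value is a string, not a list.
df = pd.DataFrame({
    "colA": [
        "['drinks/coke/diet', 'food/spaghetti']",
        "['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']",
        "['drinks/coke/diet', 'drinks/coke']",
    ]
})

# One row per comma-separated piece; the original row index is preserved.
pieces = df["colA"].str.split(",").explode()

# Keep only the pieces that mention "coke" (plain substring match, no regex).
coke_pieces = pieces[pieces.str.contains("coke", regex=False)]

# Trim brackets and surrounding spaces from each piece.
cleaned = coke_pieces.str.strip("[] ")

# Rejoin the surviving pieces per original row; assignment aligns on the index.
df["colB"] = cleaned.groupby(level=0).agg(", ".join)

print(df["colB"].tolist())
```

Rows without any "coke" entry would simply get NaN in colB, since they have no surviving pieces to join.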
CodePudding user response:
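Try parsing the string into a list and then checking each item for 'coke'. The values use single quotes, so json.loads will reject them; ast.literal_eval handles them. Something like this:
import ast
# In plain pandas, drop meta=str (apply would forward it to the lambda; meta is a Dask argument)
df['coke'] = df['colA'].apply(lambda secties: [s for s in ast.literal_eval(secties) if 'coke' in s])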
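A runnable sketch of the parse-then-filter idea, assuming plain pandas and assuming the strings are Python-style list literals with single quotes, as in the question. json.loads only accepts double-quoted strings, so this uses ast.literal_eval instead. The helper name keep_matching and its needle parameter are made up for the example.

```python
import ast

import pandas as pd

# Sample frame rebuilt from the question; each value is a string, not a list.
df = pd.DataFrame({
    "colA": [
        "['drinks/coke/diet', 'food/spaghetti']",
        "['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']",
        "['drinks/coke/diet', 'drinks/coke']",
    ]
})


def keep_matching(raw, needle="coke"):
    """Parse a stringified list and keep the items that contain `needle`.

    Hypothetical helper for this example, not part of pandas.
    """
    items = ast.literal_eval(raw)  # safely evaluates the list literal
    return [item for item in items if needle in item]


# Each cell of the new column is a real Python list of matching items.
df["coke"] = df["colA"].apply(keep_matching)

print(df["coke"].tolist())
```

Unlike the split/explode approach, this gives you actual lists in the new column. If you need a single string per row, you could follow it with df["coke"].str.join(", ").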