I want to select the rows from dataframe that have "apple" in their column, but the "PHRASE" column should not be considered as string but as a list with "|" separator. Phrase column can have None/NaN values.
import pandas as pd
df = pd.DataFrame({"PHRASE": ["apple and bannaa|bannana", None, "apple", "orange|bannana", "orange|apple|bannana"]})
print(df)
0 apple and bannaa|bannana
1 apple
2 orange|bannana
3 orange|apple|bannana
I want to select rows that have apple in them, but the row should be considered as list with "|" as separator.
Desired output:
1 apple
3 orange|apple|bannana
I tried to do this
result = df[df.PHRASE.str.split("|").str.contains("apple")]
but it does not work.
CodePudding user response:
You can still use str.contains
but with a regex pattern
df[df['PHRASE'].str.contains(r'\bapple(?:\||$)', na=False)]
PHRASE
1 apple
3 orange|apple|bannana
CodePudding user response:
df.loc[[isinstance(splitted, list) and "apple" in splitted
for splitted in df.PHRASE.str.split("|")]]
(isinstance(splitted, list)
is added to prevent NaN et al.)
After your splitting, let's run a list comprehension to produce True/False's depending on the list in a given row has "apple" in it. Rest is boolean indexing:
In [129]: df
Out[129]:
PHRASE
0 apple and bannaa|bannana
1 apple
2 orange|bannana
3 orange|apple|bannana
4 NaN
In [130]: df.PHRASE.str.split("|")
Out[130]:
0 [apple and bannaa, bannana]
1 [apple]
2 [orange, bannana]
3 [orange, apple, bannana]
4 NaN
Name: PHRASE, dtype: object
In [131]: [isinstance(splitted, list) and "apple" in splitted
...: for splitted in df.PHRASE.str.split("|")]
Out[131]: [False, True, False, True, False]
In [132]: df.loc[_]
Out[132]:
PHRASE
1 apple
3 orange|apple|bannana
CodePudding user response:
You can try str.split
then explode
the list to rows
mask = (df['PHRASE'].str.split('|')
.explode().eq('apple')
.groupby(level=0).any())
print(mask)
0 False
1 False
2 True
3 False
4 True
Name: PHRASE, dtype: bool
print(df[mask])
PHRASE
2 apple
4 orange|apple|bannana