Select rows from pandas that contain given string in list-CodePudding

I want to select the rows from dataframe that have "apple" in their column, but the "PHRASE" column should not be considered as string but as a list with "|" separator. Phrase column can have None/NaN values.

import pandas as pd

df = pd.DataFrame({"PHRASE": ["apple and bannaa|bannana", None, "apple", "orange|bannana", "orange|apple|bannana"]})

print(df)

0  apple and bannaa|bannana
1                     apple
2            orange|bannana
3      orange|apple|bannana

I want to select rows that have apple in them, but the row should be considered as list with "|" as separator.

Desired output:

1                     apple
3      orange|apple|bannana

I tried to do this

result = df[df.PHRASE.str.split("|").str.contains("apple")]

but it does not work.

CodePudding user response：

You can still use str.contains but with a regex pattern

df[df['PHRASE'].str.contains(r'\bapple(?:\||$)', na=False)]

                 PHRASE
1                 apple
3  orange|apple|bannana

CodePudding user response：

df.loc[[isinstance(splitted, list) and "apple" in splitted 
        for splitted in df.PHRASE.str.split("|")]]

(isinstance(splitted, list) is added to prevent NaN et al.)

After your splitting, let's run a list comprehension to produce True/False's depending on the list in a given row has "apple" in it. Rest is boolean indexing:

In [129]: df
Out[129]:
                     PHRASE
0  apple and bannaa|bannana
1                     apple
2            orange|bannana
3      orange|apple|bannana
4                       NaN

In [130]: df.PHRASE.str.split("|")
Out[130]:
0    [apple and bannaa, bannana]
1                        [apple]
2              [orange, bannana]
3       [orange, apple, bannana]
4                            NaN
Name: PHRASE, dtype: object

In [131]: [isinstance(splitted, list) and "apple" in splitted
     ...:  for splitted in df.PHRASE.str.split("|")]
Out[131]: [False, True, False, True, False]

In [132]: df.loc[_]
Out[132]:
                 PHRASE
1                 apple
3  orange|apple|bannana

CodePudding user response：

You can try str.split then explode the list to rows

mask = (df['PHRASE'].str.split('|')
        .explode().eq('apple')
        .groupby(level=0).any())

print(mask)

0    False
1    False
2     True
3    False
4     True
Name: PHRASE, dtype: bool

print(df[mask])

                 PHRASE
2                 apple
4  orange|apple|bannana