Filter a column for rows that contain all substrings

I am trying to select all crispy chicken sandwiches in a dataset. I tried the regex below, but it still returns some grilled chicken sandwiches. Here is the code:

data_sandwich_crispy = data[data['Item'].str.contains(r'^(?=.*crispy)(?=.*sandwich)(?=.*chicken)', regex=True)]

And here is what the dataset looks like:

Any correction or link to an answer is much appreciated. Sorry if I made a mistake, and thank you for all your help!

CodePudding user response：

This would be my solution. It looks for strings where the word Crispy is followed by the word Chicken, which is in turn followed by the word Sandwich. There can be an arbitrary number of spaces or other characters in between.

import pandas as pd

# some data
l = ["Crispy Chicken Sandwich", 
     "Grilled Chicken Sandwich", 
     "crispy Chicken Sandwich"]
data = pd.DataFrame(l, columns=["A"])
data
#       A
# 0     Crispy Chicken Sandwich
# 1     Grilled Chicken Sandwich
# 2     crispy Chicken Sandwich


# case=False makes the match case-insensitive; `.*` allows any characters between the words
data[data['A'].str.contains(r'Crispy.*Chicken.*Sandwich', regex=True, case=False)]
#       A
# 0     Crispy Chicken Sandwich
# 2     crispy Chicken Sandwich
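
If the words can appear in a different order, one regex-free variant is to build one `str.contains` mask per word and combine them. This is only a minimal sketch: the `words` list and the third row are made up for illustration, and it assumes the column has no missing values.

```python
import pandas as pd

data = pd.DataFrame(
    {"A": ["Crispy Chicken Sandwich",
           "Grilled Chicken Sandwich",
           "Sandwich with crispy chicken"]}  # hypothetical row, words in another order
)

words = ["crispy", "chicken", "sandwich"]  # illustrative list of required substrings

# Start from an all-True mask and AND in one case-insensitive test per word.
mask = pd.Series(True, index=data.index)
for word in words:
    mask &= data["A"].str.contains(word, case=False, regex=False)

result = data[mask]
print(result)
```

Because every word is tested separately, the order in which the words appear in the string does not matter, unlike with the `Crispy.*Chicken.*Sandwich` pattern.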

CodePudding user response：

If you meant collecting only the rows that contain crispy chicken sandwiches, have a look at the alternative solution below. It returns a row only when all three words (crispy, chicken and classic) are present, in any order:

data_sandwich_crispy = df[df['item'].str.contains(r'^(?=.*?\bcrispy\b)(?=.*?\bchicken\b)(?=.*?\bclassic\b).*$',regex=True)]

I created a simple dataframe as shown below:

item                                        id
premium crispy chicken classic sandwhich    10
premium grilled chicken classic sandwhich   15
premium club chicken classic sandwhich      14

Running the command given above gives the following output:

item                                        id
premium crispy chicken classic sandwhich    10
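
For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the small dataframe above and applies the same lookahead pattern. The `case=False` argument is my addition, on the assumption that the real data may mix upper and lower case (e.g. "Crispy"); drop it if the match should stay case-sensitive.

```python
import pandas as pd

# Rebuild the sample dataframe shown above.
df = pd.DataFrame({
    "item": ["premium crispy chicken classic sandwhich",
             "premium grilled chicken classic sandwhich",
             "premium club chicken classic sandwhich"],
    "id": [10, 15, 14],
})

# Each lookahead asserts that one whole word occurs somewhere in the string.
pattern = r'^(?=.*?\bcrispy\b)(?=.*?\bchicken\b)(?=.*?\bclassic\b).*$'

# case=False is an assumption added here, not part of the original command.
data_sandwich_crispy = df[df["item"].str.contains(pattern, regex=True, case=False)]
print(data_sandwich_crispy)
```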