I am trying to identify substrings in a given string using str.contains while mixing OR and AND conditions.
I know that OR can be represented by |:
str.contains("error|break|insufficient")
and that AND can be represented by &:
str.contains("error|break|insufficient") & str.contains("status")
I would like to mix OR and AND together. For example, I want to identify strings that have "error" OR "break" OR ("insufficient" AND "status").
So a sentence like "error break insufficient" should be identified. With the code above it is not, because there is no "status" in the sentence.
CodePudding user response:
One approach:
import pandas as pd
# toy data
s = pd.Series(["hello", "world", "simple", "error", "break", "insufficient something status", "status"])
# AND mask: both "insufficient" and "status" must be present
insufficient_and_status = s.str.contains("insufficient") & s.str.contains("status")
# OR mask: either "error" or "break" is enough
break_or_error = s.str.contains("error|break", regex=True)
# combine the two masks with OR
mask = break_or_error | insufficient_and_status
res = s[mask]
print(res)
Output
3 error
4 break
5 insufficient something status
dtype: object
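If the list of keywords grows, the same two-mask idea can be wrapped in small helpers. This is only a minimal sketch, assuming plain substring matching is what is wanted (hence regex=False). The names contains_all and contains_any are hypothetical helpers, not pandas functions, and the row "error break insufficient" is added to the toy data to cover the sentence from the question:

```python
from functools import reduce
import operator

import pandas as pd


def contains_all(series, words):
    """Boolean mask: True where a string contains every word (AND)."""
    masks = [series.str.contains(w, regex=False) for w in words]
    return reduce(operator.and_, masks)


def contains_any(series, words):
    """Boolean mask: True where a string contains at least one word (OR)."""
    masks = [series.str.contains(w, regex=False) for w in words]
    return reduce(operator.or_, masks)


# toy data, plus the sentence from the question (hypothetical extra row)
s = pd.Series(["hello", "world", "simple", "error", "break",
               "insufficient something status", "status",
               "error break insufficient"])

# "error" OR "break" OR ("insufficient" AND "status")
mask = contains_any(s, ["error", "break"]) | contains_all(s, ["insufficient", "status"])
res = s[mask]
print(res)
```

Both helpers assume a non-empty list of words. The lone "status" row is still excluded, while "error break insufficient" is kept because the OR part already matches.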
Alternatively, using a single regex:
# (?:...) is a non-capturing group, so pandas does not warn about match groups
mask = s.str.contains(r"error|break|(?:insufficient.*status|status.*insufficient)", regex=True)
res = s[mask]
print(res)
The alternative relies on the fact that if the string contains both insufficient and status, then at least one of the patterns insufficient.*status or status.*insufficient matches (i.e. either insufficient occurs first or status does).
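Another way to express "both words, in any order" in one pattern is with lookaheads. This is a sketch rather than part of the answer above: it uses Python's re module directly on a few made-up sample strings, and compares the result with a plain substring check. The same pattern string should also be usable with s.str.contains, assuming ordinary object-dtype strings without newlines:

```python
import re

# (?=.*insufficient)(?=.*status) succeeds only if both words occur
# somewhere in the string, regardless of which one comes first
pattern = re.compile(r"error|break|(?=.*insufficient)(?=.*status)")

# made-up sample strings for illustration
samples = [
    "hello",
    "status",
    "error break insufficient",
    "insufficient something status",
    "status then insufficient",
]

# strings selected by the regex
matched = [text for text in samples if pattern.search(text)]

# the same condition spelled out with plain substring checks
expected = [
    text for text in samples
    if "error" in text or "break" in text
    or ("insufficient" in text and "status" in text)
]

print(matched)
print(matched == expected)
```

The lookahead form avoids writing each ordering by hand, which matters once the AND group has more than two words.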