How can I put multiple conditions for detecting a pattern in pandas using regex

Time:09-28

I have a DataFrame like this:

text

Is it possible to apply [NUM] times
Is it possible to apply [NUM] time
Called [NUM] hour ago
waited [NUM] hours
waiting [NUM] minute
waiting [NUM] minutes???
Are you kidding me !
Waiting?

I want to detect rows that contain "[NUM] time", "[NUM] times", "[NUM] minute", "[NUM] minutes", "[NUM] hour", or "[NUM] hours". A row should also count as a match if it contains "!" (one or more) or "??" (at least two question marks).

So the result would look like this:

text                                   available

Is it possible to apply [NUM] times    True
Is it possible to apply [NUM] time     True
Called [NUM] hour ago                  True
waited [NUM] hours                     True
waiting [NUM] minute                   True
waiting [NUM] minutes???               True
Are you kidding me !                   True
Waiting?                               False
I didn't like it                       False

So I want something like this, but I don't know how to put all these conditions together:

df["available"] = df['text'].apply(lambda x: re.match(r'[\!* | \?  | [NUM] time | [NUM] hour | [NUM] minute]')

Answer:

You can use Series.str.contains with a regex:

import pandas as pd

df = pd.DataFrame({'text': [
    "Is it possible to apply [NUM] times", "Is it possible to apply [NUM] time",
    "Called [NUM] hour ago", "waited [NUM] hours",
    "waiting [NUM] minute", "waiting [NUM] minutes???",
    "Are you kidding me !", "Waiting?", "I didn't like it",
]})
# str.contains searches anywhere in each string, so the pattern needs no anchors
df['available'] = df['text'].str.contains(r'\[NUM]\s*(?:hour|minute|time)s?\b|!|\?{2}', regex=True)
## => df['available']
#     0     True
#     1     True
#     2     True
#     3     True
#     4     True
#     5     True
#     6     True
#     7    False
#     8    False
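
For comparison with the attempt in the question, here is a minimal sketch of the same check written with apply and the re module. One difference matters: re.match only tests the beginning of the string, so re.search is the call that behaves like str.contains. The name PATTERN and the three-row sample are illustrative, not part of the original answer.

```python
import re

import pandas as pd

# Same regex as in the answer; PATTERN is an illustrative name.
PATTERN = re.compile(r'\[NUM]\s*(?:hour|minute|time)s?\b|!|\?{2}')

# A shortened sample, chosen only to keep the sketch small.
df = pd.DataFrame({'text': ["Called [NUM] hour ago", "Are you kidding me !", "Waiting?"]})

# re.search scans the whole string; re.match would only look at position 0.
df['available'] = df['text'].apply(lambda s: bool(PATTERN.search(s)))

print(df['available'].tolist())  # [True, True, False]
```

The str.contains version is usually the more idiomatic choice in pandas; the apply form is shown only because it is closer to the code in the question.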

Details of the regex:

  • \[NUM] - the literal string [NUM] (only the opening bracket needs escaping)
  • \s* - zero or more whitespace characters
  • (?:hour|minute|time) - a non-capturing group matching hour, minute or time
  • s? - an optional s
  • \b - a word boundary
  • | - or
  • ! - a ! char
  • | - or
  • \?{2} - two question marks.
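
The breakdown above can also be written into the pattern itself using re.VERBOSE, which ignores whitespace inside the regex and allows comments. This is only a sketch with the plain re module; the names ANNOTATED, samples and results are illustrative.

```python
import re

# The same regex, spread over several lines with one comment per piece.
ANNOTATED = re.compile(r'''
    \[NUM]                 # the literal text [NUM]
    \s*                    # zero or more whitespace characters
    (?:hour|minute|time)   # one of the three units (non-capturing group)
    s?                     # an optional plural s
    \b                     # a word boundary
  |
    !                      # or: an exclamation mark
  |
    \?{2}                  # or: two consecutive question marks
''', re.VERBOSE)

samples = ["waiting [NUM] minutes???", "Waiting?", "I didn't like it"]
results = [bool(ANNOTATED.search(s)) for s in samples]

print(results)  # [True, False, False]
```

Because the search succeeds on the first alternative that matches, a single "!" anywhere in the text is enough, while a lone "?" is not.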