I have a DataFrame like this:
text
Is it possible to apply [NUM] times
Is it possible to apply [NUM] time
Called [NUM] hour ago
waited [NUM] hours
waiting [NUM] minute
waiting [NUM] minutes???
Are you kidding me !
Waiting?
I want to detect rows that contain "[NUM] time", "[NUM] times", "[NUM] minute", "[NUM] minutes", "[NUM] hour", or "[NUM] hours". A row should also match if it contains "!" (one or more) or "??" (at least two question marks).
So the result would look like this:
text available
Is it possible to apply [NUM] times True
Is it possible to apply [NUM] time True
Called [NUM] hour ago True
waited [NUM] hours True
waiting [NUM] minute True
waiting [NUM] minutes??? True
Are you kidding me ! True
Waiting? False
I didn't like it False
I want something like this, but I don't know how to put all these conditions together:
df["available"] = df['text'].apply(lambda x: re.match(r'[\!* | \? | [NUM] time | [NUM] hour | [NUM] minute]')
CodePudding user response:
You can use Series.str.contains with a regex:
import pandas as pd
df = pd.DataFrame({'text':["Is it possible to apply [NUM] times","Is it possible to apply [NUM] time","Called [NUM] hour ago","waited [NUM] hours","waiting [NUM] minute","waiting [NUM] minutes???","Are you kidding me !","Waiting?", "I didn't like it"]})
df['available'] = df['text'].str.contains(r'\[NUM]\s*(?:hour|minute|time)s?\b|!|\?{2}', regex=True)
## => df['available']
# 0 True
# 1 True
# 2 True
# 3 True
# 4 True
# 5 True
# 6 True
# 7 False
# 8 False
See the regex demo. Details:
\[NUM] - the literal [NUM] string
\s* - zero or more whitespace characters
(?:hour|minute|time) - a non-capturing group matching hour, minute, or time
s? - an optional s
\b - a word boundary
| - or
! - a ! char
| - or
\?{2} - two question marks.
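The breakdown above can also be checked without pandas. The following is only a sketch using the standard re module: it spells out the same pattern with re.VERBOSE so each piece carries its own comment, and runs it over the sample strings from the question. The names PATTERN, is_available, samples, and flags are made up for this illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that
# whitespace inside the pattern is ignored and '#' starts a comment.
PATTERN = re.compile(
    r"""
    \[NUM]                  # the literal text [NUM]
    \s*                     # zero or more whitespace characters
    (?:hour|minute|time)    # one of the three unit words
    s?                      # an optional plural s
    \b                      # a word boundary
    | !                     # or an exclamation mark
    | \?{2}                 # or two consecutive question marks
    """,
    re.VERBOSE,
)


def is_available(text: str) -> bool:
    # re.search scans the whole string; re.match (used in the question)
    # only tests for a match at the very start of the string.
    return PATTERN.search(text) is not None


# Sample strings taken from the question.
samples = [
    "Is it possible to apply [NUM] times",
    "Is it possible to apply [NUM] time",
    "Called [NUM] hour ago",
    "waited [NUM] hours",
    "waiting [NUM] minute",
    "waiting [NUM] minutes???",
    "Are you kidding me !",
    "Waiting?",
    "I didn't like it",
]

flags = [is_available(s) for s in samples]
print(flags)
```

Two things the original attempt in the question ran into: square brackets in a regex define a character class, so the alternatives need to be joined with | outside of brackets (with the literal [ escaped), and re.match only anchors at the beginning of the text, which is why re.search or str.contains is the better fit here.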