I'm trying to scrape a portion of text out of a long text using regex.
Original text: If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb
Portion I'm interested in: kaieldentsome [!at] gmail.com.
The phrase "contact us at" will not always be present before it.
I've tried with:
import re
item_str = 'If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb'
output = re.findall(r"(?<=\s).*?\s\[!at\].*?\s.*?\s",item_str)[0]
print(output)
Output I wish to get:
kaieldentsome [!at] gmail.com.
CodePudding user response:
You could use
(?<=\s)\S+\s\[!at\]\s\S+\.\S+
(?<=\s)
Positive lookbehind, assert a whitespace char to the left
\S+
Match 1+ non-whitespace chars
\s\[!at\]\s
Match [!at] between whitespace chars
\S+\.\S+
Match 1+ non-whitespace chars with at least one dot
Note that a whitespace char has to be present to the left of the match. If that is not mandatory, you can omit (?<=\s)
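As a quick check, here is a minimal sketch of that pattern applied to the string from the question, with the + quantifiers written out. The helper name extract_obfuscated_email is hypothetical and only there for illustration; re.search is used instead of re.findall so that a missing match gives None rather than an IndexError.

```python
import re

item_str = (
    "If you have any questions or concerns, you may contact us at "
    "kaieldentsome [!at] gmail.com. You can also follow us on fb"
)

# Lookbehind for whitespace, a word, " [!at] ", then a word containing a dot.
PATTERN = re.compile(r"(?<=\s)\S+\s\[!at\]\s\S+\.\S+")


def extract_obfuscated_email(text):
    """Hypothetical helper: return the first match, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


result = extract_obfuscated_email(item_str)
no_match = extract_obfuscated_email("no address in this sentence")
print(result)
```

Because the lookbehind requires a whitespace char, a string that starts directly with the address would not match unless (?<=\s) is dropped.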
CodePudding user response:
\S+\s*\[!at\]\s*\S+
This will also work if there is no whitespace before and/or after the [!at].
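If you want to exclude the trailing ., note that a greedy \S+ at the end of the group would swallow it, so make the group end on a char that is neither whitespace nor a dot:
(\S+\s*\[!at\]\s*\S*[^\s.])\.?
Then take the first group.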
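Here is a short sketch comparing the two behaviours on the question's text and on a made-up sample without spaces around [!at]. The second sample string and the variable names are assumptions for illustration. The grouped pattern used here is a variant that ends the group on a non-dot char, because with a plain greedy \S+ at the end of the group the optional \.? would match nothing and the dot would stay inside the group.

```python
import re

# Whitespace around [!at] is optional here.
LOOSE = re.compile(r"\S+\s*\[!at\]\s*\S+")

# Assumed variant: the group must end on a char that is neither whitespace
# nor a dot, so a sentence-ending "." is left outside group 1.
NO_TRAILING_DOT = re.compile(r"(\S+\s*\[!at\]\s*\S*[^\s.])\.?")

samples = [
    "you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb",
    "mail kaieldentsome[!at]gmail.com for details",  # hypothetical input
]

loose_results = [LOOSE.search(s).group(0) for s in samples]
no_dot_results = [NO_TRAILING_DOT.search(s).group(1) for s in samples]

for loose, no_dot in zip(loose_results, no_dot_results):
    print(repr(loose), repr(no_dot))
```

Dots inside the address (as in gmail.com) are kept; only a dot directly at the end of the match is left out of the group.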
CodePudding user response:
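Regex quantifiers are greedy by default, meaning they match as much as possible. Making them lazy with .*? does not help here, because . still matches whitespace: the match starts at the first position that follows a whitespace char and keeps expanding across words until it reaches [!at].
If you use \S*? instead, which matches everything except whitespace, the match cannot span several words and you will get the desired result.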
Updated code:
import re
item_str = 'If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb'
output = re.findall(r"(?<=\s)\S*?\s\[!at\]\S*?\s\S*?\s",item_str)[0].strip()
print(output)
Try it here: https://regex101.com/r/ZMw139/1
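To make the difference visible, this sketch runs the pattern from the question and the \S-based pattern side by side on the same string. The variable names are illustrative only. Note that the \S-based pattern ends in \s, so its raw match carries one trailing space.

```python
import re

item_str = (
    "If you have any questions or concerns, you may contact us at "
    "kaieldentsome [!at] gmail.com. You can also follow us on fb"
)

# Pattern from the question: "." also matches spaces, so the match begins
# right after the first whitespace in the string and runs up to [!at].
with_dot = re.findall(r"(?<=\s).*?\s\[!at\].*?\s.*?\s", item_str)[0]

# \S cannot cross a space, so only the single word before [!at] is taken.
with_non_space = re.findall(r"(?<=\s)\S*?\s\[!at\]\S*?\s\S*?\s", item_str)[0]

print(repr(with_dot))
print(repr(with_non_space))
print(repr(with_non_space.strip()))
```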