Failed to capture a certain portion of text out of a long text using regex

Time:08-25

I'm trying to scrape a portion of text out of a long text using regex.

Original text: If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb

Portion I'm interested in: kaieldentsome [!at] gmail.com.

The phrase contact us at will not always be present, so I can't rely on it as an anchor.

I've tried with:

import re

item_str = 'If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb'
output = re.findall(r"(?<=\s).*?\s\[!at\].*?\s.*?\s",item_str)[0]
print(output)

Output I wish to get:

kaieldentsome [!at] gmail.com.

CodePudding user response:

You could use

(?<=\s)\S+\s\[!at\]\s\S+\.\S+
  • (?<=\s) Positive lookbehind, assert a whitespace char to the left
  • \S+ Match 1+ non-whitespace chars
  • \s\[!at\]\s Match [!at] between whitespace chars
  • \S+\.\S+ Match 1+ non-whitespace chars, a dot, and 1+ non-whitespace chars

Note that a whitespace char must be present to the left of the match. If that is not mandatory, you can omit (?<=\s)
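A minimal sketch of how this pattern could be used in Python, reusing item_str from the question. It uses re.search instead of re.findall(...)[0] so that a text without a match does not raise an IndexError; that guard is my addition, not part of the original attempt.

```python
import re

item_str = (
    "If you have any questions or concerns, you may contact us at "
    "kaieldentsome [!at] gmail.com. You can also follow us on fb"
)

# Pattern from this answer, with the lookbehind kept.
pattern = re.compile(r"(?<=\s)\S+\s\[!at\]\s\S+\.\S+")

match = pattern.search(item_str)
# search() returns None when nothing matches, so guard before using the result.
output = match.group(0) if match else None
print(output)  # kaieldentsome [!at] gmail.com.
```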


CodePudding user response:

\S+\s*\[!at\]\s*\S+

Will also work if there is no whitespace before and/or after the [!at].

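If you want to exclude the trailing ., you can do this:

(\S+\s*\[!at\]\s*\S+?)\.?(?!\S)

Then take the first group. The last \S+ is made lazy and followed by (?!\S) because a greedy \S+ would consume the dot itself and leave nothing for \.? to match.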
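Here is a rough sketch of taking the first group in Python. The helper name extract_address and the two sample strings are made up for illustration. The lazy \S+? and the (?!\S) lookahead are assumptions on my part: they keep the optional trailing dot outside the group instead of letting \S+ swallow it.

```python
import re

# Group 1 holds the address; the optional trailing dot stays outside the group.
# The lazy \S+? and the (?!\S) lookahead are illustrative assumptions.
ADDRESS_RE = re.compile(r"(\S+\s*\[!at\]\s*\S+?)\.?(?!\S)")


def extract_address(text):
    """Return the obfuscated address without a trailing dot, or None."""
    match = ADDRESS_RE.search(text)
    return match.group(1) if match else None


# Hypothetical inputs: one with spaces around [!at], one without.
spaced = "you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb"
tight = "mail kaieldentsome[!at]gmail.com for details"

print(extract_address(spaced))  # kaieldentsome [!at] gmail.com
print(extract_address(tight))   # kaieldentsome[!at]gmail.com
```

Because of \s*, the same pattern covers both the spaced and the unspaced form of [!at].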

CodePudding user response:

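Regex quantifiers are greedy by default, meaning they match as much as possible. Making them lazy with ? does not help here, because . matches any character, including whitespace. The match starts at the first position preceded by whitespace, and .*? then expands across words until it reaches [!at].

If you use \S* instead, which matches only non-whitespace characters, each part stays within a single word and you get the desired result (plus one trailing space, because the pattern ends in \s).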

Updated code:

import re

item_str = 'If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb'
output = re.findall(r"(?<=\s)\S*?\s\[!at\]\S*?\s\S*?\s",item_str)[0]
print(output)

Try it here: https://regex101.com/r/ZMw139/1
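To see the difference side by side, here is a small comparison of the original pattern and the updated one on the same input. The variable names are just for illustration, and the final strip() is an optional extra to drop the trailing space that the last \s pulls into the match.

```python
import re

item_str = (
    "If you have any questions or concerns, you may contact us at "
    "kaieldentsome [!at] gmail.com. You can also follow us on fb"
)

# Original attempt: "." also matches whitespace, so the match begins at the
# first position preceded by whitespace and runs up to the address.
lazy_dot = re.findall(r"(?<=\s).*?\s\[!at\].*?\s.*?\s", item_str)

# Updated pattern: \S cannot cross a space, so each part stays within one word.
non_space = re.findall(r"(?<=\s)\S*?\s\[!at\]\S*?\s\S*?\s", item_str)

print(repr(lazy_dot[0]))
print(repr(non_space[0]))

# The final \s is part of the match, so strip it if it is unwanted.
cleaned = non_space[0].strip()
print(cleaned)  # kaieldentsome [!at] gmail.com.
```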
