I would like to include the 5 characters before and after a specific word matched by my regex query. The words are in a list that I iterate over.
See the example below; this is what I tried:
import re
text = "This is an example of quality and this is true."
words = ['example', 'quality']
words_around = []
for word in words:
    neighbors = re.findall(fr'(.{0,5}{word}.{0,5})', str(text))
    words_around.append(neighbors)
print(words_around)
The output is empty. I would expect a list containing ['s an example of q', 'e of quality and ']
CodePudding user response:
You can use the PyPI regex module here, which supports variable-length (even unbounded) lookbehind patterns:
import regex
import pandas as pd
words = ['example', 'quality']
df = pd.DataFrame({'col':[
    "This is an example of quality and this is true.",
    "No matches."
]})
rx = regex.compile(fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))')
def extract_regex(s):
    return ["".join(x) for x in rx.findall(s)]
df['col2'] = df['col'].apply(extract_regex)
Output:
>>> df
                                               col                                    col2
0  This is an example of quality and this is true.  [s an example of q, e of quality and ]
1                                      No matches.                                      []
Both the pattern and how it is used matter.
The fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))' part defines the regex pattern. This is a "raw" f-string literal: the f prefix makes it possible to use variables inside the string literal, but it also requires doubling all literal braces inside it. Given the current words list, the pattern looks like (?<=(.{0,5}))(example|quality)(?=(.{0,5})). It captures 0-5 chars before the word inside a positive lookbehind, then captures the word itself, and then captures the next 0-5 chars in a positive lookahead. Lookarounds are used because they do not consume the surrounding text, so matches whose contexts overlap are all found.
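To make the brace-doubling point concrete, here is a small sketch using only the standard re module. It compares the pattern string produced by the single-brace f-string from the question with the doubled-brace version. The variable names broken and fixed are just illustrative.

```python
import re

word = "example"

# Single braces: the f-string evaluates {0,5} as the Python tuple (0, 5),
# so the quantifier never reaches the regex engine.
broken = fr'(.{0,5}{word}.{0,5})'

# Doubled braces: {{ and }} are emitted as literal { and }.
fixed = fr'(.{{0,5}}{word}.{{0,5}})'

print(broken)  # (.(0, 5)example.(0, 5))
print(fixed)   # (.{0,5}example.{0,5})

text = "This is an example of quality and this is true."
print(re.findall(broken, text))  # []
print(re.findall(fixed, text))   # ['s an example of q']
```

This is why the original attempt returned nothing: the pattern was looking for the literal text "0, 5" next to each word.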
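The ["".join(x) for x in rx.findall(s)] part handles the result. Since the pattern has three capturing groups, rx.findall returns one tuple per match (text before, word, text after); the list comprehension joins each tuple into a single string and returns the list of matches.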
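If installing a third-party module is not an option, a different technique gives the same result on this example with the standard library alone: find each word with re.finditer and slice the surrounding text by index. This is only a sketch, not the approach used above. The helper name context_windows and its width parameter are made up for illustration, and unlike the `.`-based pattern, slicing will also include newline characters in the context.

```python
import re

def context_windows(text, words, width=5):
    """Return each matched word with up to `width` chars on either side."""
    # re.escape guards against words that contain regex metacharacters.
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return [
        # max(0, ...) keeps the slice from wrapping around at the string start.
        text[max(0, m.start() - width): m.end() + width]
        for m in pattern.finditer(text)
    ]

text = "This is an example of quality and this is true."
result = context_windows(text, ['example', 'quality'])
print(result)  # ['s an example of q', 'e of quality and ']
```

Because only the word itself is consumed by each match, overlapping contexts are handled without any lookarounds. With pandas, this could be applied the same way as extract_regex, for instance via df['col'].apply(lambda s: context_windows(s, words)).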