Python regex take characters around word

Time:11-11

I would like to capture the 5 characters before and after a specific word whenever my regex matches it. The words are in a list that I iterate over.

See example below, this is what I tried:

import re

text = "This is an example of quality and this is true."

words = ['example', 'quality']

words_around = []

for word in words:
    neighbors = re.findall(fr'(.{0,5}{word}.{0,5})', str(text))
    words_around.append(neighbors)

print(words_around)

The output is empty. I would expect a list containing ['s an example of q', 'e of quality and ']

CodePudding user response:

You can use the PyPI regex module here. Unlike the built-in re module, which requires fixed-width lookbehinds, it allows variable-width lookbehind patterns:

import regex
import pandas as pd

words = ['example', 'quality']

df = pd.DataFrame({'col':[
    "This is an example of quality and this is true.",
    "No matches."
    ]})

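# Doubled braces ({{0,5}}) keep the quantifier literal inside the f-string
rx = regex.compile(fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))')

def extract_regex(s):
    # findall returns one (before, word, after) tuple per match; join each into a string
    return ["".join(x) for x in rx.findall(s)]

df['col2'] = df['col'].apply(extract_regex)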

Output:

>>> df
                                               col                                    col2
0  This is an example of quality and this is true.  [s an example of q, e of quality and ]
1                                      No matches.                                      []

Both the pattern and how it is used are of importance.

The fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))' part defines the regex pattern. This is a raw f-string literal: the f prefix makes it possible to use variables inside the string literal, but it also requires doubling all literal braces. That is why the attempt in the question returns nothing: in an f-string, a single-braced {0,5} is evaluated as the Python tuple (0, 5) instead of being kept as a quantifier. Given the current words list, the pattern becomes (?<=(.{0,5}))(example|quality)(?=(.{0,5})). It captures 0-5 characters before the word inside a positive lookbehind, then captures the word itself, and then captures the next 0-5 characters inside a positive lookahead. Because lookarounds do not consume text, the context of one match can overlap with the next match, as it does for "example" and "quality" here.
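To make the brace-doubling point concrete, here is a minimal sketch that prints the pattern string produced by each spelling. It uses only the standard library; the variable names broken and fixed are chosen here for illustration.

```python
word = "example"

# Single braces: the f-string evaluates {0,5} as the Python tuple (0, 5).
broken = fr'(.{0,5}{word}.{0,5})'

# Doubled braces: {{0,5}} is emitted as the literal quantifier {0,5}.
fixed = fr'(.{{0,5}}{word}.{{0,5}})'

print(broken)  # (.(0, 5)example.(0, 5))
print(fixed)   # (.{0,5}example.{0,5})
```

The first pattern looks for the literal text "0, 5" on both sides of the word, so it finds nothing in the sample sentence.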

The ["".join(x) for x in rx.findall(s)] part joins the groups of each match into a single string, and returns a list of matches as a result.
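If installing a third-party module is not an option, a similar result can probably be had with the built-in re module by matching only the words and slicing the surrounding text. This is a rough sketch, not the approach used above: the helper name context_around and its width parameter are made up for illustration, and the words are passed through re.escape on the assumption that they should be matched literally.

```python
import re

def context_around(text, words, width=5):
    # Hypothetical helper: find each word with the built-in re module,
    # then slice up to `width` characters on either side of the match.
    pattern = re.compile("|".join(map(re.escape, words)))
    return [
        text[max(0, m.start() - width):m.end() + width]
        for m in pattern.finditer(text)
    ]

text = "This is an example of quality and this is true."
print(context_around(text, ['example', 'quality']))
# ['s an example of q', 'e of quality and ']
```

One difference to keep in mind: slicing will include newline characters in the context, whereas . in the regex pattern does not match them by default.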
