Regex: Identify string occurrence in first X characters after keyword

Suppose the following string:

text = r"""Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge.

SOURCE Microsoft Corp."""

Goal:

I want to check if the company's name (Microsoft in the above example) occurs within the first X (250 for example) characters after the keyword "SOURCE".

Attempt:

import re

source = re.compile(r"SOURCE.*")
re.findall(source, text)
# output: ['SOURCE Microsoft Corp.']

To account for the character limit within which the company's name should occur, I thought of calling .split() on the output string and counting the position at which the company's name occurs. This should work fine if the company's name is a single word. However, if the name consists of several words (e.g., "Procter & Gamble"), splitting the output string gives ['SOURCE', 'Procter', '&', 'Gamble'], so searching this list for "Procter & Gamble" returns nothing.

Is there a way to express, in the regex itself, the restriction that the company name has to occur within the first X characters after the keyword?

CodePudding user response：

A performant alternative to a regex would be str.find with start and end parameters:

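limit = 250
p1 = text.find('SOURCE')
start = p1 + len('SOURCE')  # first character after the keyword
p2 = text.find('Microsoft', start, start + limit)

p2 will be the index of the match (>= 0) if 'Microsoft' lies entirely within the first limit characters after 'SOURCE', and -1 otherwise. Check p1 != -1 first, since find also returns -1 when 'SOURCE' itself is missing.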
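The same idea can be wrapped in a small helper that also handles a missing keyword. This is only a sketch: the function name occurs_after and its parameters are made up for illustration, and it assumes the window starts right after the end of the keyword.

```python
def occurs_after(text, name, keyword="SOURCE", limit=250):
    """Return True if `name` lies entirely within the first `limit`
    characters that follow the first occurrence of `keyword`."""
    p1 = text.find(keyword)
    if p1 == -1:
        # Keyword missing: there is no window to search.
        return False
    start = p1 + len(keyword)
    # str.find only reports matches that fit inside text[start:start + limit].
    return text.find(name, start, start + limit) != -1


text = "Intro text. SOURCE Procter & Gamble"
print(occurs_after(text, "Procter & Gamble"))           # True: multi-word names work
print(occurs_after(text, "Procter & Gamble", limit=5))  # False: window too small
```

Because the name is searched as one substring, multi-word names need no special treatment.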

CodePudding user response：

You don't need regex for this. Use rfind to locate the last occurrence of SOURCE, slice the 250 characters that follow it, and check whether the word is in the slice. (If SOURCE may be absent, check first that rfind did not return -1.)

text = "Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge. SOURCE Microsoft Corp."
start = text.rfind('SOURCE') + len('SOURCE')
print('Microsoft' in text[start:start + 250])

Output:

True

CodePudding user response：

You could allow a bounded number of arbitrary characters between SOURCE and the company name. If the company name (Microsoft in this example) is 9 characters long and it has to lie within the first 200 characters directly following SOURCE, at most 200 - 9 = 191 characters can precede it. So you would write:

re.findall(r'SOURCE.{0,191}Microsoft', text)

The .{a,b} quantifier matches any character (except a newline, unless the re.DOTALL flag is set) between a and b times.
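For names other than Microsoft, the pattern could be built at runtime. The following is a rough sketch, not a definitive implementation: name_within and its parameters are hypothetical names, re.escape is used so that characters such as "&" or "." in the name are taken literally, and re.DOTALL is assumed to be wanted so that line breaks between SOURCE and the name count as characters.

```python
import re


def name_within(text, name, keyword="SOURCE", limit=200):
    """Regex version: `name` must end within `limit` characters after `keyword`."""
    gap = limit - len(name)
    if gap < 0:
        return False  # the name cannot fit in the window at all
    # Bounded gap of 0..gap arbitrary characters between keyword and name.
    pattern = re.escape(keyword) + ".{0,%d}" % gap + re.escape(name)
    return re.search(pattern, text, flags=re.DOTALL) is not None


text = "Some intro.\n\nSOURCE Procter & Gamble"
print(name_within(text, "Procter & Gamble"))            # True
print(name_within(text, "Procter & Gamble", limit=10))  # False: name longer than window
```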

CodePudding user response：

Another solution using re:

import re

text = """Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge. 
    
SOURCE Microsoft Corp."""

pat = re.compile(r"(?<=SOURCE)(?=.{,250}Microsoft).*?Microsoft", flags=re.S)

if pat.search(text):
    print("Found")

Prints:

Found

CodePudding user response：

You can use the following RegEx to capture everything that comes after the first X characters following 'SOURCE':

SOURCE.{250}(.*)

I only put 250 in the RegEx as an example; you can use any number you like. For example, to capture exactly 'Microsoft Corp.' you could use SOURCE.{1}(.*), which skips the single space after 'SOURCE'.

The parentheses define a RegEx capture group, which is basically an output variable. In Python you can retrieve the capture groups using re.findall():

>>> import re
>>> r = re.compile(r'SOURCE.{1}(.*)')
>>> r.findall('SOURCE Microsoft Corp.')
['Microsoft Corp.']
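If the aim is instead to look only at the first X characters after the keyword, a bounded quantifier inside the capture group may be closer to what the question asks. A minimal sketch, assuming X = 250 and that line breaks should count as characters (hence re.DOTALL); the names X, window_re and window are illustrative only:

```python
import re

X = 250  # assumed window size
# Capture at most X characters that directly follow the keyword.
window_re = re.compile(r"SOURCE(.{0,%d})" % X, flags=re.DOTALL)

text = "Intro.\n\nSOURCE Microsoft Corp."
m = window_re.search(text)
window = m.group(1) if m else ""

print(repr(window))           # the captured window after SOURCE
print("Microsoft" in window)  # plain substring test inside the window
```

The captured window can then be tested against any company name with a plain substring check.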

EDIT: Many of this post's answers rely on using 'Microsoft' for finding the company name, but that doesn't make much sense IMO