Find two words in a string separated by an unknown number of characters, regular expression, python

Time:03-23

I am currently building a web scraper. I want to find the profiles that match a location specified by two words (WORD1, WORD2). The structure of the webpage source code for profiles in that location is always the same: WORD1(up to 100 characters, including new lines and special characters)WORD2. I don't want instances separated by only one space, because those get confused with another instance of the address that is generic to the webpage and not part of the profile. The separator therefore has to be longer than 1 character and shorter than 100.

Examples below:

'WORD1</span></span><br />\\r\\n\\t<span style="line-height: 19.84px;">WORD2'
'WORD1<br />\\r\\n\\tWORD2'
'WORD1</span><br />\\r\\n\\tWORD2'

and not 'WORD1 WORD2'

Right now I use the code:


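    from urllib.request import Request, urlopen

    hits = []
    keywords = ['WORD1<br />\\r\\n\\tWORD2', 'WORD1</span><br />\\r\\n\\tWORD2', 'WORD1</span></span><br />\\r\\n\\t<span style="line-height: 19.84px;">WORD2']
    for url in url_list:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        # str() of the bytes gives their repr, so \r\n\t show up as literal backslash escapes
        webpage = str(urlopen(req).read())
        # one entry per URL: 'WORD2' if any known variant is present, '-' otherwise
        hits.append('WORD2' if any(k in webpage for k in keywords) else '-')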

Checking each variation one by one is time consuming, inefficient and ultimately inaccurate: I have noticed at least three variations of the source code, but there might be more, and I would miss those.

I would appreciate a regular expression for finding the words 'WORD1' and 'WORD2' separated by any characters, including new lines, with more than 1 and up to 100 characters between them. Alternatively, what would be the best approach for achieving this? Thank you!

CodePudding user response:

IIUC, you could use a bounded lazy quantifier {2,99}? (2 to 99 characters inclusive; use {2,100} if you want to include 100 as well), a negative lookahead (?!WORD1) to prevent matching an inner WORD1, and re.DOTALL so that the dot also matches newlines:

re.findall(r'WORD1(?:(?!WORD1).){2,99}?WORD2', text, re.DOTALL)

or, if you want to allow a single-character separator too, provided the separator starts with a non-whitespace character:

re.findall(r'WORD1\S(?:(?!WORD1).){0,98}?WORD2', text, re.DOTALL)

output:

['WORD1</span></span><br />\n\t<span style="line-height: 19.84px;">WORD2',
 'WORD1<br />\n\tWORD2',
 'WORD1</span><br />\n\rWORD2']

or, for only the content between the words, add a capturing group:

re.findall(r'WORD1((?:(?!WORD1).){2,99}?)WORD2', text, re.DOTALL)

output:

['</span></span><br />\n\t<span style="line-height: 19.84px;">',
 '<br />\n\t',
 '</span><br />\n\r']
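
To tie this back to the scraper loop in the question, here is a minimal sketch that compiles the pattern once and applies it to each page. The helper name build_pattern, its min_len/max_len parameters, and the sample pages are hypothetical and only serve the illustration; re.escape is used on the assumption that the real words may contain regex metacharacters.

```python
import re

def build_pattern(word1, word2, min_len=2, max_len=99):
    """Hypothetical helper: compile the pattern above for any two words."""
    w1, w2 = re.escape(word1), re.escape(word2)
    # Separator: min_len..max_len characters, none of which starts another word1.
    sep = r'(?:(?!%s).){%d,%d}?' % (w1, min_len, max_len)
    return re.compile(w1 + '(' + sep + ')' + w2, re.DOTALL)

pattern = build_pattern('WORD1', 'WORD2')

# Stand-ins for downloaded page sources (assumed sample data).
pages = [
    'x WORD1</span><br />\r\n\tWORD2 y',  # profile-style markup
    'x WORD1 WORD2 y',                     # generic address, single space
    'no location here',
]

# One entry per page, mirroring the hits list from the question.
hits = ['WORD2' if pattern.search(page) else '-' for page in pages]
print(hits)
```

Calling pattern.search(page).group(1) on a matching page returns just the separator, which can help when checking which markup variants actually occur.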

test input (added extra WORD1/WORD2 to demonstrate minimal match with keyword exclusion):

text = '''WORD1 xx WORD1</span></span><br />
\t<span style="line-height: 19.84px;">WORD2 xx WORD2WORD1<br />
\tWORD2WORD1</span><br />
\rWORD2
'''
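
To see what the negative lookahead contributes, the following sketch compares the pattern with and without it on a shortened version of the test input. The variable names plain and tempered are only illustrative.

```python
import re

# Shortened test input: an extra WORD1 in front and an extra WORD2 at the end.
text = 'WORD1 xx WORD1<br />\n\tWORD2 xx WORD2'

# Without the lookahead, the lazy match still starts at the first WORD1
# and swallows the second one.
plain = re.findall(r'WORD1.{2,99}?WORD2', text, re.DOTALL)

# With the lookahead, the separator can never contain another WORD1,
# so the match starts at the WORD1 closest to WORD2.
tempered = re.findall(r'WORD1(?:(?!WORD1).){2,99}?WORD2', text, re.DOTALL)

print(plain)     # ['WORD1 xx WORD1<br />\n\tWORD2']
print(tempered)  # ['WORD1<br />\n\tWORD2']
```

A lazy quantifier only shortens the right-hand end of a match; the lookahead is what prevents the match from starting too early on the left.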

