Get substring with code from different strings

I'm working on a web scraper. Among the fields it scrapes there is a Description tag like this one, different for each product:

<div  style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>

I can get the content of the description tag without problems, but I also need to get the value of REF inside the description (V23T87C88EC for this example).

The description is different for every product, but it always contains a "REF.: XXXXXXXXX" substring. The length of the REF id can vary, and the substring can appear anywhere in the description. What's the best way to reliably extract the REF id?

CodePudding user response：

A possible solution is the following:

html = """<div  style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>"""

import re

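# re.MULTILINE lets $ match at the end of each line, not only at the end of the string
pattern = re.compile(r'REF\.: (.+?)$', re.MULTILINE)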

found = pattern.findall(html)

Returns ['V23T87C88EC']
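
Since the REF line can sit anywhere in the description, here is a minimal sketch of the same pattern applied to several descriptions. The sample strings and the `extract_refs` helper name are made up for illustration. The pattern assumes the id runs to the end of its line, and `re.MULTILINE` is what allows `$` to match at each line end.

```python
import re

# Same idea as the pattern above: capture everything after "REF.: " up to the line end.
REF_PATTERN = re.compile(r'REF\.: (.+?)$', re.MULTILINE)

def extract_refs(text):
    """Hypothetical helper: return every REF id found in text, whitespace-stripped."""
    return [ref.strip() for ref in REF_PATTERN.findall(text)]

# Made-up descriptions with the REF line at the start, middle and end.
samples = [
    "REF.: A1\n<br>\nBlue Jacket",
    "Red Dress\n<br>\nREF.: XYZ12345\n<br>\n100% Algodão",
    "Green Hat\n<br>\nREF.: Q9",
]

for sample in samples:
    print(extract_refs(sample))
```

Each call returns a list, so a description without any REF line simply yields an empty list instead of raising an error.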

CodePudding user response：

You can do this with a regex (read more about regex here: https://docs.python.org/3/howto/regex.html):

html = '''
<div  style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>
'''

import re

# The lookbehind anchors the match right after "REF.: "; \w+ then grabs the id itself
myref = re.search(r"(?<=REF\.: )\w+", html)[0]

print(myref)

# V23T87C88EC
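
One caveat when scraping many products: `re.search` returns `None` when nothing matches, so indexing the result with `[0]` raises a `TypeError` for a description that lacks a REF. Below is a small sketch of a guarded version. The `get_ref` name is hypothetical, and it assumes ids consist only of letters, digits and underscores (what `\w` matches).

```python
import re

def get_ref(description):
    """Hypothetical helper: return the REF id in description, or None if absent."""
    # \s* tolerates missing or extra whitespace after the colon; the group captures the id.
    match = re.search(r"REF\.:\s*(\w+)", description)
    return match.group(1) if match else None

print(get_ref("Black Tshirt\n<br>\nREF.: V23T87C88EC\n<br>"))
# V23T87C88EC

print(get_ref("Description without a reference"))
# None
```

If real ids can contain other characters such as hyphens, the character class would need to be widened accordingly, for example `[\w-]+`.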