I'm working on a web scraper. One of the fields it scrapes is a description tag like this one, which is different for each product:
<div style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>
I can get the content of the description tag without problems, but I also need to get the value of REF inside the description (V23T87C88EC for this example).
The problem is that the description is different for every product. However, there is always a "REF.: XXXXXXXXX" substring in it. The length of the REF id can vary, and it can appear anywhere in the string. What's the best way to reliably get the REF id?
CodePudding user response:
A possible solution is the following:
html = """<div style="overflow: hidden;
display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>"""
import re
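# re.MULTILINE lets `$` match at the end of every line, not only at the end of the string
pattern = re.compile(r'REF\.: (.+?)$', re.MULTILINE)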
found = pattern.findall(html)
Returns ['V23T87C88EC']
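The `$` anchor only matches at the end of each line when `re.MULTILINE` is set; without the flag it matches only at the end of the whole string. A minimal sketch of the difference, using a shortened description string made up for illustration (not the full page markup):

```python
import re

# Shortened sample description (illustrative, not the real scraped markup)
text = "Black Tshirt\nREF.: V23T87C88EC\nCOMPOSIÇÃO:\n90% Poliamida"

# Without re.MULTILINE, `$` matches only at the very end of the string,
# and `.` does not cross newlines, so nothing is found.
without_flag = re.findall(r'REF\.: (.+?)$', text)

# With re.MULTILINE, `$` also matches just before each newline.
with_flag = re.findall(r'REF\.: (.+?)$', text, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['V23T87C88EC']
```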
CodePudding user response:
You can do this with a regex (read more about regex: https://docs.python.org/3/howto/regex.html):
html = '''
<div style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>
'''
import re
# The lookbehind (?<=REF\.: ) requires "REF.: " right before the match; \w+ then captures the id
myref = re.search(r"(?<=REF\.: )\w+", html)[0]
print(myref)
# V23T87C88EC
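Note that `re.search` returns `None` when nothing matches, so indexing the result with `[0]` raises a `TypeError` if a description ever lacks a REF. A minimal sketch of a more defensive variant follows. The names `extract_ref` and `REF_PATTERN` are hypothetical, and the pattern assumes the id consists only of letters, digits or underscores:

```python
import re

# Hypothetical names; assumes the id is made of word characters (letters, digits, underscore).
# \s* tolerates zero or more whitespace characters after the colon.
REF_PATTERN = re.compile(r"REF\.:\s*(\w+)")

def extract_ref(description):
    """Return the REF id found in the description text, or None if there is none."""
    match = REF_PATTERN.search(description)
    return match.group(1) if match else None

print(extract_ref("Black Tshirt\nREF.: V23T87C88EC\nCOMPOSIÇÃO:"))  # V23T87C88EC
print(extract_ref("No reference here"))  # None
```

The caller can then decide whether to skip or log the products for which `None` comes back.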