I want to extract information from a div tag which has some specific classes.
The classes are in the format abc def jss238 xyz.
The jss class number keeps changing, so after some time the classes will become abc def jss384 xyz.
What is the best way to extract the information so that the code doesn't break if the tags change as well?
The current code that I am using is:
val = soup.findAll('div', class_="abc def jss328 xyz")
I feel regex could be a good way, but could I also ignore the jss class and search using only the other three?
CodePudding user response:
Yes, you can use a regex to match the pattern abc def <three word characters followed by three digits> xyz.
Personally, I would first see if you can get the data from the source. When classes change like that, it's usually because the page is rendered through JavaScript, which has to fetch the data from somewhere before putting it into the page. If you share the URL and what data you are after, I could check whether that's the case. But here's the regex version:
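from bs4 import BeautifulSoup
import re

# Sample markup. The third div's class list is an assumed placeholder;
# any value that breaks the pattern would do.
html = '''<div class="abc def jss238 xyz">jss238 text</div>
<div class="abc def jss384 xyz">jss384 text</div>
<div class="abc def xyz">doesn't match the pattern</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Raw string so the \w and \d escapes reach the regex engine unchanged.
regex = re.compile(r'abc def \w{3}\d{3} xyz')
specialDivs = soup.find_all('div', {'class': regex})

for each in specialDivs:
    print(f'html: {each}\tText: {each.text}')

Output:
html: <div class="abc def jss238 xyz">jss238 text</div> Text: jss238 text
html: <div class="abc def jss384 xyz">jss384 text</div> Text: jss384 text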
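
As for the second part of the question (searching on the other three classes only), a CSS selector is probably the simplest route. This is a minimal sketch, not something tested against your real page: the sample markup and the stable_divs name are made up for illustration, and it assumes abc, def and xyz stay stable while only the jss class changes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the third div lacks the "def" class on purpose.
html = '''<div class="abc def jss238 xyz">jss238 text</div>
<div class="abc def jss384 xyz">jss384 text</div>
<div class="abc xyz">missing the def class</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Select divs that carry all three stable classes, in any order,
# no matter which jss class (if any) sits alongside them.
stable_divs = soup.select('div.abc.def.xyz')

texts = [div.get_text() for div in stable_divs]
print(texts)  # ['jss238 text', 'jss384 text']
```

Unlike the regex, this does not depend on the order of the classes or on the exact shape of the jss name, so it should keep working if the generated class gets longer or moves position.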