I'm writing a web scraper in Python, using BS4 and Selenium, to scrape user-profile data from a specific website. I'm trying to scrape one particular section of the page, but that section's node has no features that distinguish it from the other section nodes, apart from an ID made up of the word 'ember' followed by a number, like so:
<section id="ember31" >
The section ID can be "ember" followed by two or three digits, and these digits are randomised every time the page is loaded. There are several such sections throughout the page, but I only want to select one.
Hard-coding the ID is fine for scraping one profile, but how can I make sure that my code selects the correct node each time it runs on a new profile?
Thanks in advance.
CodePudding user response:
Try selecting on both the id and the class attributes:
//*[@class="artdeco-card ember-view pv-top-card" and @id="ember31"]
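Since the numeric part of the id changes on every page load, an XPath that matches only the "ember" prefix plus a class may hold up better than one that names the exact id. The following is a minimal sketch, not a tested solution for the real site: the helper name ember_section_xpath is hypothetical, and the class pv-top-card is assumed to be present on the target section.

```python
def ember_section_xpath(extra_class):
    """Hypothetical helper: build an XPath for a <section> whose id starts
    with "ember" and whose class list contains extra_class as a whole token."""
    return (
        '//section[starts-with(@id, "ember") and '
        f'contains(concat(" ", normalize-space(@class), " "), " {extra_class} ")]'
    )


# Assumption: "pv-top-card" is the class that singles out the wanted section.
xpath = ember_section_xpath("pv-top-card")
print(xpath)

# With Selenium, the expression could then be used like this
# (driver is assumed to be an already-open WebDriver):
#   from selenium.webdriver.common.by import By
#   section = driver.find_element(By.XPATH, xpath)
```

Padding the class attribute with spaces before calling contains() keeps a class such as pv-top-card-extra from matching by accident.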
CodePudding user response:
If the id always starts with ember and you are using BeautifulSoup, try a CSS selector. The ^= operator matches attribute values that begin with the given string:
soup.select_one('[id^=ember].artdeco-card.ember-view.pv-top-card')
Example
select is used here instead of select_one to demonstrate that only the specific section is selected.
from bs4 import BeautifulSoup
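# Sample markup; the class values are assumed for illustration.
# Only section #1 has both an "ember..." id and all three classes.
html = '''
<section id="ember31" class="artdeco-card ember-view pv-top-card">section #1</section>
<section id="ember4711" class="artdeco-card ember-view">section #2</section>
<section id="embar123" class="artdeco-card ember-view pv-top-card">section #3</section>
<section id="notember123" class="artdeco-card ember-view pv-top-card">section #4</section>
'''
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')
# Match ids that start with "ember" on elements carrying all three classes.
print(soup.select('[id^=ember].artdeco-card.ember-view.pv-top-card'))
Output
[<section id="ember31" class="artdeco-card ember-view pv-top-card">section #1</section>]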
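If the prefix match turns out to be too loose, the "two or three digits" rule from the question could be enforced with a regular expression passed to find_all. This is only a sketch under assumptions: the sample markup and its class values are made up for illustration, and pv-top-card is assumed to mark the wanted section.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup only; the class values are assumptions.
html = '''
<section id="ember31" class="artdeco-card ember-view pv-top-card">section #1</section>
<section id="ember4711" class="artdeco-card ember-view pv-top-card">section #2</section>
<section id="ember123" class="artdeco-card ember-view">section #3</section>
'''

soup = BeautifulSoup(html, "html.parser")

# "ember" followed by exactly two or three digits, anchored at both ends.
ember_id = re.compile(r"^ember\d{2,3}$")

# Both filters must hold: the id pattern and the pv-top-card class.
matches = soup.find_all("section", id=ember_id, class_="pv-top-card")
for section in matches:
    print(section["id"], section.get_text())
```

Here section #2 is rejected because its id has four digits, and section #3 because it lacks the pv-top-card class, so only section #1 remains.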