Get text out of tags with Python and Selenium

Time:11-15

I have been trying to scrape a webpage with Python and Selenium and ran into a problem. The page shows its information in a paginated table, and I want to collect the information from every page. This is the HTML of the pagination control when I am on a page other than the last one (page 2 in this case):

<span >
   " ["
   <a href="?page=1">First</a>
   "/"
   <a href="?page=2">Previous</a>
   "] "
   <a href="?page=1" title="Go to page 1">1</a>
   ", "
   <strong>2</strong>
   ", "
   <a href="?page=3" title="Go to page 3">3</a>
   " ["
   <a href="?page=3">Next</a>
   "/"
   <a href="?page=3">Last</a>
   "] "
</span>

And this is the HTML I get when I reach the last page (page 3 in this case):

<span >
   " ["
   <a href="?page=1">First</a>
   "/"
   <a href="?page=2">Previous</a>
   "] "
   <a href="?page=1" title="Go to page 1">1</a>
   ", "
   <a href="?page=2" title="Go to page 2">2</a>
   ", "
   <strong>3</strong>
   " [Next/Last]"
</span>

In this case, page 3 is selected and appears as <strong>, but this changes depending on the current page.

To detect the last page, I want to check whether the text "[Next/Last]" comes directly after the <strong> tag, so that I can stop the while loop that retrieves the information. Since this text sits outside any tag, I couldn't find a way to check it. How can I do that?

CodePudding user response:

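We can look for an <a> element that has an href attribute and "Next" as its text content. On the last page "Next" is plain text rather than a link, so no such element exists there. The same can be done for the "Last" text.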
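To see why checking for the link works where checking the text would not, here is a small sketch that uses only Python's standard library and no browser. It assumes the quoted strings in the question's HTML are just how the browser inspector displays bare text nodes; the names MIDDLE_PAGE, LAST_PAGE, visible_tail and has_next_link are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two pagination snippets from the question.
# Attributes that this sketch does not need are left out.
MIDDLE_PAGE = (
    '<span> [<a href="?page=1">First</a>/<a href="?page=2">Previous</a>] '
    '<a href="?page=1">1</a>, <strong>2</strong>, <a href="?page=3">3</a>'
    ' [<a href="?page=3">Next</a>/<a href="?page=3">Last</a>] </span>'
)
LAST_PAGE = (
    '<span> [<a href="?page=1">First</a>/<a href="?page=2">Previous</a>] '
    '<a href="?page=1">1</a>, <a href="?page=2">2</a>, <strong>3</strong>'
    ' [Next/Last]</span>'
)


def visible_tail(html):
    """Return the last 11 visible characters of the pagination text."""
    span = ET.fromstring(html)
    return "".join(span.itertext()).strip()[-11:]


def has_next_link(html):
    """Return True if an <a> with an href and the text 'Next' exists."""
    span = ET.fromstring(html)
    return any(
        a.get("href") and "Next" in (a.text or "")
        for a in span.iter("a")
    )


print(visible_tail(MIDDLE_PAGE), has_next_link(MIDDLE_PAGE))  # [Next/Last] True
print(visible_tail(LAST_PAGE), has_next_link(LAST_PAGE))      # [Next/Last] False
```

The visible text ends in "[Next/Last]" on both pages, so comparing text cannot tell them apart. Only the presence of the Next link differs.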

With Selenium / Python you can simply use this line:

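from selenium.webdriver.common.by import By

# Assumes the pagination span has class 'pagelinks'; adjust the attribute to match your page.
if driver.find_elements(By.XPATH, "//span[@class='pagelinks']//a[@href][contains(text(),'Next')]"):
    # A "Next" link exists, so this is not the last page yet: do the work here.
    pass  # On the last page the list is empty and this block is skipped.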
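The same check can drive the while loop from the question. The following is only a sketch under a few assumptions: scrape_all_pages and scrape_page are hypothetical names, the pagination span is assumed to carry class 'pagelinks', and the locator uses the plain string "xpath" (the value of Selenium's By.XPATH) so the function has no import of its own.

```python
# Locator for a clickable "Next" link. "xpath" is the string value of
# selenium's By.XPATH. The class name 'pagelinks' is an assumption.
NEXT_LINK = (
    "xpath",
    "//span[@class='pagelinks']//a[@href][contains(text(),'Next')]",
)


def scrape_all_pages(driver, scrape_page):
    """Call scrape_page(driver) on each page, following 'Next' until it is gone."""
    results = []
    while True:
        results.append(scrape_page(driver))
        # Look the link up again on every page so a stale element is never reused.
        next_links = driver.find_elements(*NEXT_LINK)
        if not next_links:
            break  # no "Next" link: this was the last page
        next_links[0].click()
    return results
```

Here scrape_page would be your own function that reads the table on the current page. If the site loads slowly, you may need an explicit wait after the click before scraping the next page.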