Scraping this page with Python

Time:02-13

I'm trying to scrape the tables that this page returns after you select a date range (e.g., Jan 2015 to Feb 2022): http://vital.minambiente.gov.co/SILPA_UT_PRE/RUIA/ConsultarSancion.aspx?Ubic=ext

When I tried Selenium, I had trouble finding how to click on the pages (1, 2, 3, 4, 5...) at the bottom so it would send me to the next table.

I'm trying this:

driver.find_element(By.XPATH, 'href="javascript:__doPostBack(\'ctl00$ContentPlaceHolder1$grdSanciones\',\'Page$1\')"').click()

But I get an error: InvalidSelectorException: Message: invalid selector: Unable to locate an element with the xpath expression href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdSanciones','Page$1')" because of the following error: TypeError: Failed to execute 'evaluate' on 'Document': The result is not a node set, and therefore cannot be converted to the desired type.

How would you do that? Thanks!

CodePudding user response:

You have to use Selenium, since the page appears to be JavaScript-based.

Good luck.

CodePudding user response:

That's a pretty nasty attribute value to search on due to the nested quotes. It also looks like something that might change frequently.
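For reference, the InvalidSelectorException in the question comes from the expression itself rather than from the quoting: a bare href="..." is a comparison that evaluates to a boolean, while Selenium needs an expression that selects nodes, such as //a[@href="..."]. Below is a minimal sketch of building such an expression. The helper name pager_xpath is hypothetical, and the control ID is copied from the question, so this stops working if the site renames the grid.

```python
# Control ID taken verbatim from the question; it may change if the site is rebuilt.
GRID_ID = "ctl00$ContentPlaceHolder1$grdSanciones"


def pager_xpath(page_number, grid_id=GRID_ID):
    """Build an XPath that selects the pager <a> element for `page_number`.

    The href value contains single quotes, so the XPath string literal is
    wrapped in double quotes. `grid_id` must not contain a double quote.
    """
    href = f"javascript:__doPostBack('{grid_id}','Page${page_number}')"
    return f'//a[@href="{href}"]'


print(pager_xpath(2))
# //a[@href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdSanciones','Page$2')"]

# With a live driver (not run here):
# driver.find_element(By.XPATH, pager_xpath(2)).click()
```

This works, but it ties the scraper to the generated control ID, which is why a shorter selector is usually easier to maintain.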

A better approach might be to use CSS selectors. .GridPager a will select all of those links, for example. And if you keep track of the page you're on you can find the "next" page link based on that.
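One way to find the next link from the current page number is to match on the visible text of each link. This is only a sketch: find_page_link is a hypothetical helper, it looks at numbered links only, and the stand-in objects below merely imitate the .text attribute of the elements Selenium returns.

```python
from types import SimpleNamespace


def find_page_link(links, page_number):
    """Return the first link whose visible text is `page_number`, else None.

    `links` can be any sequence of objects with a `.text` attribute, for
    example the WebElements from
    driver.find_elements(By.CSS_SELECTOR, ".GridPager a").
    """
    wanted = str(page_number)
    for link in links:
        if link.text.strip() == wanted:
            return link
    return None


# Stand-in objects so the lookup can be tried without a browser.
# On page 3 the pager is assumed to link to every visible page except 3.
fake_links = [SimpleNamespace(text=t) for t in ["1", "2", "4", "5"]]
current_page = 3
next_link = find_page_link(fake_links, current_page + 1)
print(next_link.text)  # 4
print(find_page_link(fake_links, 6))  # None
```

A None result means no numbered link for that page is currently visible, so the caller has to decide whether that is the end of the results or whether more pages are hidden.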

Beyond page 10 it mostly just loops, though there is an extra ... link back that we need to account for. This is really a double pagination: pages 1 – 10, then pages 11 – 20, etc.

So something like this:

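from selenium.webdriver.common.by import By

page = 1
offset = 0  # For the "..." back-a-page-of-pages link

while True:
    # Do whatever you want with the data on the current page

    # Re-find the links on every pass: each click triggers a postback that
    # re-renders the pager, so elements found earlier go stale
    navigation_links = driver.find_elements(By.CSS_SELECTOR, ".GridPager a")

    # Next page; on pages 10, 20, ... that is the trailing "..." link (index -1)
    last_in_block = page % 10 == 0
    index = -1 if last_in_block else (page % 10) - 1 + offset
    expected_text = "..." if last_in_block else str(page + 1)
    if not navigation_links or index >= len(navigation_links):
        break  # No link left to follow
    if navigation_links[index].text.strip() != expected_text:
        break  # Not the link we expected, so this was the last page
    navigation_links[index].click()

    page += 1
    offset = 1 if page > 10 else 0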
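The index arithmetic can be sanity-checked without a browser by modeling the pager as described above. The layout in pager_link_texts is an assumption based on that description (every page in the current block of ten is a link except the current one, with a "..." link on either side when other blocks exist), and LAST_PAGE is a made-up total. Pages 10, 20, ... are handled as a separate case that takes the last link, since for page 20 the expression (page % 10) - 1 + offset evaluates to 0, which in this model is the "..." link going back.

```python
def pager_link_texts(current_page, last_page, block_size=10):
    """Model the link texts of the pager (an assumed layout, not scraped data).

    Every page in the current block of `block_size` pages is a link except the
    current one, with a "..." link before and/or after the block when earlier
    or later blocks exist.
    """
    start = (current_page - 1) // block_size * block_size + 1
    end = min(start + block_size - 1, last_page)
    texts = ["..."] if start > 1 else []
    texts += [str(p) for p in range(start, end + 1) if p != current_page]
    if end < last_page:
        texts.append("...")
    return texts


def next_link_index(page):
    """Index of the link that leads from `page` to `page + 1`."""
    if page % 10 == 0:
        return -1  # last page of a block: take the trailing "..." link
    offset = 1 if page > 10 else 0  # skip the leading "..." link
    return (page % 10) - 1 + offset


LAST_PAGE = 25  # hypothetical total, for illustration only
for page in range(1, LAST_PAGE):
    texts = pager_link_texts(page, LAST_PAGE)
    expected = "..." if page % 10 == 0 else str(page + 1)
    assert texts[next_link_index(page)] == expected, (page, texts)
print("index arithmetic matches the modeled pager")
```

If the real pager differs from this model, for example by using a different block size, the same loop will point at the first page where the indices disagree.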