I am unable to web-scrape a URL from a hyperlink in this website. | Web-scrape, Python

Time:10-09

I am unable to get the links from this website. I want to get the Drive links from each hyperlink.

I want to get all of the Google Drive links from the hyperlinks in the left column.

CodePudding user response:

  1. Selenium solution:
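from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
# options.add_argument('--headless')  # uncomment to run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/")
time.sleep(3)  # crude wait for the JavaScript-rendered table to load
# find_elements_by_css_selector was removed in Selenium 4.3; use find_elements with By instead
links = [i.get_attribute('href') for i in driver.find_elements(By.CSS_SELECTOR, "td p a")]
print(links)
driver.quit()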

If you have not used Selenium before, do not forget to install chromedriver (Selenium 4.6 and later can fetch a matching driver automatically through Selenium Manager).

  2. requests and urllib.parse solution (inspired by the comment of @tromgy):
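import requests
import bs4
from urllib.parse import unquote

req = requests.get('https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/')
soup = bs4.BeautifulSoup(req.text, "lxml")
# The table is not in the static HTML; it sits percent-encoded inside this script tag.
# [23:] skips the first 23 characters, which precede the encoded payload.
url = unquote(soup.select_one("script#site-injection").string[23:].strip())

# Parse the decoded payload as HTML and collect the anchors inside table cells
soup = bs4.BeautifulSoup(url, "lxml")
# [2:-2] trims the two stray characters left on each end of every href after decoding
links = [i['href'][2:-2] for i in soup.select("td p a")]
print(links)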
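
To see what the unquote step does in isolation, here is a minimal sketch. The encoded string is a made-up fragment standing in for the script payload, not the actual content of the page, and example.com is a placeholder URL.

```python
from urllib.parse import unquote

# Hypothetical percent-encoded HTML fragment (not taken from the real page)
encoded = (
    "%3Ctd%3E%3Cp%3E%3Ca%20href%3D%22https%3A%2F%2Fexample.com%2Ffile%22%3E"
    "Link%3C%2Fa%3E%3C%2Fp%3E%3C%2Ftd%3E"
)

# unquote turns each %XX escape back into the character it encodes
decoded = unquote(encoded)
print(decoded)
```

Once decoded, the fragment is ordinary HTML, which is why it can be passed to BeautifulSoup a second time.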

If you want only the left column of the table, use td:first-of-type p a as the CSS selector. It matches links inside the first td element of each row and ignores the second (right) one.
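
The difference between the two selectors can be checked on a small table. This is only a sketch: the HTML below is invented for illustration (the real page's markup may differ), and it uses Python's built-in html.parser so that lxml is not required.

```python
import bs4

# Made-up two-column table; the URLs are placeholders
html = """
<table>
  <tr>
    <td><p><a href="https://example.com/left-1">Left 1</a></p></td>
    <td><p><a href="https://example.com/right-1">Right 1</a></p></td>
  </tr>
  <tr>
    <td><p><a href="https://example.com/left-2">Left 2</a></p></td>
    <td><p><a href="https://example.com/right-2">Right 2</a></p></td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# "td p a" matches links in every cell of every row
all_links = [a["href"] for a in soup.select("td p a")]

# "td:first-of-type p a" matches only links inside the first td of each row
left_links = [a["href"] for a in soup.select("td:first-of-type p a")]

print(all_links)
print(left_links)
```

The same selector string should also work in the Selenium version, since it is standard CSS rather than something specific to BeautifulSoup.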
