Python beautifulsoup not able to extract a hyperlink from href tag

Time:01-13

I am trying to scrape data from a website. The page links to an Excel sheet through the href of an a tag. I have tried several approaches with requests and BeautifulSoup, but I cannot get the link to the Excel sheet.

Website URL: (screenshot, not reproduced here)

After inspecting the element, I get the details below: (screenshot, not reproduced here)

I have tried the code below, but every time I run it I get all the links except the one for this xlsx file.

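import re
import urllib.request  # a plain "import urllib" does not load the request submodule

from bs4 import BeautifulSoup

# url must hold the address of the page being scraped (not given in the post)
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, "html.parser")

# Collect every anchor whose href starts with https://
links = []
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
    links.append(link.get('href'))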

The output contains every link except the Excel file URL mentioned above. The URL changes daily, so I need to scrape it by matching on https or on xlsx. I also tried: for link in soup.find_all(attrs={'href': re.compile("xlsx")}). Can anyone help me get this URL?

The expected output is the URL of the Excel file: https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude Oil FOB Price (Indian Basket).xlsx

CodePudding user response:

The data comes from an XHR request and is rendered dynamically by the browser, so it is not in the HTML that BeautifulSoup receives. Check your browser's dev tools to see the request and its payload data. The best approach is to send the same request yourself and read the data as JSON.

Example

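import requests

url = 'https://ppac.gov.in/AjaxController/getInternationalPricesCrudeOil'

# Send the same form data the page posts, then read the file URL from the JSON
response = requests.post(
    url,
    data={
        'financialYear': '2022-2023',
        'reportBy': 4,
        'pageId': 30
    })
print(response.json()['result']['1']['file_name'])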

Output

https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude Oil FOB Price (Indian Basket).xlsx
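Since the file URL changes daily, the key '1' used above may not always be the entry you want, and the returned URL contains spaces. The following is a minimal sketch, not part of the original answer: it walks the decoded JSON and collects every string ending in .xlsx, then percent-encodes the result. The helper names find_xlsx_links and encode_url are hypothetical, and the sample dictionary only mirrors the result / '1' / file_name shape shown above, so the real response may contain other fields.

```python
from urllib.parse import quote


def find_xlsx_links(payload):
    """Recursively collect every string ending in .xlsx from decoded JSON."""
    found = []
    if isinstance(payload, dict):
        for value in payload.values():
            found.extend(find_xlsx_links(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(find_xlsx_links(item))
    elif isinstance(payload, str) and payload.lower().endswith('.xlsx'):
        found.append(payload)
    return found


def encode_url(url):
    """Percent-encode spaces and similar characters, keeping ':', '/', '(' and ')'."""
    return quote(url, safe=':/()')


# Hypothetical sample that mirrors the shape used in the answer above
sample = {
    'result': {
        '1': {
            'file_name': 'https://ppac.gov.in/uploads/reports/'
                         '1673497201_english_1_Crude Oil FOB Price (Indian Basket).xlsx'
        }
    }
}

links = find_xlsx_links(sample)
encoded = [encode_url(link) for link in links]
print(encoded[0])
```

With a live response you would pass requests.post(...).json() to find_xlsx_links instead of the sample. Encoding is optional when the URL goes straight back into requests, but it helps if the link is logged or handed to another tool.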

CodePudding user response:

Does the button's position change? If not, I'd use XPath to click it or to read the URL. It looks like calling driver.find_element(By.XPATH, <the button's XPath>).click() on it with Selenium would start the download.
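The XPath idea could be sketched roughly as follows, assuming Selenium 4 and a local Chrome installation. The locator in XLSX_XPATH is an assumption (any anchor whose href contains .xlsx), not the button's real XPath, and get_xlsx_href is a hypothetical helper name. It waits for the dynamically rendered link and returns its href rather than clicking it.

```python
# Assumed locator: first anchor whose href contains ".xlsx" (not taken from the page)
XLSX_XPATH = "//a[contains(@href, '.xlsx')]"


def get_xlsx_href(page_url, timeout=10):
    """Open page_url in Chrome and return the href of the first .xlsx link."""
    # Imported here so this file still loads where Selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(page_url)
        # Wait until the browser has rendered the link into the DOM
        link = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, XLSX_XPATH))
        )
        return link.get_attribute('href')
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
```

Call get_xlsx_href with the address of the page that shows the download button. This starts a full browser, so it is slower than posting the XHR request directly, but it does not depend on knowing the request payload.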
