I am trying to scrape data from a website. The page links to an Excel sheet through the href of an a tag. I have tried several approaches with requests and BeautifulSoup, but I cannot get the link to the Excel sheet.
After inspecting the element, I get the details below:
I tried the code below, but every time I run it I get all the links except the one for this xlsx file.
from bs4 import BeautifulSoup
import urllib.request
import re
import requests

# url holds the address of the page being scraped (not shown here)
html_page = urllib.request.urlopen(url)
links = []
soup = BeautifulSoup(html_page, "html.parser")
# collect every <a> tag whose href starts with https://
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
    links.append(link.get('href'))
The output contains all the links except the URL of the Excel file mentioned above.
Can anyone help me get that URL? It changes daily, so I need to scrape it by matching on https or xlsx. I also tried for link in soup.find_all(attrs={'href': re.compile("xlsx")}), with the same result.
The expected output is the URL of the Excel file: https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude Oil FOB Price (Indian Basket).xlsx
CodePudding user response:
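The data comes from an XHR request and is rendered dynamically by the browser, so it is not in the HTML that urllib or requests downloads, and BeautifulSoup never sees the link. Check the Network tab of your browser's dev tools to find the request and its payload. The simplest approach is to send the same request yourself and read the data as JSON.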
Example
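import requests

url = 'https://ppac.gov.in/AjaxController/getInternationalPricesCrudeOil'
# POST the same form data the page sends (values taken from the browser's dev tools)
response = requests.post(
    url,
    data={
        'financialYear': '2022-2023',
        'reportBy': 4,
        'pageId': 30
    })
print(response.json()['result']['1']['file_name'])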
Output
https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude Oil FOB Price (Indian Basket).xlsx
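Since the file changes daily, it may be safer not to hard-code the '1' key. Here is a minimal sketch, assuming the JSON keeps the shape used above (a 'result' object whose entries each carry a 'file_name'). The helper name xlsx_urls, the sample payload, and the example.pdf entry are made up for illustration and are not part of the site's API:

```python
def xlsx_urls(payload):
    """Return every 'file_name' value under payload['result'] that ends in .xlsx."""
    result = payload.get("result", {})
    # Assumption: 'result' is a dict keyed by '1', '2', ...; a plain list is accepted too.
    entries = result.values() if isinstance(result, dict) else result
    urls = []
    for entry in entries:
        name = entry.get("file_name", "") if isinstance(entry, dict) else ""
        if name.lower().endswith(".xlsx"):
            urls.append(name)
    return urls


# Made-up payload that mimics the shape used in the example above.
sample = {
    "result": {
        "1": {"file_name": "https://ppac.gov.in/uploads/reports/1673497201_english_1_Crude Oil FOB Price (Indian Basket).xlsx"},
        "2": {"file_name": "https://ppac.gov.in/uploads/reports/example.pdf"},
    }
}
print(xlsx_urls(sample))
```

With the real response you would pass requests.post(...).json() instead of sample. If the actual JSON is shaped differently, the lookup needs adjusting.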
CodePudding user response:
Does the button position change? If not, I'd use Selenium with an XPath to click the button or retrieve the URL. Something like driver.find_element(By.XPATH, <the button's XPath>).click() should start the download.