Downloading PDFs from CAG

I am trying to download multiple PDFs from the CAG website (https://cag.gov.in/en/state-accounts-report?defuat_state_id=64). I am using the following code:

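import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://cag.gov.in/en/state-accounts-report?defuat_state_id=64'
folder_location = 'pdfs'  # assumed output folder; not defined in the original snippet
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# List every PDF link found on the page
for link in soup.select("a[href$='.pdf']"):
    print(link)

# Download every linked PDF into folder_location
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)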

This gives me every PDF on the whole page, but I want to download only the PDFs under the 'Monthly Key Indicators' tab. Please suggest the changes needed in the code to do that.

CodePudding user response:

You could narrow the selection to the tab that holds the links you want. The tab's id can be found with:

tabId = soup.find(
    lambda t: t.name == 'a' and t.get('href') and 
    t.get('href').startswith('#tab') and # just in case
    'Monthly Key Indicators' == t.get_text(strip=True)
).get('href')

(Or, if the id is always the same, you can simply set tabId = "#tab-360".) Then change your selection to:

soup.select(f"{tabId} a[href$='.pdf']")
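If you want to check the idea without hitting the site, here is a minimal sketch that runs the tab lookup and the scoped selection on a small made-up page. The markup, the tab ids and the find_tab_id helper are all assumptions for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<ul>
  <li><a href="#tab-1">Annual Accounts</a></li>
  <li><a href="#tab-2">Monthly Key Indicators</a></li>
</ul>
<div id="tab-1"><a href="/files/annual.pdf">Download</a></div>
<div id="tab-2">
  <a href="/files/april.pdf">View</a>
  <a href="/files/april.pdf">Download</a>
  <a href="/files/may.pdf">Download</a>
</div>
"""

def find_tab_id(soup, label):
    """Return the '#...' href of the tab link whose text equals label, or None."""
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("#") and a.get_text(strip=True) == label:
            return a["href"]
    return None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tab_id = find_tab_id(soup, "Monthly Key Indicators")
if tab_id is None:
    raise SystemExit("tab not found")

# Only PDF links inside the element whose id matches the tab are selected.
tab_links = soup.select(f"{tab_id} a[href$='.pdf']")
hrefs = [a["href"] for a in tab_links]
print(hrefs)
```

Returning None instead of calling .get('href') on a missing tag makes it easier to see when the tab label has changed on the site.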

But aren't you downloading the same file three times for each report? You could alter your for loop to download only from the links whose text is "Download":

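pdfLinks = soup.select(f"{tabId} a[href$='.pdf']")
pdfLinks = [pl for pl in pdfLinks if pl.get_text(strip=True) == 'Download']
for link in pdfLinks:
    # download each file, same as in the question's loop
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)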
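If the link text is not reliable, another option is to skip repeated URLs instead of filtering on "Download". This is only a sketch of that alternative: plan_downloads is a hypothetical helper, and the href values and the "pdfs" folder are made up for the example.

```python
import os
from urllib.parse import urljoin, urlsplit

def plan_downloads(base_url, hrefs, folder):
    """Map each unique absolute PDF URL to a local file path, keeping first-seen order."""
    plan = {}
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute in plan:
            continue  # same file linked more than once
        name = os.path.basename(urlsplit(absolute).path)
        plan[absolute] = os.path.join(folder, name)
    return plan

base = "https://cag.gov.in/en/state-accounts-report?defuat_state_id=64"
hrefs = ["/uploads/a.pdf", "/uploads/a.pdf", "/uploads/b.pdf"]  # hypothetical paths
plan = plan_downloads(base, hrefs, "pdfs")
for pdf_url, path in plan.items():
    print(pdf_url, "->", path)
```

Each key of plan can then be fetched with requests.get and written to the matching path, so every file is requested only once.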