Scraping specific PDFs from different websites

First question here. I need to download one specific PDF from every URL I have: the European Commission proposal, which always appears in the same part of the page.

[Image: the part of the website that I always need in PDF form, the European Commission proposal]

And here is the HTML for that part. The link I am interested in is:

"http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" (this is the PDF that I need, as you can see from the image)

 [<a  href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="externalDocument">COM(2020)0791</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
 <span >
 COM(2020)0791
                </span>
 <span > </span>
 </a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
 <span >
 COM(2020)0791
                </span>
<span > </span>
</a>]

I used the code below, which reads every URL from my CSV file and visits each page to download every PDF. The problem is that this approach also downloads other PDFs that I do not need. I would not mind the extra downloads if I could tell which part of the page each file came from, which is why I am asking how to download only the PDFs from one specific subsection. Distinguishing them by section in the filename would also be fine.

Right now this code gives me about 3000 PDFs, but I need around 1400, one for each link. It would also be easier for me if each file kept the name of its link, but that is not my main worry: the files are saved in the order the URLs appear in the CSV file, so they will be easy to tidy up afterwards.

In short, this code needs to download from only one part of each page instead of all of it:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
#import pandas

#data = pandas.read_csv('urls.csv')
#urls = data['urls'].tolist()

urls = ["http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350", "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2012/0299", "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"]
#url="http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"


folder_location = r'C:\Users\myname\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Matches every link ending in EN.pdf, including the ones I do not need
    for link in soup.select("a[href$='EN.pdf']"):
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)

For example, I did not want to download this follow-up document. Its name starts with COM and ends with EN.pdf, but it has a different year (2018 in this case) because it is a follow-up, as you can see from the link: https://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2018/0564/COM_COM(2018)0564_EN.pdf

CodePudding user response：

The links in your html file all seem to be to the same pdf [or at least they have the same filename], so it'll just be downloading and over-writing the same document. Still, if you just want to target only the first of those links, you could include the class externalDocument in your selector.

 for link in soup.select('a.externalDocument[href$="EN.pdf"]'):

If you want to target a specific event like 'Legislative proposal published', then you could do something like this:

# urls....os.mkdir(folder_location)

evtName = 'Legislative proposal published'

tdSel, spSel, aSel = 'div.ep-table-cell', 'span.ep_name', 'a[href$="EN.pdf"]'
dlSel = f'{tdSel} {tdSel} {tdSel} {spSel}>{aSel}' 
trSel = f'div.ep-table-row:has(>{dlSel}):has(>{tdSel} {tdSel} {spSel})'

for url in urls:
    response = requests.get(url)
    soup= BeautifulSoup(response.text, "html.parser")

    pgPdfLinks = [
        tr.select_one(dlSel).get('href') for tr in soup.select(trSel) if 
        evtName.strip().lower() in 
        tr.select_one(f'{tdSel} {tdSel} {spSel}').get_text().strip().lower()
        ## if you want [case sensitive] exact match, change condition to
        # tr.select_one(f'{tdSel} {tdSel} {spSel}').get_text() == evtName
    ]     
    for link in pgPdfLinks[:1]:
        filename = os.path.join(folder_location, link.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)

[The [:1] of pgPdfLinks[:1] is probably unnecessary since more than one match isn't expected, but it's there if you want to absolutely ensure only one download per page.]

Note: you need to be sure that there will be an event named evtName with a link matching aSel (a[href$="EN.pdf"] in this case) - otherwise, no PDF links will be found and nothing will be downloaded for those pages.
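To make the note above concrete, here is a minimal sketch of how pages without a matching link could be recorded instead of silently skipped. The helper name split_by_match and the dictionary layout are assumptions for illustration, not part of the code above; the idea is that you would collect pgPdfLinks for each url into a dictionary first. The empty list for the second page is made up to show the "no match" case.

```python
def split_by_match(links_by_url):
    """Split page URLs into those with at least one PDF link and those with none.

    links_by_url maps each page URL to the list of PDF links found on it
    (hypothetical layout: one pgPdfLinks list per page).
    """
    # Keep only the first link per page, mirroring pgPdfLinks[:1]
    matched = {url: links[0] for url, links in links_by_url.items() if links}
    # Pages where the selector or the event name matched nothing
    missed = [url for url, links in links_by_url.items() if not links]
    return matched, missed


page_a = "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350"
page_b = "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2012/0299"
pdf_a = ("http://www.europarl.europa.eu/RegData/docs_autres_institutions/"
         "commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf")

# Illustrative data only: the empty list stands for a page with no match
matched, missed = split_by_match({page_a: [pdf_a], page_b: []})

print(len(matched), "page(s) with a link")
for url in missed:
    print("no PDF link found for", url)
```

Printing or saving the missed list lets you re-check those pages by hand, possibly with a different event name.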

You wrote: "if it keeps the name of the link it could be also easier for me"

It's already doing that in your code, since there doesn't seem to be much difference between link['href'].split('/')[-1] and link.get_text().strip(), but if you meant that you want the page link [i.e. the url], you could include the procnum (since that seems to be an identifying part of url) in your filename:

    # for link in...
        procnum = url.replace('?', '&').split('&procnum=')[-1].split('&')[0]
        procnum = ''.join(c if (
            c.isalpha() or c.isdigit() or c in '_-[]'
        ) else ('_' if c == '/' else '') for c in procnum)
        filename = f"proc-{procnum} {link.split('/')[-1]}"
        # filename = f"proc-{procnum} {link['href'].split('/')[-1]}" # in your current code

        filename = os.path.join(folder_location, filename)
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)
            # f.write(requests.get(urljoin(url, link['href'])).content) # in your current code

So, [for example] instead of saving to "COM_COM(2020)0791_EN.pdf", it will save to "proc-OLP_2020_0350 COM_COM(2020)0791_EN.pdf".
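As an alternative to the string splitting above, the procnum could probably also be read with the standard library's query-string parser. This is only a sketch: the function name procnum_slug is made up for the example, and unlike the snippet above it keeps only ASCII letters and digits (plus `_-[]`).

```python
import re
from urllib.parse import urlparse, parse_qs


def procnum_slug(url):
    """Return the procnum query parameter as a filename-safe string ('' if absent)."""
    query = parse_qs(urlparse(url).query)
    procnum = query.get("procnum", [""])[0]
    # Turn path separators into underscores, then drop anything unsafe
    procnum = procnum.replace("/", "_")
    return re.sub(r"[^A-Za-z0-9_\-\[\]]", "", procnum)


url = "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350"
pdf_name = "COM_COM(2020)0791_EN.pdf"

filename = f"proc-{procnum_slug(url)} {pdf_name}"
print(filename)
```

With the example url this builds the same "proc-OLP_2020_0350 COM_COM(2020)0791_EN.pdf" name as before.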

CodePudding user response：

I tried to solve this by adding steps that check which year each PDF comes from and add that year to the filename. The code is below. It is an improvement, but the response above by Driftr95 is much better than mine; anyone who wants to replicate this should use his code.

import requests
import pandas
import os
from urllib.parse import urljoin
from bs4 import BeautifulSoup

data = pandas.read_csv('urls.csv') 
urls = data['url'].tolist()
years = data["yearstr"].tolist()
numbers = data["number"].tolist()

folder_location = r'C:\Users\dario.marino5\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url, year, number in zip(urls, years, numbers):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")     

    for link in soup.select("a[href$='.pdf']"):
        # str() guards against pandas having parsed the year column as an integer
        if str(year) in link['href']:
            # Construct the filename with the number from the CSV file
            filename = f'document_{year}_{number}.pdf'
            filename = os.path.join(folder_location, filename)

            # Download the PDF file and save it to the filename
            with open(filename, 'wb') as f:
                f.write(requests.get(urljoin(url, link['href'])).content)
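If the CSV has no year column, the same check could be sketched by reading the year from the procnum in the page URL and comparing it with the year folder in the PDF link. The function names here are hypothetical, and the sketch assumes the proposal is always dated in the same year as the procedure number, which may not hold for every page. Pairing the 2018 follow-up link with the 2020 page below is for illustration only.

```python
import re
from urllib.parse import urlparse, parse_qs


def procnum_year(page_url):
    """Return the four-digit year in the procnum parameter, e.g. '2020' for OLP/2020/0350."""
    procnum = parse_qs(urlparse(page_url).query).get("procnum", [""])[0]
    match = re.search(r"/(\d{4})/", procnum)
    return match.group(1) if match else None


def href_year(href):
    """Return the year folder that follows '/com/' in a PDF link, or None."""
    match = re.search(r"/com/(\d{4})/", href)
    return match.group(1) if match else None


def same_year(page_url, href):
    """True only when both years are found and they agree."""
    year = procnum_year(page_url)
    return year is not None and year == href_year(href)


page = "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350"
proposal = ("http://www.europarl.europa.eu/RegData/docs_autres_institutions/"
            "commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf")
follow_up = ("https://www.europarl.europa.eu/RegData/docs_autres_institutions/"
             "commission_europeenne/com/2018/0564/COM_COM(2018)0564_EN.pdf")

print(same_year(page, proposal))
print(same_year(page, follow_up))
```

Matching on the `/com/<year>/` folder is a little stricter than testing whether the year string appears anywhere in the link.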