How do I filter out files in Beautiful Soup


import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://papers.gceguide.com/A Levels/Physics (9702)/2015/"

folder_location = 'C:\\Users\\'  # a raw string cannot end in a backslash
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

How do I filter out the unnecessary files and download only the PDF files whose names contain 'qp_2'?

CodePudding user response:

To download any PDF whose filename contains qp_2, you can use the following example:

import requests
from bs4 import BeautifulSoup


url = "https://papers.gceguide.com/A Levels/Physics (9702)/2015/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for n in soup.select('a.name[href*="qp_2"]'):
    print("Downloading", n.text)
    with open(n.text, "wb") as f_out:
        r = requests.get(url + n.text)
        f_out.write(r.content)

Prints and downloads the files:

Downloading 9702_s15_qp_21.pdf
Downloading 9702_s15_qp_22.pdf
Downloading 9702_s15_qp_23.pdf
Downloading 9702_w15_qp_21.pdf
Downloading 9702_w15_qp_22.pdf
Downloading 9702_w15_qp_23.pdf

CodePudding user response:

Select your links more specifically by checking for both qp_2 and .pdf in your CSS selector:

soup.select("a[href*='qp_2'][href$='.pdf']")
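The selector combines two attribute conditions: [href*='qp_2'] matches when the href contains the substring qp_2, and [href$='.pdf'] matches when it ends with .pdf. A minimal sketch against a made-up inline document (the filenames here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for the directory listing page; these anchors are fabricated.
html = """
<a href="9702_s15_qp_21.pdf">9702_s15_qp_21.pdf</a>
<a href="9702_s15_ms_21.pdf">9702_s15_ms_21.pdf</a>
<a href="9702_s15_qp_21.zip">9702_s15_qp_21.zip</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Only the link that both contains 'qp_2' and ends with '.pdf' survives.
links = [a["href"] for a in soup.select("a[href*='qp_2'][href$='.pdf']")]
print(links)  # -> ['9702_s15_qp_21.pdf']
```

The mark-scheme file (ms) fails the substring test and the zip fails the suffix test, so both are excluded without any extra Python-side checks.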

Alternatively, you can double-check while iterating:

for a in soup.select("a[href*='qp_2']"):
    if a['href'].endswith('.pdf'):
        with open(a['href'], "wb") as f_out:
            r = requests.get(url + a['href'])
            f_out.write(r.content)

Example

import requests
from bs4 import BeautifulSoup


url = "https://papers.gceguide.com/A Levels/Physics (9702)/2015/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("a[href*='qp_2'][href$='.pdf']"):
    with open(a['href'], "wb") as f_out:
        r = requests.get(url + a['href'])
        f_out.write(r.content)
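If you prefer not to use CSS selectors, BeautifulSoup's find_all also accepts a callable as an attribute filter, which lets you express the same condition in plain Python. A sketch over fabricated sample HTML:

```python
from bs4 import BeautifulSoup

# Invented anchors standing in for the real listing page.
html = """
<a href="9702_s15_qp_21.pdf">qp 21</a>
<a href="9702_s15_ms_21.pdf">ms 21</a>
<a href="notes.txt">notes</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The lambda receives each tag's href value (or None if the tag has no href).
matches = [
    a["href"]
    for a in soup.find_all("a", href=lambda h: h and "qp_2" in h and h.endswith(".pdf"))
]
print(matches)  # -> ['9702_s15_qp_21.pdf']
```

This is equivalent to the selector above, but the callable form is handy when the filtering logic gets too complex for a CSS expression.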