What is the most efficient way to extract and process files from the link?


I am able to collect all the links I need, but these links redirect me to pages that contain further links to PDF files, which I then have to open and process. I do not know what the most efficient way to do this is.

import requests
from bs4 import BeautifulSoup
import re
 
url = 'https://oeil.secure.europarl.europa.eu/oeil/popups/ficheprocedure.do?reference=2014/0124(COD)&l=en'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
 
urls = []
for link in soup.find_all("a", href=re.compile("AM")):
    # collect the href of every matching link and print it
    urls.append(link.get('href'))
    print(link.get('href'))
 

Output:

https://www.europarl.europa.eu/doceo/document/EMPL-AM-544465_EN.html
https://www.europarl.europa.eu/doceo/document/EMPL-AM-541655_EN.html
https://www.europarl.europa.eu/doceo/document/EMPL-AM-551891_EN.html
https://www.europarl.europa.eu/doceo/document/EMPL-AM-544465_EN.html
https://www.europarl.europa.eu/doceo/document/EMPL-AM-541655_EN.html
https://www.europarl.europa.eu/doceo/document/EMPL-AM-551891_EN.html

CodePudding user response:

For each link that you crawl from the main URL, you need to do exactly the same as before (request, BeautifulSoup, extract hrefs):

  • Then check whether the href of each link ends with ".pdf".
  • If the href is a relative path to the pdf file, use urllib to extract the domain from the page URL and join the domain with the pdf path, e.g.:

from urllib.parse import urlparse

domain = urlparse("https://www.europarl.europa.eu/doceo/document/EMPL-AM-544465_EN.html").netloc

  • Do another GET request to retrieve the content of the pdf file.
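
Putting those steps together, a minimal sketch might look like the following (a sketch only, assuming the intermediate pages are static HTML and expose their PDFs through ordinary <a> tags; urljoin handles both relative and absolute hrefs):

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = 'https://oeil.secure.europarl.europa.eu/oeil/popups/ficheprocedure.do?reference=2014/0124(COD)&l=en'
soup = BeautifulSoup(requests.get(start_url).text, 'html.parser')

# Intermediate pages found on the main page (a set removes the duplicates seen in the output above)
page_urls = {link['href'] for link in soup.find_all('a', href=re.compile('AM'))}

for page_url in page_urls:
    page_soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    for a in page_soup.find_all('a', href=True):
        if a['href'].lower().endswith('.pdf'):
            # urljoin resolves a relative href against the page it came from
            pdf_url = urljoin(page_url, a['href'])
            pdf_bytes = requests.get(pdf_url).content
            # save each pdf under its own name, taken from the last part of the URL
            with open(pdf_url.rsplit('/', 1)[-1], 'wb') as f:
                f.write(pdf_bytes)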

CodePudding user response:

You've done it right. You just need a simple modification:

Solution for this specific problem

You have done the hardest part. Just replace ".html" with ".pdf" and then download the file in a manner like this:

for link in soup.find_all("a", href=re.compile("AM")):
    page_url = str(link.get('href'))
    pdf_url = page_url.replace('.html', '.pdf')
    # request the pdf itself, not the original page
    pdf_response = requests.get(pdf_url)
    # use a distinct filename per pdf so each download is not overwritten
    filename = pdf_url.rsplit('/', 1)[-1]
    with open('/blah_blah_blah/' + filename, 'wb') as f:
        f.write(pdf_response.content)

What was the problem?

Dear friends, many websites have a dedicated webpage for each of their files, usually for "SEO" purposes. The good point is that there is often a relation between the page's link and the target file's link.

For example here we have:

https://www.europarl.europa.eu/doceo/document/EMPL-AM-544465_EN.html

And when we check that page we find that the PDF's URL is:

https://www.europarl.europa.eu/doceo/document/EMPL-AM-544465_EN.pdf

So you just need to change the ".html" tail to ".pdf" and then go for a download.

What to do if there is no relation?

Then you should do exactly what a human does: open/fetch the page, extract the content link explicitly, and then download the target file, as sketched below.
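
For example, a small helper along these lines (a sketch only; find_pdf_links is a name chosen here, and it assumes the PDFs are exposed as plain <a href> links on the page):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_pdf_links(page_url):
    """Fetch a page and return the absolute URLs of all pdf links on it."""
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    return [urljoin(page_url, a['href'])
            for a in soup.find_all('a', href=True)
            if a['href'].lower().endswith('.pdf')]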

Do we have even harder situations?

Yes. Sometimes the website uses AJAX or similar operations to fetch the link. In those cases I suggest following the behavior of the browser (e.g. the network tab of its developer tools) and checking whether there is a pattern between the request (usually an API call) and the content.

But there are even harder situations in which you will need to use alternatives like Selenium.
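
As a rough illustration of the Selenium route (a sketch only, assuming Chrome with a matching driver is available; the CSS selector simply grabs every link ending in ".pdf" after the page has rendered):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.europarl.europa.eu/doceo/document/EMPL-AM-544465_EN.html')

# Collect the hrefs of all rendered links that point to a pdf
pdf_urls = [a.get_attribute('href')
            for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")]

driver.quit()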
