I am trying to use wget to download a PDF file. I have a direct link to the PDF document and enter the following on the command line:
wget -A pdf -nc -np -nd --content-disposition --wait=1 --tries=5 "https://prospektbestellung.nordseetourismus.de/mediafiles/Sonstiges/Ortsprospekte/amrum2021.pdf"
This uses a lot of unnecessary options, but they should not mess with the outcome, which is:
HTTP request sent, awaiting response... Read error (Unknown error) in headers.
Is there any way to fix this directly using wget or are there any other solutions, preferably in Python, which I could consider?
CodePudding user response:
Your one-liner works for me. I successfully downloaded the PDF with it:
wget -A pdf -nc -np -nd --content-disposition --wait=1 --tries=5 "https://prospektbestellung.nordseetourismus.de/mediafiles/Sonstiges/Ortsprospekte/amrum2021.pdf"
I believe there is a network or firewall issue on your side.
CodePudding user response:
wget sends its own request headers, and the one most likely to differ from a browser's in a way that matters is the User-Agent.
You can copy the User-Agent from your browser, or pick one online, and set it as a header on the request, for example with wget's --user-agent option.
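Since the question also asks for Python, here is a minimal sketch of the same idea using only the standard library. The User-Agent string and the helper names build_request and download are illustrative assumptions, and it is not confirmed that this server actually filters on User-Agent.

```python
import shutil
import urllib.request

# Example browser-like User-Agent (an assumption; copy the one your browser sends).
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest, user_agent=USER_AGENT, timeout=30):
    """Fetch url with the custom User-Agent and stream the body to dest."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(dest, "wb") as f:
        shutil.copyfileobj(resp, f)
```

Calling download(url, "amrum2021.pdf") with the URL from the question would then save the file in the current working directory, provided the server accepts the request.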
CodePudding user response:
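A Python-based solution using requests:
import requests
url = 'https://prospektbestellung.nordseetourismus.de/mediafiles/Sonstiges/Ortsprospekte/amrum2021.pdf'
# A timeout keeps the call from hanging forever on a stalled connection.
r = requests.get(url, timeout=30)
# Raise on HTTP errors instead of silently saving an error page.
r.raise_for_status()
with open('my_file.pdf', 'wb') as f:
    f.write(r.content)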
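A request can succeed at the HTTP level and still return something other than a PDF, such as an HTML error or consent page. A small check of the leading bytes can catch that before the file is written. This is only a sketch: looks_like_pdf and save_if_pdf are hypothetical helper names, and the check relies on PDF files normally beginning with the bytes %PDF-.

```python
PDF_MAGIC = b"%PDF-"  # signature that normally opens a PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    return data.startswith(PDF_MAGIC)


def save_if_pdf(data: bytes, path: str) -> bool:
    """Write data to path only if it looks like a PDF; report whether it was written."""
    if not looks_like_pdf(data):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True
```

With the requests response from above, this would be used as save_if_pdf(r.content, 'my_file.pdf').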
CodePudding user response:
"any other solutions, preferably in Python, which I could consider?"
You might use urllib.request.urlretrieve from the standard-library module urllib.request, as follows:
import urllib.request
urllib.request.urlretrieve("https://prospektbestellung.nordseetourismus.de/mediafiles/Sonstiges/Ortsprospekte/amrum2021.pdf","amrum2021.pdf")
This code downloads the file and saves it as amrum2021.pdf in the current working directory. Unlike requests, urllib.request is part of the standard library, so nothing needs to be installed beyond Python itself.
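If the read error is intermittent, a retry loop roughly equivalent to wget's --tries=5 --wait=1 may help. The following is a sketch under that assumption; fetch_with_retries is a hypothetical helper, and the fetch and sleep parameters exist only so the loop can be exercised without touching the network.

```python
import http.client
import time
import urllib.request


def fetch_with_retries(url, dest, tries=5, wait=1.0,
                       fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Download url to dest, retrying on network errors.

    Returns the number of the attempt that succeeded and re-raises
    the last error if every attempt fails.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    last_error = None
    for attempt in range(1, tries + 1):
        try:
            fetch(url, dest)
            return attempt
        except (OSError, http.client.HTTPException) as exc:
            # urllib.error.URLError and socket errors are OSError subclasses.
            last_error = exc
            if attempt < tries:
                sleep(wait)  # pause before the next attempt
    raise last_error
```

For the file in the question this would be called as fetch_with_retries(url, "amrum2021.pdf"), with url set to the link above.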