I have a URL with leading whitespace. I have to remove it before passing it to urllib.request.urlretrieve.
pdflink = ' https://www.doj.nh.gov/consumer/security-breaches/documents/a2z-field-services-20201218.pdf'
But I am not able to remove it.
What I have tried so far:
pdflink.lstrip(): not working, and I do not know why
pdflink.replace(' ', ''): not working
Any idea how to remove it?
My final code:
import urllib.request
import openpyxl

def download_pdf(filename='test'):
    wb = openpyxl.load_workbook('Data.xlsx')
    ws = wb['Final']
    pdflink = (ws.cell(row=4487, column=4).value).lstrip()
    # pdflink will have the value shown below:
    # pdflink = ' https://www.doj.nh.gov/consumer/security-breaches/documents/a2z-field-services-20201218.pdf'
    try:
        urllib.request.urlretrieve(pdflink, filename)
        return True
    except FileNotFoundError:
        print(filename + ' Not present')
        return False
Running the above code raises the error: URLError: urlopen error unknown url type: https
Root cause of the error: extra whitespace at the beginning of the URL.
CodePudding user response:
It's not just a space. You have some non-printing special character in there as the first character. I can't tell which one, but when I cut-and-paste from your post, I get an extra character. You might try print(ord(pdflink[0])) to see what it is. You may need to use pdflink = pdflink[2:] to clean it out. Or, search for the http:
i = pdflink.find('http')
pdflink = pdflink[i:]
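As a rough sketch of the diagnosis suggested above, the snippet below prints the code point and Unicode name of the first few characters, then slices from the first 'http'. The sample string is an assumption (U+FEFF followed by a space, as another answer reports), and describe_prefix is a hypothetical helper name, not something from the question.

```python
import unicodedata

# Assumed sample value: an invisible U+FEFF, then a space, then the URL.
pdflink = '\ufeff https://www.doj.nh.gov/consumer/security-breaches/documents/a2z-field-services-20201218.pdf'


def describe_prefix(text, count=3):
    """Return (code point, Unicode name) pairs for the first `count` characters."""
    return [(f'U+{ord(ch):04X}', unicodedata.name(ch, 'UNKNOWN')) for ch in text[:count]]


# Show what is really at the start of the string.
print(describe_prefix(pdflink))

# Slice from the first 'http'; keep the string unchanged if 'http' is absent,
# because find() returns -1 and text[-1:] would keep only the last character.
i = pdflink.find('http')
cleaned = pdflink[i:] if i != -1 else pdflink
print(repr(cleaned[:8]))
```

Slicing by the position of 'http' avoids hard-coding how many junk characters precede the URL, which a fixed pdflink[2:] would depend on.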
CodePudding user response:
There is actually a Unicode character, U+FEFF, in that link prior to the space. You can't see it, but it is breaking your lstrip: called without arguments, lstrip() removes only whitespace, and U+FEFF does not count as whitespace.
I would suggest using pdflink.split(' ')[-1]
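To illustrate why the attempts in the question fail, here is a small sketch under the assumption that the cell value starts with U+FEFF followed by a single space. It checks that the character is not treated as whitespace, that lstrip() and replace(' ', '') both leave it in place, and that the split suggestion recovers the URL.

```python
BOM = '\ufeff'  # the invisible character, assumed to be the first one in the cell
url = 'https://www.doj.nh.gov/consumer/security-breaches/documents/a2z-field-services-20201218.pdf'
pdflink = BOM + ' ' + url

# U+FEFF is not whitespace, so a bare lstrip() stops at it immediately.
print(BOM.isspace())
print(pdflink.lstrip() == pdflink)

# replace() removes the space but leaves U+FEFF glued to the front of the URL.
print(pdflink.replace(' ', '').startswith(BOM))

# Splitting on the space and taking the last piece drops everything before it.
after_split = pdflink.split(' ')[-1]
print(after_split == url)
```

Note that split(' ')[-1] relies on the URL containing no spaces and having no trailing space; with a trailing space the last piece would be an empty string.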
CodePudding user response:
There's possibly some weird character at the beginning of the string. I would try opening the file you're reading with encoding="UTF-8" specified.
I also solved this issue using pdflink.lstrip(" "), where inside the quotes I copy-pasted that non-space character from your original string.
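The copy-paste approach above can also be written with an explicit escape, so the invisible character is visible in the source. This is only a sketch: clean_url and STRIP_CHARS are hypothetical names, and it assumes the unwanted character is U+FEFF, as reported in another answer.

```python
import string
from urllib.parse import urlsplit

# Characters to strip: U+FEFF (assumed culprit) plus ordinary ASCII whitespace.
STRIP_CHARS = '\ufeff' + string.whitespace


def clean_url(raw):
    """Strip U+FEFF and ASCII whitespace from both ends of `raw`."""
    return raw.strip(STRIP_CHARS)


raw = '\ufeff https://www.doj.nh.gov/consumer/security-breaches/documents/a2z-field-services-20201218.pdf'
pdflink = clean_url(raw)

# After cleaning, the scheme parses as plain 'https'.
print(urlsplit(pdflink).scheme)
```

Passing the character set to strip() handles any order or number of leading U+FEFF and whitespace characters, whereas a pasted literal is easy to lose when the code is edited.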