Removing white space from a URL in Python

I have a URL that has white space at the beginning. I have to remove it before passing the URL to urllib.request.urlretrieve.

pdflink = ' https://www.doj.nh.gov/consumer/security-breaches/documents/a2z-field-services-20201218.pdf'

But I am not able to remove it.

What I have tried so far:

pdflink.lstrip() : not working, and I do not know why
pdflink.replace(' ', '') : not working

Any idea how to remove it?

My final code:

import urllib.request

import openpyxl

wb = openpyxl.load_workbook('Data.xlsx')
ws = wb['Final']

pdflink = (ws.cell(row=4487, column=4).value).lstrip()

# pdflink will have the value shown below:
# pdflink = ' https://www.doj.nh.gov/consumer/security-breaches/documents/a2z-field-services-20201218.pdf'

def download(pdflink, filename='test'):
    # Save the PDF under `filename` and report whether the download succeeded.
    try:
        urllib.request.urlretrieve(pdflink, filename)
        return True
    except FileNotFoundError:
        print(filename + ' Not present')
        return False

download(pdflink)

Running the code above raises: URLError: <urlopen error unknown url type: https>

Root cause of the error: an extra white space character at the beginning of the URL.

CodePudding user response：

It's not just a space. The first character is a non-printing special character. I can't tell which one, but when I cut and paste from your post, I get an extra character. Try print(ord(pdflink[0])) to see what it is. If it turns out to be one hidden character followed by a space, pdflink = pdflink[2:] will clean it out. Or search for the http:

    i = pdflink.find('http')
    pdflink = pdflink[i:]
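
To make the diagnosis above concrete, here is a minimal sketch that prints the code points of the leading characters and then slices from the scheme. The sample string is an assumption for illustration (a U+FEFF character followed by a space); the real cell value may start with something else. The helper names describe_prefix and from_scheme are hypothetical.

```python
import unicodedata

# Assumed stand-in for the cell value: U+FEFF, then a space, then the URL.
pdflink = ('\ufeff https://www.doj.nh.gov/consumer/security-breaches/'
           'documents/a2z-field-services-20201218.pdf')


def describe_prefix(text, count=3):
    """Return (code point, Unicode name) pairs for the first `count` characters."""
    return [(f'U+{ord(ch):04X}', unicodedata.name(ch, 'UNKNOWN'))
            for ch in text[:count]]


def from_scheme(text):
    """Drop everything before the first 'http'; leave the text alone if absent."""
    i = text.find('http')
    return text[i:] if i != -1 else text


# Show what is really at the start of the string.
for code_point, name in describe_prefix(pdflink):
    print(code_point, name)

clean = from_scheme(pdflink)
print(repr(clean[:8]))
```

Checking for -1 matters here: if 'http' is not found, find returns -1 and pdflink[-1:] would silently keep only the last character.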

CodePudding user response：

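There is actually a Unicode character U+FEFF (the byte order mark, also called a zero-width no-break space) in that link before the space. You can't see it, but it is breaking your lstrip: Python does not treat U+FEFF as whitespace, so lstrip() stops at it and never reaches the space behind it. For the same reason, replace(' ', '') removes the space but leaves U+FEFF in front of the URL.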

I would suggest using pdflink.split(' ')[-1], which keeps only the part after the last space.
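
As a rough check of this explanation, the sketch below builds a string with a U+FEFF character and a space in front of the URL (an assumed reconstruction of the cell value), shows that the two attempts from the question leave U+FEFF in place, and compares the split approach with an lstrip call that names the characters to remove.

```python
BOM = '\ufeff'  # U+FEFF, not classified as whitespace by Python

url = ('https://www.doj.nh.gov/consumer/security-breaches/'
       'documents/a2z-field-services-20201218.pdf')
pdflink = BOM + ' ' + url  # assumed shape of the value read from the sheet

# Why the original attempts fail:
bom_is_space = BOM.isspace()               # lstrip() only strips whitespace
after_lstrip = pdflink.lstrip()            # stops at U+FEFF, removes nothing
after_replace = pdflink.replace(' ', '')   # drops the space, keeps U+FEFF

# Two ways to get a clean URL:
by_split = pdflink.split(' ')[-1]          # relies on the URL containing no spaces
by_strip = pdflink.lstrip(BOM + ' \t\r\n') # strips U+FEFF and common whitespace

print(bom_is_space, after_lstrip == pdflink, after_replace.startswith(BOM))
print(by_split == url, by_strip == url)
```

The explicit lstrip is a little more defensive than split, since it does not depend on the URL itself being free of spaces.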

CodePudding user response：

There's possibly some weird character at the beginning of the string. I would try opening the file you're reading with the encoding specified as encoding="UTF-8".

I also solved this issue using pdflink.lstrip(" "), where the text inside the quotes is the invisible non-space character copy-pasted from your original string.
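
On the encoding suggestion: if the links come from a plain text or CSV file that starts with a byte order mark, Python's "utf-8-sig" codec removes it on read, while plain "utf-8" keeps it as U+FEFF. The sketch below demonstrates this with a temporary file. The file name links.txt is hypothetical, and this applies to text files opened with open(); it is an assumption that the stray character entered the workbook this way, and openpyxl.load_workbook does not take such an encoding argument.

```python
import os
import tempfile

url = ('https://www.doj.nh.gov/consumer/security-breaches/'
       'documents/a2z-field-services-20201218.pdf')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'links.txt')  # hypothetical file name

    # Writing with utf-8-sig puts a byte order mark at the start of the file.
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write(url + '\n')

    # Plain utf-8 keeps the mark as a leading U+FEFF that strip() ignores.
    with open(path, encoding='utf-8') as f:
        raw = f.readline().strip()

    # utf-8-sig drops the mark while decoding.
    with open(path, encoding='utf-8-sig') as f:
        cleaned = f.readline().strip()

print(repr(raw[:6]))
print(repr(cleaned[:6]))
```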