Hello, I am trying to scrape the author image URL from a given video link using the urllib module. Because the URL length varies, my match runs past the end of the URL and picks up other properties such as the width and height, so I get this:
https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj","width":400,"height"
instead of the author image link that I want:
https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj
The code I am using:
import re
import urllib.request

def get_author(uri):
    html = urllib.request.urlopen(uri)
    author_image = re.findall(r'yt3.ggpht.com/(\S{99})', html.read().decode())
    return f"https://yt3.ggpht.com/{author_image[1]}"
Sorry for my bad English, thanks in advance =)
CodePudding user response:
If you are not sure about the length of the match, do not hardcode the number of characters to match. {99} is not going to work with arbitrary strings.
Besides, you are matching a string inside mark-up text, so you need to be sure you only match up to the delimiting character. If it is a " char, then match until that character.
Also, dots in a regex are special, and you need to escape them to match literal dots.
Finally, findall is used to match all occurrences; you can use re.search to get only the first one and save some resources.
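To see these points without touching the network, here is a small sketch run against a made-up fragment of page source. The sample string is an assumption modelled on the output shown in the question, not actual YouTube markup.

```python
import re

# Hypothetical fragment of page source, modelled on the output in the question.
sample = (
    '{"url":"https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw'
    '=s400-c-k-c0x00ffffff-no-rj","width":400,"height":400}]}'
)

# Fixed-length match: always takes 99 non-whitespace characters,
# so it runs past the closing quote of the URL.
fixed = re.search(r'yt3\.ggpht\.com/(\S{99})', sample)
print(fixed.group(1))

# Delimiter-based match: stops right before the first double quote.
delimited = re.search(r'yt3\.ggpht\.com/[^"]+', sample)
print(delimited.group())

# An unescaped dot matches any character, so this pattern is too permissive.
loose = re.search(r'yt3.ggpht.com', 'yt3-ggpht-com')
strict = re.search(r'yt3\.ggpht\.com', 'yt3-ggpht-com')
print(bool(loose), bool(strict))
```

The negated character class [^"]+ adapts to whatever length the URL has, which is why it is preferable to a fixed count here.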
So, a fix could look like:
import re
import urllib.request

def get_author(uri):
    html = urllib.request.urlopen(uri)
    # Match from the host name up to (but not including) the next double quote
    author_image = re.search(r'yt3\.ggpht\.com/[^"]+', html.read().decode())
    if author_image:
        return f"https://{author_image.group()}"
    return None  # you will need to handle this condition in your later code
Here, author_image is the regex match object. If there is a match, you prepend https:// to the matched text (author_image.group()) and return the result; otherwise, you return some default value that you can check for later in the code (here, None).
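As a rough illustration of handling the None case, the sketch below separates the parsing from the download so it can be tried on plain strings. The helper name extract_author_image and the two sample inputs are hypothetical and used only for illustration.

```python
import re
from typing import Optional

# Same pattern as in the fix: host name, then everything up to the next double quote.
AUTHOR_IMAGE_RE = re.compile(r'yt3\.ggpht\.com/[^"]+')

def extract_author_image(page_source: str) -> Optional[str]:
    """Return the author image URL found in page_source, or None if absent."""
    match = AUTHOR_IMAGE_RE.search(page_source)
    return f"https://{match.group()}" if match else None

# Hypothetical inputs: one containing an image URL, one without.
samples = (
    '"url":"https://yt3.ggpht.com/ytc/abc=s400-no-rj","width":400',
    'no avatar here',
)

for source in samples:
    url = extract_author_image(source)
    if url is None:
        # The caller decides what a missing image means (skip, placeholder, error).
        print("author image not found")
    else:
        print(url)
```

Checking explicitly for None keeps a page without a matching URL from turning into an exception later in the program.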