Python: getting the YouTube author image from the video link - CodePudding

Hello, I am trying to scrape the author image URL from a given video link using the urllib.request module, but because the URL length varies, my match runs past the end of the URL and picks up other properties such as the width and height:

https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj","width":400,"height"

instead of this author image link, which is what I want:

https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj

The code I have so far:

import re
import urllib.request

def get_author(uri):
    html = urllib.request.urlopen(uri)
    author_image = re.findall(r'yt3.ggpht.com/(\S{99})', html.read().decode())
    return f"https://yt3.ggpht.com/{author_image[1]}"

Sorry for my bad English, thanks in advance =)

CodePudding user response:

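If you are not sure about the length of the match, do not hardcode the number of characters to match. \S{99} always consumes exactly 99 non-whitespace characters, so it overshoots whenever the URL is shorter than that and is not going to work with arbitrary strings.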

Besides, you are matching a string inside markup text, so you need to be sure you only match up to the delimiting character. If it is a " character, then match until that character.
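To illustrate the difference between a fixed-length match and a delimiter-bounded one, here is a small sketch. The sample string is an assumption modeled on the output shown in the question, not real page source, and the variable names are made up for the example.

```python
import re

# Assumed fragment, modeled on the output shown in the question.
sample = ('"url":"https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4'
          'EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj","width":400,"height":400}')

# Fixed length: always takes 99 non-whitespace characters, so it runs past the URL.
fixed = re.search(r'yt3\.ggpht\.com/\S{99}', sample)

# Bounded: takes every character up to, but not including, the next double quote.
bounded = re.search(r'yt3\.ggpht\.com/[^"]+', sample)

print(fixed.group())
print(bounded.group())
```

The first line printed still carries the trailing "width" and "height" text, while the second stops at the end of the URL.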

Also, dots in regex are special and you need to escape them to match literal dots.
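A quick way to see why the escaping matters is to try both patterns on a string that has other characters where the dots should be. The lookalike string below is invented for the example.

```python
import re

# Invented string: the dots of the host name are replaced with 'x'.
lookalike = "yt3xggphtxcom/abc"

# An unescaped dot matches any character (except a newline), so this still matches.
unescaped = re.search(r'yt3.ggpht.com/', lookalike)

# An escaped dot matches only a literal '.', so this does not match.
escaped = re.search(r'yt3\.ggpht\.com/', lookalike)

print(unescaped is not None)
print(escaped is None)
```

Both checks print True: the unescaped pattern accepts the lookalike, the escaped one rejects it.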

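Besides, findall is used to match all occurrences and builds a list of every one; you can use re.search to get just the first one, which stops scanning as soon as it finds a match. Note that the code in the question took the second occurrence (author_image[1]), whereas re.search returns the first.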
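The two functions also return different kinds of values, which affects how the result is handled. A minimal sketch, using a made-up text with two matching URLs:

```python
import re

# Made-up text containing two matching URLs.
text = 'a "https://yt3.ggpht.com/one" b "https://yt3.ggpht.com/two"'
pattern = r'yt3\.ggpht\.com/[^"]+'

# findall scans the whole text and returns a list of all matched strings.
all_matches = re.findall(pattern, text)

# search stops at the first match and returns a match object (or None).
first = re.search(pattern, text)

# With no match, findall gives an empty list and search gives None.
none_found = re.findall(pattern, "nothing here")
no_match = re.search(pattern, "nothing here")

print(all_matches)
print(first.group())
```

Indexing the findall result with a fixed position, as in author_image[1], raises IndexError when fewer matches are found than expected, so a check on the result is needed either way.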

So, a fix could look like

import re
import urllib.request

def get_author(uri):
    html = urllib.request.urlopen(uri)
    # Escape the dots and match up to (not including) the closing double quote
    author_image = re.search(r'yt3\.ggpht\.com/[^"]+', html.read().decode())
    if author_image:
        return f"https://{author_image.group()}"
    return None  # you will need to handle this condition in your later code

Here, author_image is the regex match object, or None when nothing matches. If there is a match, you prepend https:// to the matched text (author_image.group()) and return the result; otherwise, you return some default value that the calling code can check for (here, None).
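One way to handle the None case in the calling code is sketched below. To keep the example runnable without network access, the matching step is moved into a separate helper that works on text that has already been downloaded; extract_author_image and the two sample pages are hypothetical names and data for this illustration, not part of the code above.

```python
import re
from typing import Optional

# Same pattern as in the fix: literal dots, match up to the closing double quote.
AVATAR_RE = re.compile(r'yt3\.ggpht\.com/[^"]+')

def extract_author_image(page_text: str) -> Optional[str]:
    """Hypothetical helper: return the first avatar URL in page_text, or None."""
    match = AVATAR_RE.search(page_text)
    if match:
        return f"https://{match.group()}"
    return None

# Made-up page fragments: one with an avatar URL, one without.
pages = [
    '{"url":"https://yt3.ggpht.com/ytc/abc=s88","width":88}',
    '<html>no avatar here</html>',
]

for page in pages:
    url = extract_author_image(page)
    if url is None:
        print("no author image found")
    else:
        print(url)
```

Keeping the regex step separate from the download also makes it possible to test the pattern against saved page text.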