How to scrape a YouTube video description with Beautiful Soup

Time:05-24

I am trying to scrape a list of YouTube videos and collect each video's description. So far I have been unsuccessful and do not understand why. Any help is much appreciated. (YouTube video in question: https://www.youtube.com/watch?v=57Tjvv_pCXg&t=55s)

element_titles = driver.find_elements_by_id("video-title")
result = requests.get(element_titles[1].get_attribute("href"))
soup = BeautifulSoup(result.content)
description = str(soup.find("div", {"class": "style-scope yt-formatted-string"}))

The resulting description is None.

Note: I understand that a YouTube API exists; however, you must pay for an API key, and it is not in my interest to do so.

CodePudding user response:

To extract the description you can use either Selenium or Beautiful Soup. The latter is faster; here is the code:

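import re
import requests
from bs4 import BeautifulSoup

# Fetch the watch page; the description is embedded in a JSON blob inside the HTML
soup = BeautifulSoup(requests.get('https://www.youtube.com/watch?v=57Tjvv_pCXg').content, 'html.parser')
pattern = re.compile('(?<=shortDescription":").*(?=","isCrawlable)')
# The JSON stores line breaks as a literal backslash followed by n; turn them into real newlines
description = pattern.findall(str(soup))[0].replace('\\n','\n')
print(description)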

If you run print(soup.prettify()) and search for a fragment of the video description, say "know this is just my", you will see that the complete description sits inside a large JSON structure:
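Instead of scrolling through the prettified output, the same lookup can be done in code. The following is a minimal sketch: show_context is a hypothetical helper (not part of Beautiful Soup or of the original answer), and the page string is a shortened stand-in for str(soup).

```python
def show_context(text: str, needle: str, width: int = 60) -> str:
    """Return the text surrounding the first occurrence of needle, or '' if absent."""
    index = text.find(needle)
    if index == -1:
        return ""
    start = max(0, index - width)
    end = min(len(text), index + len(needle) + width)
    return text[start:end]


# Shortened stand-in for str(soup); the real page is far larger.
page = (
    '{"isOwnerViewing":false,"shortDescription":"Y\'all know this is just my '
    'opinion, right?","isCrawlable":true}'
)

# Print the description fragment together with the JSON keys around it.
print(show_context(page, "know this is just my", width=40))
```

With the real page, passing str(soup) as the first argument should reveal which keys surround the description.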

...,"isOwnerViewing":false,"shortDescription":"Listen: https://quellechris360.bandcamp.com/album/deathfame\n\nQuelle Chris delivers what might be his most challengi...bla bla...ABSTRACT HIP HOP\n\n7/10\n\nY'all know this is just my opinion, right?","isCrawlable":true,"thumbnail":{...

In particular, the description sits between shortDescription":" and ","isCrawlable, so a regex can extract the substring between these two strings. The pattern (?<=shortDescription":").*(?=","isCrawlable) uses a lookbehind and a lookahead to match every character (.*) between the two markers without including the markers themselves.
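One limitation of replacing only \\n by hand is that other JSON escapes (such as \" or \u0026) stay in the output. A possible variant, sketched below, captures the raw JSON string and lets json.loads decode it. The helper name extract_short_description and the sample string are hypothetical, and the sketch assumes the page embeds the description as an ordinary JSON string literal, as in the excerpt above.

```python
import json
import re
from typing import Optional

# Capture the JSON string that follows the "shortDescription" key.
# (?:[^"\\]|\\.)* consumes ordinary characters and backslash escapes,
# so an escaped quote (\") does not end the match early.
SHORT_DESC_RE = re.compile(r'"shortDescription":"((?:[^"\\]|\\.)*)"')


def extract_short_description(html: str) -> Optional[str]:
    """Return the decoded description, or None if the key is not found."""
    match = SHORT_DESC_RE.search(html)
    if match is None:
        return None
    # Re-wrap the raw text in quotes so json.loads decodes \n, \", \uXXXX, ...
    return json.loads('"' + match.group(1) + '"')


# Made-up sample in the same shape as the excerpt above (not real page data).
sample = (
    '"isOwnerViewing":false,"shortDescription":"Line one\\n\\n'
    'He said \\"hi\\" \\u0026 left","isCrawlable":true'
)
print(extract_short_description(sample))
```

For a live page, the argument would be requests.get(url).text; in that case Beautiful Soup is not strictly needed, since the regex runs on the raw HTML.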
