I am trying to scrape a list of YouTube videos and collect each video's description. However, my attempt fails and I do not understand why. Any help is much appreciated. (YouTube video in question: https://www.youtube.com/watch?v=57Tjvv_pCXg&t=55s)
element_titles = driver.find_elements_by_id("video-title")
result = requests.get(element_titles[1].get_attribute("href"))
soup = BeautifulSoup(result.content)
description = str(soup.find("div", {"class": "style-scope yt-formatted-string"}))
The resulting description is None.
Note: I understand that a YouTube API exists; however, you must pay for an API key, and I am not interested in doing so.
CodePudding user response:
To extract the description you can use either Selenium or BeautifulSoup. The latter is faster; here is the code:
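import re
import requests
from bs4 import BeautifulSoup
# Download the watch page; the description is embedded as JSON inside the HTML
soup = BeautifulSoup(requests.get('https://www.youtube.com/watch?v=57Tjvv_pCXg').content, 'html.parser')
# Match everything between the "shortDescription" and "isCrawlable" keys
pattern = re.compile(r'(?<=shortDescription":").*(?=","isCrawlable)')
# The page stores line breaks as a literal backslash + n; turn them into real newlines
description = pattern.findall(str(soup))[0].replace('\\n','\n')
print(description)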
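As a minimal sketch, the same extraction can be wrapped in a function that returns None instead of raising an IndexError when the markers are missing (for example, if the page layout changes). The function name `extract_short_description` and the sample HTML string are made up for illustration; the sample stands in for the downloaded page so the snippet runs without network access.

```python
import re
from typing import Optional

# Same pattern as in the answer: the text between the two JSON keys.
PATTERN = re.compile(r'(?<=shortDescription":").*(?=","isCrawlable)')

def extract_short_description(html: str) -> Optional[str]:
    """Return the description found in `html`, or None if the markers are absent."""
    match = PATTERN.search(html)
    if match is None:
        return None
    # Line breaks are stored as the two characters backslash + n.
    return match.group(0).replace('\\n', '\n')

# Hypothetical stand-in for the page source (not real YouTube output).
sample_html = '<script>{"shortDescription":"Line one\\nLine two","isCrawlable":true}</script>'

print(extract_short_description(sample_html))
print(extract_short_description('<html>no markers here</html>'))
```

With a real page you would pass `requests.get(url).text` as the `html` argument.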
If you run print(soup.prettify()) and search for part of the video description, say "know this is just my", you will see that the complete description sits inside a large JSON structure:
...,"isOwnerViewing":false,"shortDescription":"Listen: https://quellechris360.bandcamp.com/album/deathfame\n\nQuelle Chris delivers what might be his most challengi...bla bla...ABSTRACT HIP HOP\n\n7/10\n\nY'all know this is just my opinion, right?","isCrawlable":true,"thumbnail":{...
In particular, the description sits between shortDescription":" and ","isCrawlable, so we can use a regex to extract the substring between these two strings. The regex that matches every character (.*) between the two strings is (?<=shortDescription":").*(?=","isCrawlable). The lookbehind (?<=...) and lookahead (?=...) require the two markers to be present without including them in the match.
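To illustrate what the lookbehind and lookahead contribute, here is a small offline sketch that runs the pattern with and without them on a shortened stand-in for the embedded JSON (the `blob` string is an abbreviated example, not the real page content). It also shows one possible way to decode all JSON escape sequences at once with `json.loads`, rather than replacing only `\n`; this is an optional variation, not part of the original answer.

```python
import json
import re

# Shortened stand-in for the JSON embedded in the page.
blob = ('{"isOwnerViewing":false,"shortDescription":"7/10\\n\\n'
        'Y\'all know this is just my opinion, right?","isCrawlable":true}')

# Without lookarounds, the markers themselves are part of the match.
with_markers = re.search(r'shortDescription":".*","isCrawlable', blob).group(0)

# With lookbehind/lookahead, only the text between the markers is returned.
raw = re.search(r'(?<=shortDescription":").*(?=","isCrawlable)', blob).group(0)

# Wrapping the raw text in double quotes makes it a JSON string literal,
# so json.loads decodes every escape sequence, not only the newline escape.
decoded = json.loads('"' + raw + '"')

print(with_markers)
print(raw)
print(decoded)
```

Note that `.` does not match a real newline by default, so the pattern relies on the JSON being on a single line, as it is in the excerpt above.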