How to extract the comments count correctly

Time:09-01

I am trying to extract the number of comments on a YouTube video and have tried several methods.

My Code:

from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time

DRIVER_PATH = <your chromedriver path>
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'

wd.get(url)
wait = WebDriverWait(wd, 100)

time.sleep(40)
v_title = wd.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print("title Is ")
print(v_title)

comments_xpath = '//h2[@id="count"]/yt-formatted-string/span[1]'
v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath)))
#wd.find_element_by_xpath(comments_xpath)
print(len(v_comm_cnt))

I get the following error:

selenium.common.exceptions.TimeoutException: Message: 

I get the correct value for the title but not for the comment count. Can anyone please tell me what is wrong with my code?

Please note that the comment count XPath, //h2[@id="count"]/yt-formatted-string/span[1], points to the correct element when I search for it in the browser's Inspect Element panel.

CodePudding user response:

Your locator is correct. The problem is that the comment counter element is not loaded when the page first opens; you need to scroll the page down for that element to load.
Also, the fixed 40-second delay is unnecessary; an explicit WebDriverWait is a better fit here.
Finally, visibility_of_element_located returns a WebElement, so len() cannot be applied to it; read its .text attribute to get the comment count.
So I think this should work better:

from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time

DRIVER_PATH = <your chromedriver path>
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'

wd.get(url)
wait = WebDriverWait(wd, 30)

v_title = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="container"]/h1/yt-formatted-string'))).text

print("Title is " + v_title)
comments_xpath = '//h2[@id="count"]/yt-formatted-string/span[1]'

# Scroll halfway down the page so the comments section starts loading
wd.execute_script("window.scrollTo(0, document.documentElement.scrollHeight / 2)")
v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath))).text
print("Video has " + v_comm_cnt + " comments")
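
The text read from the counter element is a display string rather than a number. Below is a minimal sketch of converting it to an integer, assuming the string is a fully written-out count with thousands separators (for example "1,234"). parse_count is a hypothetical helper, not part of Selenium.

```python
import re


def parse_count(text):
    """Return the digits in a display string such as '1,234' as an int.

    Assumes the count is written out in full; abbreviated forms such as
    '1.2K' are not handled and would give a wrong number.
    """
    digits = re.sub(r"\D", "", text)  # drop separators, spaces, and labels
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_count("1,234"))  # 1234
```

In the script above this could be called as parse_count(v_comm_cnt).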

CodePudding user response:

An alternative is to use the requests module, since the YouTube page loads its comments from a JSON endpoint that can be called directly.

import requests
import pandas as pd
payload = {"context":{"client":{"hl":"en","gl":"BD","remoteHost":"37.111.194.92","deviceMake":"","deviceModel":"","visitorData":"Cgs2cVM0MmxkT0hONCi7yL6YBg==","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36,gzip(gfe)","clientName":"WEB","clientVersion":"2.20220831.00.00","osName":"Windows","osVersion":"10.0","originalUrl":"https://www.youtube.com/watch?v=5qzKTbnhyhc","platform":"DESKTOP","clientFormFactor":"UNKNOWN_FORM_FACTOR","configInfo":{"appInstallData":"CLvIvpgGENSDrgUQuIuuBRD2__0SEOK8rgUQy-z9EhD0__0SELfLrQUQ4rmuBRCTr64FEPy6rgUQkpWuBRC5xK4FENi-rQUQkfj8Eg=="},"timeZone":"Asia/Dhaka","browserName":"Chrome","browserVersion":"104.0.0.0","screenWidthPoints":541,"screenHeightPoints":657,"screenPixelDensity":1,"screenDensityFloat":1,"utcOffsetMinutes":360,"userInterfaceTheme":"USER_INTERFACE_THEME_LIGHT","connectionType":"CONN_CELLULAR_4G","memoryTotalKbytes":"4000000","mainAppWebInfo":{"graftUrl":"https://www.youtube.com/watch?v=5qzKTbnhyhc&ab_channel=LittleSoul","pwaInstallabilityStatus":"PWA_INSTALLABILITY_STATUS_CAN_BE_INSTALLED","webDisplayMode":"WEB_DISPLAY_MODE_BROWSER","isWebNativeShareAvailable":True}},"user":{"lockedSafetyMode":False},"request":{"useSsl":True,"internalExperimentFlags":[],"consistencyTokenJars":[]},"clickTracking":{"clickTrackingParams":"CAAQg2ciEwj2k-vv1vH5AhVrk9gFHdYmB8c="},"adSignalsInfo":{"params":[{"key":"dt","value":"1661969459779"},{"key":"flash","value":"0"},{"key":"frm","value":"0"},{"key":"u_tz","value":"360"},{"key":"u_his","value":"1"},{"key":"u_h","value":"768"},{"key":"u_w","value":"1366"},{"key":"u_ah","value":"728"},{"key":"u_aw","value":"1366"},{"key":"u_cd","value":"24"},{"key":"bc","value":"31"},{"key":"bih","value":"657"},{"key":"biw","value":"524"},{"key":"brdim","value":"0,0,0,0,1366,0,1366,728,541,657"},{"key":"vis","value":"1"},{"key":"wgl","value":"True"},{"key":"ca_type","value":"image"}]}},"continuation":"Eg0SCzVxektUYm5oeWhjGAYy6AMKvgNnZXRfcmFua2VkX3N0cmVhbXMtLUNvd0NDSUFFRlJlMzBUZ2FnUUlLX0FFSTJGOFFnQVFZQnlMeEFkTTVYR3g2VUNKTnVKQkRMaEFWNWNCZGNsUzE1SkNhSXZWWWZvUjNFbFNJX0RIaU01V0cxRTNNYk9NMGJHR0JSWVl5bE55Rng3NHViRjlCdi13cG5TcTBXeXVFdWJzRGpnYkxiZVoxbjJjSUpmY0JCVTgwWmlKWGJnZVlmTF9YaFMwNGN1cVc3M25FMEpPMUZYeUJRYkJGOVlBTlg5Q2NJVzNpM2gtYTlhSm84WU10cFZ3OHl1NG1UTS1CckIyTnVsWXJiVW9VNktHdWRkSU56S29NYUdJSnhDbmJ4aGl5cDA0cjB5WkNfSzNxaURIRlN1bXAta0tOajl4Y3Y5RVNDTXFRSUdORWJJQk93SlkzWDFQUTdrR0JoaWNsb2NLNnltMVFZUzJlM21iY1V5S09fU25DWWpwRFJfTUpEV3NjNXl3UUtCSUZDSWNnR0FBU0JRaUlJQmdBRWdjSWhTQVFLQmdCRWdVSWhpQVlBQklGQ0lrZ0dBQVNCd2lYSUJBbkdBRVlBQSIRIgs1cXpLVGJuaHloYzAAeAEoKEIQY29tbWVudHMtc2VjdGlvbg=="}
api_url = "https://www.youtube.com/youtubei/v1/next?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8&prettyPrint=false"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "application/json"
    }

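# POST the payload; the endpoint replies with JSON
req = requests.post(api_url, headers=headers, json=payload)
item = req.json()['onResponseReceivedEndpoints'][0]['appendContinuationItemsAction']['continuationItems']

# Flatten the nested JSON into a DataFrame, one row per continuation item
df = pd.json_normalize(item)
# Column 10 held the comment text for this response; the position may differ for other responses
d = df.iloc[:, 10:11]  # .to_csv('out.csv', index=False)
print(d)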
  

Output:

0   [{'text': 'This playlist is wonderful. Thank y...
1   [{'text': 'Everyone who is reading this I wish...
2   [{'text': 'This music '}, {'text': '           
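
Each cell in the output above is a list of {'text': ...} fragments rather than a plain string. Below is a minimal sketch of joining the fragments of one cell, assuming every fragment is a dict with a 'text' key as in the output shown. join_runs is a hypothetical helper, and the sample values are made up for illustration.

```python
def join_runs(runs):
    """Concatenate the 'text' fragments of one cell into a single string.

    Cells that are not lists (for example NaN in rows without comment
    text) yield an empty string.
    """
    if not isinstance(runs, list):
        return ""
    return "".join(fragment.get("text", "") for fragment in runs)


# Made-up sample in the same shape as the output above
sample = [{"text": "This music "}, {"text": "is great"}]
print(join_runs(sample))  # This music is great
```

With the DataFrame from the answer, something like d.iloc[:, 0].map(join_runs) should apply it to every row.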