Sorry for my poor English; I'm fairly new to Python. How can I skip a for-loop iteration when a web element does not exist, or fill in another value instead? I've been trying to scrape a YouTube channel to get each video's title, view count, and posting date. My code looks like this:
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome import service
from selenium.webdriver.common.keys import Keys
import time
import wget
import os
import pandas as pd
import matplotlib.pyplot as plt

urls = [
    'https://www.youtube.com/c/LofiGirl/videos',
    'https://www.youtube.com/c/Miawaug/videos'
]

for url in urls:
    PATH = 'C:\webdrivers\chromedriver.exe.'
    driver = webdriver.Chrome(PATH)
    driver.get(url)
    #driver.maximize_window()
    driver.implicitly_wait(10)
    for i in range(10):
        driver.find_element(By.TAG_NAME, "Body").send_keys(Keys.END)
        driver.implicitly_wait(20)
        time.sleep(5)
    judul_video = []
    viewers = []
    tanggal_posting = []
    titles = driver.find_elements(By.XPATH, "//a[@id='video-title']")
    views = driver.find_elements(By.XPATH, "//div[@id='metadata-line']/span[1]")
    DatePosted = driver.find_elements(By.XPATH, "//div[@id='metadata-line']/span[2]")
    for title in titles:
        judul_video.append(title.text)
        driver.implicitly_wait(5)
    for view in views:
        viewers.append(view.text)
        driver.implicitly_wait(5)
    for posted in DatePosted:
        tanggal_posting.append(posted.text)
        driver.implicitly_wait(5)
    vid_item = {
        "video_title" : judul_video,
        "views" : viewers,
        "date_posted" : tanggal_posting
    }
    df = pd.DataFrame(vid_item, columns=["video_title", "views", "date_posted"])
    #df_new = df.transpose()
    print(df)
    filename = url.split('/')[-2]
    df.to_csv(rf"C:\Users\.......\YouTube_{filename}.csv", sep=",")
    driver.quit()
That code works fine, except for this part:
for posted in DatePosted:
    tanggal_posting.append(posted.text)
    driver.implicitly_wait(5)
When a channel is live streaming, such as Lofi Girl, I get the error "All arrays must be of the same length". I tried to write an if/else condition that fills in another value for the live stream, such as tanggal_posting.append("Live Stream"), or that skips extraction for that video entirely, starting from the title, but it failed. This is the code that tries to skip or fill in another value:
for posted in DatePosted:
    if len(posted.text) > 0:
        tanggal_posting.append(posted.text)
        driver.implicitly_wait(5)
    else:
        tanggal_posting.append("Live")
        driver.implicitly_wait(5)
How can I skip the iteration for just the single video that is live streaming? Or how can I fill in another value, such as "Live Stream", using an if/else condition as I mentioned above? Thank you so much in advance.
CodePudding user response:
Personally, I'd first check whether posted is viable for a .text attribute call.
for posted in DatePosted:
    _posted = posted.text.strip() if posted else None
    tanggal_posting.append(_posted if _posted else "Live")
    driver.implicitly_wait(5)
Alternatively:
for posted in DatePosted:
    _posted = posted.text.strip() if posted else None
    if not _posted:
        continue
    tanggal_posting.append(_posted)
    driver.implicitly_wait(5)
The overall code will differ depending on your objective, though I suppose _posted will be helpful in either case.
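The _posted normalization used in both variants could be pulled into one small helper. This is only a minimal sketch: text_or_default is a hypothetical name, not part of Selenium, and it assumes nothing more than an object with a .text attribute, which is what a Selenium WebElement provides. Keep in mind that it can only replace the value of an element that was actually found; it does not add entries for videos whose date span is missing from the page.

```python
def text_or_default(element, default="Live"):
    """Return element.text stripped of whitespace, or default if it is missing or blank.

    `element` is assumed to be a Selenium WebElement (anything with a
    .text attribute works) or None.
    """
    text = element.text.strip() if element is not None else ""
    return text or default
```

With this in place, the fill-in loop would reduce to tanggal_posting = [text_or_default(posted) for posted in DatePosted].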
CodePudding user response:
Instead of collecting three separate lists, one per data item, I'd suggest getting the list of videos and then extracting and handling each item per video:
videos = driver.find_elements(By.XPATH, "//div[@id='items']/ytd-grid-video-renderer")
for video in videos:
    if not video.find_elements(By.XPATH, ".//yt-icon"):  # Check if no Streaming icon
        title = video.find_element(By.XPATH, ".//a[@id='video-title']")
        view = video.find_element(By.XPATH, ".//div[@id='metadata-line']/span[1]")
        DatePosted = video.find_element(By.XPATH, ".//div[@id='metadata-line']/span[2]")
Note that you need to call driver.implicitly_wait(<SECONDS>) only ONCE, at the beginning of the script!
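If the goal is to fill in "Live Stream" rather than skip the video, the per-video approach could be sketched as below. This is a variation, not the code from the answer: instead of checking for the yt-icon, it treats a missing second metadata span as a live stream, which is an assumption based on the question. The names first_text and video_rows are hypothetical helpers, and the XPATH constant holds the same string value as Selenium's By.XPATH so the sketch runs without importing Selenium.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH


def first_text(parent, xpath, default=""):
    """Return the stripped text of the first match under parent, or default.

    find_elements returns an empty list when nothing matches, so a missing
    element never raises here.
    """
    found = parent.find_elements(XPATH, xpath)
    text = found[0].text.strip() if found else ""
    return text or default


def video_rows(videos):
    """Build one dict per video so every row has all three fields."""
    rows = []
    for video in videos:
        rows.append({
            "video_title": first_text(video, ".//a[@id='video-title']"),
            "views": first_text(video, ".//div[@id='metadata-line']/span[1]"),
            # Assumption: a video with no second metadata span is a live stream.
            "date_posted": first_text(
                video, ".//div[@id='metadata-line']/span[2]", "Live Stream"
            ),
        })
    return rows
```

Because each video yields exactly one row, pd.DataFrame(video_rows(videos), columns=["video_title", "views", "date_posted"]) should not hit the "All arrays must be of the same length" error.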