How to distinguish two tables with the same relative XPath with Selenium in Python

I'm trying to scrape some data from IMDb with Selenium in Python, but I have a problem. For each movie I have to fetch the directors and the writers. They are listed in two separate tables that share the same @class, so I need to tell the tables apart when I scrape; otherwise the program may sometimes record a writer as a director and vice versa.

I tried using a relative XPath to find all the tables and then looping over them, identifying each one by its title (an h4 element) via the preceding-sibling axis. The code runs without errors, but it never finds anything (it returns nan every time).

This is my code:

    counter = 1
    try:
        driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
        ssleep()
        tables = driver.find_elements(By.XPATH, '//table[@]/tbody')
        counter = 1
        for table in tables:
            xpath_table = f'//table[@]/tbody[{counter}]' 
            xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
            table_title = driver.find_element(By.XPATH, xpath_h4).text
            if table_title == "Directed by":
                rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
                for row in rows_director:
                    director = row.find_elements(By.CSS_SELECTOR, 'a')
                    director = [x.text for x in director]
                    if len(director) == 1:
                        director = ''.join(map(str, director))
                    else:
                        director = ', '.join(map(str, director))
                        director_list.append(director)
        counter += 1

    except NoSuchElementException:
        # director = np.nan
        director_list.append(np.nan)

Can any of you tell me why it doesn't work? Perhaps there is a better solution. I hope for your help.

(here you can find an example of the page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)

CodePudding user response：

To extract the names of the directors and writers of a movie on imdb.com, wait for the elements with WebDriverWait and visibility_of_all_elements_located(), using either of the following locator strategies:

Using CSS_SELECTOR:

driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#director + table > tbody tr > td > a")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#writer + table > tbody tr > td > a")))])

Using XPATH:

driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='director']//following::table[1]/tbody//tr/td/a")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='writer']//following::table[1]/tbody//tr/td/a")))])

Console Output:

['Matt Reeves']
['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
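Both locators in this answer differ only in the h4 id, which is what tells the two tables apart. As a minimal sketch (the helper names credit_css and credit_xpath are hypothetical, not part of Selenium), the same patterns can be built for any section id:

```python
def credit_css(section_id):
    """CSS selector for the name links in the table right after <h4 id=section_id>."""
    # '+' is the adjacent-sibling combinator: it picks the table that
    # immediately follows the h4, so each heading selects only its own table.
    return f"h4#{section_id} + table > tbody tr > td > a"


def credit_xpath(section_id):
    """XPath equivalent, using the same expression as this answer."""
    return f"//h4[@id='{section_id}']//following::table[1]/tbody//tr/td/a"


# Print the locator strings for the two sections discussed in the question.
for section in ("director", "writer"):
    print(credit_css(section))
    print(credit_xpath(section))
```

The returned strings would be passed as (By.CSS_SELECTOR, credit_css("writer")) or (By.XPATH, credit_xpath("writer")) to the waits used in this answer. This assumes the page keeps the h4 ids "director" and "writer" seen in the examples.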

CodePudding user response：

You can use the id attributes of the h4 tags that head the Directors and Writers tables to extract the data.

Try like below:

# Imports Required
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

links = ["https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm","https://www.imdb.com/title/tt10234724/fullcredits/?ref_=tt_cl_sm",
         "https://www.imdb.com/title/tt10872600/fullcredits?ref_=tt_cl_wr_sm","https://www.imdb.com/title/tt1160419/fullcredits?ref_=tt_cl_wr_sm"]

for link in links:
    driver.get(link)
    wait = WebDriverWait(driver,20)
    
    # Get the name of the movie
    name = wait.until(EC.presence_of_element_located((By.XPATH,"//h3[@itemprop='name']/a"))).text
    
    # Get the Directors
    directors = driver.find_elements(By.XPATH,"//h4[@id='director']/following-sibling::table[1]//tr")
    dir_list = []
    for director in directors:
        # Add the director names in the list. You can format the unwanted string using replace.
        dir_list.append(director.text)

    # Get the Writers
    writers = driver.find_elements(By.XPATH,"//h4[@id='writer']/following-sibling::table[1]//tr")
    wri_list = []
    for writer in writers:
        # Add the Writer names in the list. You can format the unwanted string using replace.
        wri_list.append(writer.text)

    # Print the data.
    print(f"Name of the movie: {name}")
    print(f"Directors : {dir_list}")
    print(f"Writers : {wri_list}")

Output:

Name of the movie: The Batman
Directors : ['Matt Reeves ... (directed by)']
Writers : ['Matt Reeves ... (written by) &', 'Peter Craig ... (written by)', ' ', 'Bill Finger ... (Batman created by) &', 'Bob Kane ... (Batman created by)']
Name of the movie: Moon Knight
Directors : ['Justin Benson ... (5 episodes, 2022)', 'Mohamed Diab ... (5 episodes, 2022)', 'Aaron Moorhead ... (5 episodes, 2022)']
Writers : ['Danielle Iman ... (staff writer) (6 episodes, 2022)', 'Doug Moench ... (characters) (6 episodes, 2022)', 'Doug Moench ... (creator) (6 episodes, 2022)', 'Don Perlin ... (characters) (6 episodes, 2022)', 'Jeremy Slater ... (created for television by) (6 episodes, 2022)', 'Jeremy Slater ... (6 episodes, 2022)', 'Peter Cameron ... (written by) (2 episodes, 2022)', 'Sabir Pirzada ... (written by) (2 episodes, 2022)', 'Beau DeMayo ... (written by) (1 episode, 2022)', 'Michael Kastelein ... (written by) (1 episode, 2022)', 'Alex Meenehan ... (written by) (1 episode, 2022)', 'Jack Kirby ... (Based on the Marvel comics by) (unknown episodes)', 'Stan Lee ... (Based on the Marvel comics by) (unknown episodes)']
Name of the movie: Spider-Man: No Way Home
Directors : ['Jon Watts']
Writers : ['Chris McKenna ... (written by) &', 'Erik Sommers ... (written by)', ' ', 'Stan Lee ... (based on the Marvel comic book by) and', 'Steve Ditko ... (based on the Marvel comic book by)']
Name of the movie: Dune
Directors : ['Denis Villeneuve ... (directed by)']
Writers : ['Jon Spaihts ... (screenplay by) and', 'Denis Villeneuve ... (screenplay by) and', 'Eric Roth ... (screenplay by)', ' ', 'Frank Herbert ... (based on the novel Dune written by)']
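The row texts printed above still carry the credit role and, for series, repeat a person once per role. A minimal sketch of the clean-up step, assuming the " ... " separator seen in this output is stable (clean_credit_rows is a hypothetical helper name):

```python
def clean_credit_rows(rows):
    """Return the unique names from row texts such as 'Matt Reeves ... (directed by)'."""
    names = []
    for row in rows:
        # Everything before the first ' ... ' separator is the person's name;
        # rows without a separator (e.g. 'Jon Watts') are kept whole.
        name = row.split(" ... ", 1)[0].strip()
        if name:  # skip spacer rows such as ' '
            names.append(name)
    # dict.fromkeys drops duplicates while preserving the original order.
    return list(dict.fromkeys(names))


writers = [
    "Matt Reeves ... (written by) &",
    "Peter Craig ... (written by)",
    " ",
    "Bill Finger ... (Batman created by) &",
    "Bob Kane ... (Batman created by)",
]
print(clean_credit_rows(writers))
# prints ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
```

Calling it on dir_list and wri_list inside the loop would give plain name lists comparable to the other answers' output.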

CodePudding user response：

Since this is static page content, you don't even need Selenium. You can use the lightweight Python requests module and BeautifulSoup (bs4) instead. It's just another approach.

import requests
from bs4 import BeautifulSoup

res=requests.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm")
result=res.text
soup=BeautifulSoup(result, 'html.parser')
directors=[director.text.strip() for director in soup.select("h4#director+table tr td.name>a")]
writers=[writer.text.strip() for writer in soup.select("h4#writer+table tr td.name>a")]

print(directors)
print(writers)

Output:

['Matt Reeves']
['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
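To fit lists like these back into the question's loop, which builds the URL from a title id and stores one cell per movie with nan when nothing is found, a rough sketch follows. The helper names credits_url and to_cell are hypothetical, and math.nan stands in for the np.nan used in the question so the snippet needs no third-party package:

```python
import math


def credits_url(tt):
    """Full-credits URL for an IMDb title id, in the form used in the question."""
    return f"https://www.imdb.com/title/{tt}/fullcredits/?ref_=tt_cl_sm"


def to_cell(names):
    """Join scraped names into one string, or return NaN when the list is empty."""
    return ", ".join(names) if names else math.nan


print(credits_url("tt1877830"))
print(to_cell(["Matt Reeves", "Peter Craig", "Bill Finger", "Bob Kane"]))
print(to_cell([]))  # no credits found for this section
```

With this shape, director_list.append(to_cell(directors)) adds exactly one entry per movie whether or not the section exists on the page.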