I need to extract a list of all the genres of any given movie from the movie page on IMDb.
For example:
- Movie page: https://www.imdb.com/title/tt0454848/?ref_=adv_li_i
- List of Genres: [Crime, Drama, Mystery, Thriller]
I tried using Beautiful Soup but I am not able to find the exact class under which the genres are stored.
Following are the snippets I tried:
ul = soup.find("ul", {"class": "ipc-metadata-list ipc-metadata-list--dividers-all sc-388740f9-1 IjgYL ipc-metadata-list--base"})
children = ul.findChildren("a", recursive=False)
This throws an error saying AttributeError: 'NoneType' object has no attribute 'findChildren' (i.e. the find() call matched nothing).
class_selector = "ipc-inline-list__item"
genre = soup.find_all('li', {'class': class_selector})
list1 = []
for tag in list1:
    list1.append(tag.find('a')).text
print(list1)
This returns a list with no entries.
Any help would be great!
Image of the website source code
CodePudding user response:
You probably don't need the overhead of a selenium/chromedriver setup; instead you can do it with requests:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.imdb.com/title/tt0454848/?ref_=adv_li_i')
soup = BeautifulSoup(r.text, 'html.parser')

genres = soup.select_one('div.ipc-chip-list__scroller')
for genre in genres.contents:
    print(genre.text)
This prints out:
Crime
Drama
Mystery
BeautifulSoup documentation can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
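As another requests-only option: IMDb title pages also embed their metadata in a JSON-LD script block, which typically carries the complete genre list even when the rendered chip list is truncated. A sketch of parsing it, where the sample snippet is a synthetic stand-in for a real page (the exact JSON-LD structure is an assumption; check the page source for your title):

```python
import json
from bs4 import BeautifulSoup

def genres_from_jsonld(html):
    """Pull the genre list out of a page's JSON-LD metadata block.

    Assumes the page contains a <script type="application/ld+json"> tag
    whose "genre" key holds the genres -- IMDb may change this, so verify
    against the live page source.
    """
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('script', type='application/ld+json')
    if tag is None:
        return []
    data = json.loads(tag.string)
    genre = data.get('genre', [])
    # "genre" may be a single string or a list, depending on the title
    return [genre] if isinstance(genre, str) else list(genre)

# Synthetic snippet mimicking the shape of an IMDb title page
sample = '''<html><head><script type="application/ld+json">
{"@type": "Movie", "name": "Inside Man",
 "genre": ["Crime", "Drama", "Mystery", "Thriller"]}
</script></head></html>'''

print(genres_from_jsonld(sample))  # ['Crime', 'Drama', 'Mystery', 'Thriller']
```

In practice you would pass r.text from the requests call above instead of the sample string.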
UPDATE: To get the complete genre list, you can use selenium alone. Full code below:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'
# url = 'https://www.imdb.com/title/tt0765429/'
browser.get(url)
browser.execute_script("window.scrollBy(0,2200);")
# wait until the lazily loaded (GraphQL-fetched) storyline section has rendered
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//div[@data-testid="storyline-plot-summary"]')))
genres = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//span[text()='Genres']/following-sibling::div//child::li")))
for g in genres:
    print(g.text)
This will print out:
Crime
Drama
Mystery
Thriller
This solution uses selenium only, and will wait as long as needed (well, up to 20 seconds) for the data to be pulled in by the GraphQL query.
CodePudding user response:
According to your screenshot, to get the list of genres you can use selenium with bs4 as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'
driver.get(url)
driver.maximize_window()
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
t = soup.select_one('span.ipc-metadata-list-item__label:-soup-contains("Genres")').parent
genre = [x.get_text() for x in t.select('div > ul > li')]
print(genre)
Output:
['Crime', 'Drama', 'Mystery', 'Thriller']
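The label-based lookup above can be wrapped in a small helper that works on driver.page_source or any saved copy of the page. A sketch, where the class names follow the screenshot in the question and the sample snippet is synthetic:

```python
from bs4 import BeautifulSoup

def genres_from_source(page_source):
    """Find the "Genres" label and collect the <li> entries next to it.

    Returns [] if the label is missing (e.g. the page has not finished
    rendering yet), instead of raising AttributeError on None.
    """
    soup = BeautifulSoup(page_source, 'html.parser')
    label = soup.select_one(
        'span.ipc-metadata-list-item__label:-soup-contains("Genres")')
    if label is None:
        return []
    return [li.get_text(strip=True) for li in label.parent.select('ul li')]

# Synthetic snippet mimicking the markup shown in the screenshot
sample = '''
<li><span class="ipc-metadata-list-item__label">Genres</span>
  <div><ul><li><a>Crime</a></li><li><a>Drama</a></li>
           <li><a>Mystery</a></li><li><a>Thriller</a></li></ul></div></li>'''

print(genres_from_source(sample))  # ['Crime', 'Drama', 'Mystery', 'Thriller']
```

The None guard matters here because the AttributeError in the question came from calling a method on the result of a lookup that matched nothing.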
CodePudding user response:
To extract the list of all the genres, i.e. bank, bank robber, hostage, robber and negotiation, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CLASS_NAME and get_attribute("textContent"):

driver.execute("get", {'url': 'https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'})
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "ipc-chip--on-base")))])
Using CSS_SELECTOR and get_attribute("innerHTML"):

driver.execute("get", {'url': 'https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'})
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.ipc-chip.ipc-chip--on-base span")))])
Using XPATH and the text attribute:

driver.execute("get", {'url': 'https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'})
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@class='ipc-chip ipc-chip--on-base']//span")))])
Console Output:
['bank', 'bank robber', 'hostage', 'robber', 'negotiation']
Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC