How to scrape Director information using BeautifulSoup

Time:03-24

I want to scrape the Director from the "Cast and Crew" tab of this page, https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=cast-and-crew, using Beautiful Soup.

I tried this soup.find_all('div', id="cast-and-crew") but had no success.

CodePudding user response:

I am new to this, like you. I tried, and with BeautifulSoup the request does not get through, maybe because of some kind of security check. I then tried to do what you want with Selenium and it works. Check this:

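from selenium import webdriver
from selenium.webdriver.common.by import By

website = "https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=cast-and-crew"

chrome_options = webdriver.ChromeOptions()
# Suppress ChromeDriver's console logging noise
chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(options=chrome_options)
driver.get(website)

# The find_element_by_* helpers were deprecated in Selenium 4 and later removed;
# find_element(By..., ...) is the supported form
box = driver.find_element(By.CLASS_NAME, "cast_new")

# This XPath starts with //, so it is evaluated against the whole document, not only inside box
matches = box.find_elements(By.XPATH, '//*[@id="cast-and-crew"]/div[5]/table/tbody/tr[1]/td[1]/b/a')

for match in matches:
    print(match.text)

driver.quit()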

CodePudding user response:

I can get the data only if I add a User-Agent header to the request:

from bs4 import BeautifulSoup as BS
import requests

headers = { 
 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0', 
}

url = 'https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=cast-and-crew'

response = requests.get(url, headers=headers)

# --- response ---

#print(response.status_code)
#print(response.text[:1000])

soup = BS(response.text, 'html.parser')

all_items = soup.find_all('div', id="cast-and-crew") 
for item in all_items:
    print(item.get_text(strip=True, separator='\n'))

Result:

Lead Ensemble Members
Robert Downey, Jr.
Tony Stark/Iron Man
Chris Evans
Steve Rogers/Captain America
Mark Ruffalo
Bruce Banner/Hulk
Chris Hemsworth
Thor
Scarlett Johansson
Natasha Romanoff/Black Widow
Jeremy Renner
Clint Barton/Hawkeye
Don Cheadle
...
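
The output above is the whole "Cast and Crew" section as text, not the Director alone. A minimal sketch of one way to narrow it down follows, under assumptions that are not confirmed by this thread: that each credit is a table row whose first cell holds the person's name and whose last cell holds the role. The SAMPLE_HTML markup, the placeholder names in it, and the helper find_people_by_role are all hypothetical and exist only for illustration; check the real page's HTML before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption, not the
# site's actual HTML). Names are placeholders, not scraped data.
SAMPLE_HTML = """
<div id="cast-and-crew">
  <table>
    <tr><td><b><a href="#">Alice Example</a></b></td><td></td><td>Director</td></tr>
    <tr><td><b><a href="#">Bob Example</a></b></td><td></td><td>Director</td></tr>
    <tr><td><b><a href="#">Carol Example</a></b></td><td></td><td>Screenwriter</td></tr>
  </table>
</div>
"""


def find_people_by_role(html, role="Director"):
    """Return names from rows whose last cell text equals the given role."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", id="cast-and-crew")
    if section is None:
        # The section is missing, e.g. the request was blocked
        return []
    names = []
    for row in section.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Assumed layout: name in the first cell, role in the last cell
        if len(cells) >= 2 and cells[-1] == role:
            names.append(cells[0])
    return names


print(find_people_by_role(SAMPLE_HTML))
```

With the real page, response.text from the request above would be passed in place of SAMPLE_HTML. If the actual rows are laid out differently, the cell indexes would need adjusting.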