I'm having issues getting data with BeautifulSoup

Time:03-11

import requests
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"    

response = requests.get(URL)
website_html = response.text

soup = BeautifulSoup(website_html, "html.parser")

all_movies = soup.find_all(name="h3", class_="jsx-4245974604")

movie_titles = [movie.getText() for movie in all_movies]
# movies = movie_titles[::-1]
print(movie_titles)

This code should show the movie titles, but it doesn't, and I can't understand what's wrong.

CodePudding user response:

You could use Selenium to let the page render and then parse the HTML, as F.Hoque stated. However, the data is already present in a <script> tag in JSON format, so it is more efficient to read it from there:

import requests
from bs4 import BeautifulSoup
import re
import json

URL = "https://www.empireonline.com/movies/features/best-movies-2/"    

response = requests.get(URL)
website_html = response.text

soup = BeautifulSoup(website_html, "html.parser")

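# The page data is embedded as JSON in the Next.js <script id="__NEXT_DATA__"> tag
script = soup.find('script', {'id':'__NEXT_DATA__'})
# Cut the JSON object out of the tag's markup, then walk down to the Apollo state
jsonStr = re.search('({.*})', str(script)).group(1)
jsonData = json.loads(jsonStr)['props']['pageProps']['apolloState']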

movie_titles = [v['titleText'].split(') ')[-1] for k, v in jsonData.items() if 'ImageMeta' in k]

# If you want the rank/number use the line below
#movie_titles = [v['titleText'] for k, v in jsonData.items() if 'ImageMeta' in k]
print(movie_titles)

Output:

['Reservoir Dogs', 'Groundhog Day', 'Paddington 2', 'Amelie', 'Brokeback Mountain', 'Donnie Darko', 'Scott Pilgrim Vs. The World', 'Portrait Of A Lady On Fire', 'Léon', 'Logan', 'The Terminator', 'No Country For Old Men', 'Titanic', 'The Exorcist', 'Black Panther', 'Shaun Of The Dead', 'Lost In Translation', 'Thor: Ragnarok', 'The Usual Suspects', 'Psycho', 'L.A. Confidential', 'E.T. – The Extra Terrestrial', 'In The Mood For Love', 'Star Wars: Return Of The Jedi', 'Arrival', 'A Quiet Place', 'Trainspotting', 'Mulholland Drive', 'Rear Window', 'Up', 'Spider-Man: Into The Spider-Verse', 'Inglourious Basterds', 'Lady Bird', "Singin' In The Rain", "One Flew Over The Cuckoo's Nest", 'Seven Samurai', 'La La Land', 'Get Out', 'Lawrence Of Arabia', "Pan's Labyrinth", 'Hot Fuzz', 'Moonlight', 'Guardians Of The Galaxy', 'Blade Runner 2049', 'The Social Network', 'Taxi Driver', 'Saving Private Ryan', 'Forrest Gump', ' Point Break', 'Whiplash', 'Vertigo', 'Spirited Away', 'Ghostbusters', 'Do The Right Thing', "Schindler's List", 'The Big Lebowski', "It's A Wonderful Life", 'There Will Be Blood', '12 Angry Men', 'The Silence Of The Lambs', ' Citizen Kane', 'Gladiator', 'The Good, The Bad And The Ugly', 'Se7en', 'Eternal Sunshine Of The Spotless Mind', 'The Shining', 'The Lord Of The Rings: The Two Towers', 'Casablanca', 'The Thing', 'Interstellar', 'Heat', 'Apocalypse Now', 'Indiana Jones And The Last Crusade', 'The Lord Of The Rings The Return Of The King', 'Die Hard', 'Fight Club', 'Terminator 2 Judgment Day', '2001: A Space Odyssey', 'Avengers: Endgame', 'Alien', 'The Matrix', 'Inception', 'Parasite', 'Aliens', 'Blade Runner', 'Jurassic Park', 'The Godfather Part II', 'Back To The Future', 'Mad Max: Fury Road', 'Star Wars', 'Goodfellas', 'Raiders Of The Lost Ark', 'Avengers: Infinity War', 'Pulp Fiction', 'Jaws', 'The Shawshank Redemption', 'The Dark Knight', 'The Godfather', 'Star Wars: The Empire Strikes Back', 'The Lord Of The Rings: The Fellowship Of The Ring']
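
As a rough sketch of the same idea without the regex: if the JSON text of the __NEXT_DATA__ tag is already in hand (with BeautifulSoup it should be available as the tag's .string), it can go straight to json.loads. The sample JSON and the extract_titles helper below are made up for illustration. Only the props/pageProps/apolloState path and the 'ImageMeta' and 'titleText' names come from the answer above.

```python
import json

# Made-up stand-in for the text inside <script id="__NEXT_DATA__">.
# The real payload is far larger; only the key names mirror the answer above.
sample_json = """
{"props": {"pageProps": {"apolloState": {
  "ImageMeta:1": {"titleText": "100) Reservoir Dogs"},
  "ImageMeta:2": {"titleText": "99) Groundhog Day"},
  "Article:1": {"headline": "not a movie entry"}
}}}}
"""


def extract_titles(next_data_text, keep_rank=False):
    """Hypothetical helper: pull movie titles out of the embedded JSON text."""
    state = json.loads(next_data_text)["props"]["pageProps"]["apolloState"]
    titles = []
    for key, value in state.items():
        # Only the ImageMeta entries carry a movie title
        if "ImageMeta" not in key:
            continue
        title = value.get("titleText")
        if title is None:
            continue
        if not keep_rank:
            # Split on the first ") " only, so the rest of the title stays intact
            _, sep, rest = title.partition(") ")
            title = rest.strip() if sep else title
        titles.append(title)
    return titles


titles = extract_titles(sample_json)
print(titles)
```

Stripping the result also removes the stray leading space seen in entries such as ' Point Break' in the output above.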

CodePudding user response:

The code you've written is correct, but the selected elements are dynamic, meaning they are populated by JavaScript, so BeautifulSoup on its own can't grab them. That's why you need an automation tool such as Selenium together with BeautifulSoup. Just run the code below.

Script:

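from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.empireonline.com/movies/features/best-movies-2/'

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(10)  # crude fixed wait so JavaScript can render the list

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # the page source is already captured, so the browser can go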

all_movies = soup.find_all(name="h3", class_="jsx-4245974604")

movie_titles = [movie.getText() for movie in all_movies]
# movies = movie_titles[::-1]
print(movie_titles)

Output:

['100) Reservoir Dogs', '99) Groundhog Day', '98) Paddington 2', '97) Amelie', '96) Brokeback Mountain', '95) Donnie Darko', '94) Scott Pilgrim Vs. The World', '93) Portrait Of A Lady On Fire', '92) Léon', '91) Logan', '90) The Terminator', '89) No Country For Old Men', '88) Titanic', '87) The Exorcist', '86) Black Panther', '85) Shaun Of The Dead', '84) Lost In Translation', '83) Thor: Ragnarok', '82) The Usual Suspects', '81) Psycho', '80) L.A. Confidential', '79) E.T. – The Extra Terrestrial', '78) In The Mood For Love', '77) Star Wars: Return Of The Jedi', '76) Arrival', '75) A Quiet Place', '74) Trainspotting', '73) Mulholland Drive', '72) Rear Window', '71) Up', '70) Spider-Man: Into The Spider-Verse', '69) Inglourious Basterds', '68) Lady Bird', "67) Singin' In The Rain", "66) One Flew Over The Cuckoo's Nest", '65) Seven Samurai', '64) La La Land', '63) Get Out', '62) Lawrence Of Arabia', "61) Pan's Labyrinth", '60) Hot Fuzz', '59) Moonlight', '58) Guardians Of The Galaxy', '57) Blade Runner 2049', '56) The Social Network', '55) Taxi Driver', '54) Saving Private Ryan', '53) Forrest Gump', '52)  Point Break', '51) Whiplash', '50) Vertigo', '49) Spirited Away', '48) Ghostbusters', '47) Do The Right Thing', "46) Schindler's List", '45) The Big Lebowski', "44) It's A Wonderful Life", '43) There Will Be Blood', '42) 12 Angry Men', '41) The Silence Of The Lambs', '40)  Citizen Kane', '39) Gladiator', '38) The Good, The Bad And The Ugly', '37) Se7en', '36) Eternal Sunshine Of The Spotless Mind', '35) The Shining', '34) The Lord Of The Rings: The Two Towers', '33) Casablanca', '32) The Thing', '31) Interstellar', '30) Heat', '29) Apocalypse Now', '28) Indiana Jones And The Last Crusade', '27) The Lord Of The Rings The Return Of The King', '26) Die Hard', '25) Fight Club', '24) Terminator 2 Judgment Day', '23) 2001: A Space Odyssey', '22) Avengers: Endgame', '21) Alien', '20) The Matrix', '19) Inception', '18) Parasite', '17) Aliens', '16) Blade Runner', '15) Jurassic Park', '14) The Godfather Part II', '13) Back To The Future', '12) Mad Max: Fury Road', '11) Star Wars', '10) Goodfellas', '9) Raiders Of The Lost Ark', '8) Avengers: Infinity War', '7) Pulp Fiction', '6) Jaws', '5) The Shawshank Redemption', '4) The Dark Knight', '3) The Godfather', '2) Star Wars: The Empire Strikes Back', '1) The Lord Of The Rings: The Fellowship Of The Ring']
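
The titles scraped this way keep their "100) " rank prefix and run from 100 down to 1. If the rank and the title are wanted separately, or the list should start at number 1, something like the sketch below might do. The split_rank helper and the short scraped list are illustrative assumptions, not part of the answer above. The sample entries are copied from the output.

```python
import re

# A rank, a closing parenthesis, optional spaces, then the title
RANKED = re.compile(r"^\s*(\d+)\)\s*(.+?)\s*$")


def split_rank(entry):
    """Hypothetical helper: '52)  Point Break' -> (52, 'Point Break')."""
    match = RANKED.match(entry)
    if match is None:
        # No rank prefix found: return the text unchanged apart from whitespace
        return None, entry.strip()
    return int(match.group(1)), match.group(2)


# A few entries copied from the output above
scraped = [
    "100) Reservoir Dogs",
    "52)  Point Break",
    "23) 2001: A Space Odyssey",
    "1) The Lord Of The Rings: The Fellowship Of The Ring",
]

# Sort ascending by rank so the list starts at number 1
ranked = sorted((split_rank(entry) for entry in scraped), key=lambda pair: pair[0])
for rank, title in ranked:
    print(f"{rank}. {title}")
```

Anchoring the pattern at the start of the string means a title that itself begins with digits, such as '2001: A Space Odyssey', is not mistaken for a rank.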

CodePudding user response:

Yeah, I tried name="div" too, but it didn't help.
