IMDb webscraping for the top 250 movies using Beautifulsoup

Time:10-10

I know that there are many similar questions here already, but none of them gives me a satisfying answer for my problem. So here it is:

We need to create a dataframe from the top 250 movies from IMDb for an assignment. So we need to scrape the data first using BeautifulSoup.

These are the attributes that we need to scrape:

IMDb id (0111161)
Movie name (The Shawshank Redemption)
Year (1994)
Director (Frank Darabont)
Stars (Tim Robbins, Morgan Freeman, Bob Gunton)
Rating (9.3)
Number of reviews (2.6M)
Genres (Drama)
Country (USA)
Language (English)
Budget ($25,000,000)
Gross box Office Revenue ($28,884,504)

So far, I have managed to get only a few of them. I have already collected the URLs of all 250 movies, and now I loop over them. This is what the loop looks like so far:

for x in np.arange(0, len(top_250_links)):
    url=top_250_links[x]
    req = requests.get(url)
    page = req.text
    soup = bs(page, 'html.parser')
    
    # ID
    
    # Movie Name
    Movie_name=(soup.find("div",{'class':"sc-dae4a1bc-0 gwBsXc"}).get_text(strip=True).split(': ')[1])
    
    # Year
    year =(soup.find("a",{'class':"ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"}).get_text())
    
    # Length
    
    
    # Director
    director = (soup.find("a",{'class':"ipc-metadata-list-item__list-content-item"}).get_text())
    
    # Stars
    stars = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
    
    
    # Rating
    rating = (soup.find("span",{'class':"sc-7ab21ed2-1 jGRxWM"}).get_text())
    rating = float(rating)
        
    # Number of Reviews
    reviews = (soup.find("span",{'class':"score"}).get_text())
    reviews = reviews.split('K')[0]
    reviews = float(reviews)*1000
    reviews = int(reviews)
    
    # Genres
    genres = (soup.find("span",{'class':"ipc-chip__text"}).get_text())

    # Language
    
    
    # Country
    
    
    # Budget
    meta = (soup.find("div" ,{'class':"ipc-metadata-list-item__label ipc-metadata-list-item__label--link"}))
    
    
    # Gross box Office Revenue
    gross = (soup.find("span",{'class':"ipc-metadata-list-item__list-content-item"}).get_text())
    
    # Combine
    movie_dict={
        'Rank': x + 1,
        'ID': 0,
        'Movie Name' : Movie_name,
        'Year' : year,
        'Length' : 0,
        'Director' : director,
        'Stars' : stars,
        'Rating' : rating,
        'Number of Reviews' : reviews,
        'Genres' : genres,
        'Language': 0,
        'Country': 0,
        'Budget' : 0,
        'Gross box Office Revenue' :0}
    
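    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    df = pd.concat([df, pd.DataFrame.from_records([movie_dict], columns=movie_dict.keys())], ignore_index=True)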

I can't find a way to obtain the missing information. If anybody here has experience with this kind of task and can share their thoughts, it would help a lot of people. I think the task is not new and has been solved hundreds of times, but IMDb has since changed the classes and the structure of its HTML.

Thanks in advance.

CodePudding user response:

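BeautifulSoup has many functions for searching elements, so it is worth reading the whole documentation.

You can build more complex lookups by chaining several .find() calls with .parent, and so on. For example, this finds the text "Language", goes up two levels, and takes the first link in that container: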

soup.find(text='Language').parent.parent.find('a').text

For some elements you can also search by the data-testid="...." attribute, which tends to be more stable than the generated class names:

soup.find('li', {'data-testid': 'title-details-languages'}).find('a').text
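Not every title page has every field, and .find() returns None when nothing matches, so a chained .find(...).find(...) raises AttributeError on the first movie with a missing row. A minimal sketch of a guard around the data-testid lookup follows. The helper name detail_text and the sample HTML are made up for illustration (the markup is a simplified stand-in, not IMDb's real page), and the budget test id is only a guess.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the details section of a title page (an assumption,
# not IMDb's real markup); the budget row is left out on purpose.
SAMPLE_HTML = """
<ul>
  <li data-testid="title-details-origin"><span>Country of origin</span><a href="#">United States</a></li>
  <li data-testid="title-details-languages"><span>Language</span><a href="#">English</a></li>
</ul>
"""


def detail_text(soup, testid):
    """Return the text of the first link inside the <li> with this data-testid, or None."""
    item = soup.find('li', {'data-testid': testid})
    if item is None:
        return None  # the row is missing on this page
    link = item.find('a')
    return link.get_text(strip=True) if link is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
language = detail_text(soup, 'title-details-languages')
country = detail_text(soup, 'title-details-origin')
budget = detail_text(soup, 'title-boxoffice-budget')  # hypothetical test id, absent in the sample

print(language, country, budget)
```

Inside the loop over the 250 URLs, a None result can then be stored as a missing value instead of stopping the whole run.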

Minimal working code (for The Shawshank Redemption):

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1a264172-ae11-42e4-8ef7-7fed1973bb8f&pf_rd_r=A453PT2BTBPG41Y0HKM8&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'

response = requests.get(url)
soup = BS(response.text, 'html.parser')

print('Language:', soup.find(text='Language').parent.parent.find('a').get_text(strip=True))
print('Country of origin:', soup.find(text='Country of origin').parent.parent.find('a').get_text(strip=True))

for name in ('Language', 'Country of origin'):
    value = soup.find(text=name).parent.parent.find('a').get_text(strip=True)
    print(name, ':', value)

print('Language:', soup.find('li', {'data-testid':'title-details-languages'}).find('a').get_text(strip=True))
print('Country of origin:', soup.find('li', {'data-testid':'title-details-origin'}).find('a').get_text(strip=True))

for name, testid in ( ('Language', 'title-details-languages'), ('Country of origin', 'title-details-origin')):    
    value = soup.find('li', {'data-testid':testid}).find('a').get_text(strip=True)
    print(name, ':', value)

Result:

Language: English
Country of origin: United States

Language : English
Country of origin : United States

Language: English
Country of origin: United States

Language : English
Country of origin : United States
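Some of the attributes in the question's list need no HTML search at all, or only a little text clean-up after the search. The sketch below uses only the standard library: the IMDb id is taken from the movie URL, and strings such as "2.6M" or "$25,000,000" are turned into integers (the loop in the question only handles the "K" suffix). The function names are made up for illustration, and the input formats are assumed from the examples in the question.

```python
import re
from decimal import Decimal

# Multipliers for abbreviated counts such as "2.6M"
SUFFIXES = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}


def imdb_id_from_url(url):
    """Return the digits after 'tt' in a title URL, as a string to keep leading zeros."""
    match = re.search(r'/title/tt(\d+)', url)
    return match.group(1) if match else None


def parse_count(text):
    """Turn '2.6M', '950K' or '1,234' into an int; return None if the format is unexpected."""
    match = re.fullmatch(r'(\d[\d,]*(?:\.\d+)?)\s*([KMB]?)', text.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    number = Decimal(match.group(1).replace(',', ''))  # Decimal avoids float rounding
    factor = SUFFIXES.get(match.group(2).upper(), 1)
    return int(number * factor)


def parse_money(text):
    """Turn '$25,000,000 (estimated)' into 25000000; return None if no number is found."""
    match = re.search(r'\d[\d,]*', text)
    return int(match.group(0).replace(',', '')) if match else None


url = 'https://www.imdb.com/title/tt0111161/?ref_=chttp_tt_1'
print(imdb_id_from_url(url))                   # 0111161
print(parse_count('2.6M'))                     # 2600000
print(parse_money('$25,000,000 (estimated)'))  # 25000000
```

Keeping the id as a string matters because converting "0111161" to an integer would drop the leading zero.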
