How to extract title name and rating of a movie from IMDB database?

Time:01-12

I'm very new to web scraping in Python. I want to extract the movie name, release year, and rating from IMDb. This is the IMDb page with 250 movies and ratings: https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm. I use the BeautifulSoup and requests modules. Here is my code:

movies = bs.find('tbody',class_='lister-list').find_all('tr')

When I tried to extract the movie name, rating & year, I got the same attribute error for all of them.

<td class="titleColumn">
 <a href="/title/tt11564570/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=ea4e08e1-c8a3-47b5-ac3a-75026647c16e&amp;pf_rd_r=BQWZRBFAM81S7K6ZBPJP&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=moviemeter&amp;ref_=chtmvm_tt_1" title="Rian Johnson (dir.), Daniel Craig, Edward Norton">Glass Onion: une histoire à couteaux tirés</a>
 <span >(2022)</span>
 <div >1
 <span >(
 <span ></span>
 1)</span>

 <td class="ratingColumn imdbRating">
 <strong title="7,3 based on 207 962 user ratings">7,3</strong>


title = movies.find('td',class_='titleColumn').a.text
rating = movies.find('td',class_='ratingColumn imdbRating').strong.text
year = movies.find('td',class_='titleColumn').span.text.strip('()')

AttributeError                            Traceback (most recent call last)
<ipython-input-9-2363bafd916b> in <module>
----> 1 title = movies.find('td',class_='titleColumn').a.text
      2 title

~\anaconda3\lib\site-packages\bs4\element.py in __getattr__(self, key)
   2287     def __getattr__(self, key):
   2288         """Raise a helpful exception to explain a common code fix."""
-> 2289         raise AttributeError(
   2290             "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2291         )

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Can someone help me to solve the problem? Thanks in advance!

CodePudding user response:

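find_all returns a ResultSet, which is a list of rows, so you have to loop over the rows and extract the fields from each one. You can try the following example.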

from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []

res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm")
#print(res)
soup = BeautifulSoup(res.content, "html.parser")

for card in soup.select('.chart.full-width tbody tr'):
    data.append({
        "title": card.select_one('.titleColumn a').get_text(strip=True),
        "year": card.select_one('.titleColumn span').text,
        'rating': card.select_one('td.ratingColumn.imdbRating').get_text(strip=True)
            })

df = pd.DataFrame(data)
print(df)
#df.to_csv('out.csv', index=False)

Output:

                                            title       year rating
0                            Avatar: The Way of Water  (2022)    7.9
1                                         Glass Onion  (2022)    7.2
2                                            The Menu  (2022)    7.3
3                                         White Noise  (2022)    5.8
4                                   The Pale Blue Eye  (2022)    6.7
..                                                ...     ...    ...
95                                          Zoolander  (2001)    6.5
96                      Once Upon a Time in Hollywood  (2019)    7.6
97  The Lord of the Rings: The Fellowship of the Ring  (2001)    8.8
98                                     New Year's Eve  (2011)    5.6
99                            Spider-Man: No Way Home  (2021)    8.2

[100 rows x 3 columns]
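
The year column above still carries its parentheses, and the HTML in the question shows a localized rating ("7,3"). As a minimal sketch of cleaning such strings after scraping, the helpers below convert them to numbers. The names parse_year and parse_rating are hypothetical, and the sample rows are copied from the post rather than fetched live.

```python
import re

def parse_year(text):
    """Return the four-digit year found in a string like '(2022)', or None."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None

def parse_rating(text):
    """Convert '6.5' or a localized '7,3' to a float; return None if empty."""
    text = text.strip().replace(",", ".")
    return float(text) if text else None

# Sample rows in the shape produced by the scraping loop (values from the post)
rows = [
    {"title": "Glass Onion", "year": "(2022)", "rating": "7,3"},
    {"title": "Zoolander", "year": "(2001)", "rating": "6.5"},
]

cleaned = [
    {
        "title": row["title"],
        "year": parse_year(row["year"]),
        "rating": parse_rating(row["rating"]),
    }
    for row in rows
]
print(cleaned)
```

Using a regular expression for the year instead of strip('()') also tolerates extra whitespace around the parentheses.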

Update: The same data can be extracted with the find_all and find methods:

from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0'}

data = []

res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm", headers=headers)
#print(res)
soup = BeautifulSoup(res.content, "html.parser")

for card in soup.table.tbody.find_all("tr"):
    data.append({
        "title": card.find("td",class_="titleColumn").a.get_text(strip=True),
        "year": card.find("td",class_="titleColumn").span.get_text(strip=True),
        'rating': card.find('td',class_="ratingColumn imdbRating").get_text(strip=True)
            })

df = pd.DataFrame(data)
print(df)
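
One thing to watch for: find and select_one return None when nothing matches, so chaining .text onto a missing element raises another AttributeError. The sketch below guards against that, assuming (this is not confirmed by the post) that some rows may have no rating. The helper text_or_none and the inline sample HTML, including the unrated row, are made up for illustration and run without a network request.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first CSS match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

# Hypothetical sample markup; the second row deliberately has no <strong>
html = """
<table><tbody>
<tr><td class="titleColumn"><a href="#">The Menu</a><span>(2022)</span></td>
    <td class="ratingColumn imdbRating"><strong>7.3</strong></td></tr>
<tr><td class="titleColumn"><a href="#">Hypothetical Unrated Film</a><span>(2023)</span></td>
    <td class="ratingColumn imdbRating"></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
data = [
    {
        "title": text_or_none(row, "td.titleColumn a"),
        "year": text_or_none(row, "td.titleColumn span"),
        "rating": text_or_none(row, "td.ratingColumn.imdbRating strong"),
    }
    for row in soup.select("tbody tr")
]
print(data)
```

Rows with a missing field end up with None in that field instead of stopping the whole loop.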

CodePudding user response:

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

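find_all returns a ResultSet, which behaves like a list of Tag objects, so movies is a list and not a single element. The find method exists only on the individual elements, so you need to iterate over the list with for movie in movies: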

for movie in movies:
    title = movie.find('td',class_='titleColumn').a.text
    rating = movie.find('td',class_='ratingColumn imdbRating').strong.text
    year = movie.find('td',class_='titleColumn').span.text.strip('()')
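
To see the difference without hitting the site, here is a small offline sketch. The HTML is a trimmed, hand-written sample modeled on the table in the question, not the real page, and the variable raised is only there to record what happened.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup modeled on the question's table
html = """
<table><tbody class="lister-list">
<tr><td class="titleColumn"><a href="#">Glass Onion</a> <span>(2022)</span></td>
    <td class="ratingColumn imdbRating"><strong>7.2</strong></td></tr>
<tr><td class="titleColumn"><a href="#">The Menu</a> <span>(2022)</span></td>
    <td class="ratingColumn imdbRating"><strong>7.3</strong></td></tr>
</tbody></table>
"""

bs = BeautifulSoup(html, "html.parser")
movies = bs.find("tbody", class_="lister-list").find_all("tr")

# Calling find on the whole ResultSet reproduces the error from the question
try:
    movies.find("td", class_="titleColumn")
    raised = False
except AttributeError:
    raised = True

# Calling find on each row (a Tag) works
titles = [movie.find("td", class_="titleColumn").a.text for movie in movies]
years = [movie.find("td", class_="titleColumn").span.text.strip("()") for movie in movies]

print(raised, titles, years)
```

Indexing works too: movies[0] is a single Tag, so movies[0].find(...) is valid if you only need the first row.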