How to extract title name and rating of a movie from IMDB database?

I'm very new to web scraping in Python. I want to extract the movie name, release year, and rating from the IMDb database. The page I'm scraping, which lists 250 movies and their ratings, is https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm and I use the BeautifulSoup and requests modules. Here is my code:

movies = bs.find('tbody',class_='lister-list').find_all('tr')

When I tried to extract the movie name, rating, and year, I got the same AttributeError for all of them. This is the HTML of one row:

<td class="titleColumn">
 <a href="/title/tt11564570/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=ea4e08e1-c8a3-47b5-ac3a-75026647c16e&amp;pf_rd_r=BQWZRBFAM81S7K6ZBPJP&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=moviemeter&amp;ref_=chtmvm_tt_1" title="Rian Johnson (dir.), Daniel Craig, Edward Norton">Glass Onion: une histoire à couteaux tirés</a>
 <span >(2022)</span>
 <div >1
 <span >(
 <span ></span>
 1)</span>

 <td class="ratingColumn imdbRating">
 <strong title="7,3 based on 207 962 user ratings">7,3</strong>


title = movies.find('td',class_='titleColumn').a.text
rating = movies.find('td',class_='ratingColumn imdbRating').strong.text
year = movies.find('td',class_='titleColumn').span.text.strip('()')

AttributeError                            Traceback (most recent call last)
<ipython-input-9-2363bafd916b> in <module>
----> 1 title = movies.find('td',class_='titleColumn').a.text
      2 title

~\anaconda3\lib\site-packages\bs4\element.py in __getattr__(self, key)
   2287     def __getattr__(self, key):
   2288         """Raise a helpful exception to explain a common code fix."""
-> 2289         raise AttributeError(
   2290             "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2291         )

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Can someone help me to solve the problem? Thanks in advance!

CodePudding user response：

find_all returns a ResultSet, so the rows have to be handled one at a time in a loop. The following example does this with CSS selectors:

from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []

res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm")
#print(res)
soup = BeautifulSoup(res.content, "html.parser")

for card in soup.select('.chart.full-width tbody tr'):
    data.append({
        "title": card.select_one('.titleColumn a').get_text(strip=True),
        "year": card.select_one('.titleColumn span').text,
        'rating': card.select_one('td.ratingColumn.imdbRating').get_text(strip=True)
            })

df = pd.DataFrame(data)
print(df)
#df.to_csv('out.csv', index=False)

Output:

                                            title       year rating
0                            Avatar: The Way of Water  (2022)    7.9
1                                         Glass Onion  (2022)    7.2
2                                            The Menu  (2022)    7.3
3                                         White Noise  (2022)    5.8
4                                   The Pale Blue Eye  (2022)    6.7
..                                                ...     ...    ...
95                                          Zoolander  (2001)    6.5
96                      Once Upon a Time in Hollywood  (2019)    7.6
97  The Lord of the Rings: The Fellowship of the Ring  (2001)    8.8
98                                     New Year's Eve  (2011)    5.6
99                            Spider-Man: No Way Home  (2021)    8.2

[100 rows x 3 columns]
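
The year column in the output above still carries its parentheses, and both columns are strings. If numeric values are needed, a small post-processing step can convert them. The following is only a sketch: the helper names parse_year and parse_rating are made up for illustration, and the comma handling assumes a localized rating such as the "7,3" shown in the question's HTML.

```python
def parse_year(raw):
    """Turn a string such as '(2022)' into the integer 2022."""
    return int(raw.strip().strip("()"))


def parse_rating(raw):
    """Turn '7.9' or a localized '7,3' into a float; return None if empty."""
    cleaned = raw.strip().replace(",", ".")
    return float(cleaned) if cleaned else None


print(parse_year("(2022)"))
print(parse_rating("7,3"))
print(parse_rating(""))
```

With a DataFrame like the one above, these helpers could be applied per column, for example df["year"].map(parse_year).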

Update: The same data can also be extracted with the find_all and find methods.

from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0'}

data = []

res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm", headers=headers)
#print(res)
soup = BeautifulSoup(res.content, "html.parser")

for card in soup.table.tbody.find_all("tr"):
    data.append({
        "title": card.find("td",class_="titleColumn").a.get_text(strip=True),
        "year": card.find("td",class_="titleColumn").span.get_text(strip=True),
        'rating': card.find('td',class_="ratingColumn imdbRating").get_text(strip=True)
            })

df = pd.DataFrame(data)
print(df)
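
Both versions assume that every row contains a title link, a year span, and a rating cell. find and select_one return None when nothing matches, so a row that lacks one of them raises an AttributeError. One way to guard against that is sketched here, assuming a hypothetical helper text_or_none and a made-up two-row table rather than the real page:

```python
from bs4 import BeautifulSoup


def text_or_none(row, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = row.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


# Made-up sample data: the second row has no rating cell at all.
html = """
<table><tbody>
<tr><td class="titleColumn"><a href="#">The Menu</a><span>(2022)</span></td>
    <td class="ratingColumn imdbRating"><strong>7.3</strong></td></tr>
<tr><td class="titleColumn"><a href="#">Unrated Film</a><span>(2023)</span></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    {
        "title": text_or_none(tr, ".titleColumn a"),
        "year": text_or_none(tr, ".titleColumn span"),
        "rating": text_or_none(tr, "td.ratingColumn.imdbRating"),
    }
    for tr in soup.select("tbody tr")
]
print(rows)
```

Rows with a missing cell then end up with None in that field instead of stopping the whole loop.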

CodePudding user response：

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

find_all returns a ResultSet, which is a list of elements, so movies is a list rather than a single element. You need to iterate over it with for movie in movies and call find on each row:

for movie in movies:
    title = movie.find('td',class_='titleColumn').a.text
    rating = movie.find('td',class_='ratingColumn imdbRating').strong.text
    year = movie.find('td',class_='titleColumn').span.text.strip('()')
    print(title, year, rating)
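
To see the difference without hitting the network, the same loop can be run against a small hand-written table. This is a sketch only: the HTML below is a simplified stand-in modeled on the snippet in the question, filled with two titles from the output shown in the other answer, and is not the real page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the chart table; not the real IMDb markup.
html = """
<table><tbody class="lister-list">
<tr>
  <td class="titleColumn"><a href="#">Glass Onion</a> <span>(2022)</span></td>
  <td class="ratingColumn imdbRating"><strong>7.2</strong></td>
</tr>
<tr>
  <td class="titleColumn"><a href="#">The Menu</a> <span>(2022)</span></td>
  <td class="ratingColumn imdbRating"><strong>7.3</strong></td>
</tr>
</tbody></table>
"""

bs = BeautifulSoup(html, "html.parser")
movies = bs.find("tbody", class_="lister-list").find_all("tr")

# find_all() returns a ResultSet (a list subclass), not a single Tag,
# so movies.find(...) fails while movies[0].find(...) works.
print(type(movies).__name__, len(movies))

results = []
for movie in movies:  # each item is a Tag, so .find() is available
    title_cell = movie.find("td", class_="titleColumn")
    rating_cell = movie.find("td", class_="ratingColumn imdbRating")
    results.append((
        title_cell.a.text,
        title_cell.span.text.strip("()"),
        rating_cell.strong.text,
    ))

print(results)
```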