I know that there are many similar questions here already, but none of them gives me a satisfying answer for my problem. So here it is:
We need to create a dataframe from the top 250 movies from IMDb for an assignment. So we need to scrape the data first using BeautifulSoup.
These are the attributes that we need to scrape:
IMDb id (0111161)
Movie name (The Shawshank Redemption)
Year (1994)
Director (Frank Darabont)
Stars (Tim Robbins, Morgan Freeman, Bob Gunton)
Rating (9.3)
Number of reviews (2.6M)
Genres (Drama)
Country (USA)
Language (English)
Budget ($25,000,000)
Gross box Office Revenue ($28,884,504)
So far, I have managed to get only a few of them. I have collected the separate URLs for all the movies, and now I loop over them. This is how the loop looks so far:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

# top_250_links is the list of movie URLs collected earlier
rows = []
for x in range(len(top_250_links)):
    url = top_250_links[x]
    req = requests.get(url)
    page = req.text
    soup = bs(page, 'html.parser')
    # ID
    # Movie Name
    Movie_name = soup.find("div", {'class': "sc-dae4a1bc-0 gwBsXc"}).get_text(strip=True).split(': ')[1]
    # Year
    year = soup.find("a", {'class': "ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"}).get_text()
    # Length
    # Director
    director = soup.find("a", {'class': "ipc-metadata-list-item__list-content-item"}).get_text()
    # Stars
    # NOTE: 'td.titleColumn' looks like a selector for the Top 250 chart page,
    # so this likely returns an empty list on a title page
    stars = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
    # Rating
    rating = soup.find("span", {'class': "sc-7ab21ed2-1 jGRxWM"}).get_text()
    rating = float(rating)
    # Number of Reviews (only handles a 'K' suffix so far)
    reviews = soup.find("span", {'class': "score"}).get_text()
    reviews = reviews.split('K')[0]
    reviews = float(reviews) * 1000
    reviews = int(reviews)
    # Genres (only the first genre chip)
    genres = soup.find("span", {'class': "ipc-chip__text"}).get_text()
    # Language
    # Country
    # Budget
    meta = soup.find("div", {'class': "ipc-metadata-list-item__label ipc-metadata-list-item__label--link"})
    # Gross box Office Revenue
    gross = soup.find("span", {'class': "ipc-metadata-list-item__list-content-item"}).get_text()
    # Combine (0 marks a field that is still missing)
    movie_dict = {
        'Rank': x + 1,
        'ID': 0,
        'Movie Name': Movie_name,
        'Year': year,
        'Length': 0,
        'Director': director,
        'Stars': stars,
        'Rating': rating,
        'Number of Reviews': reviews,
        'Genres': genres,
        'Language': 0,
        'Country': 0,
        'Budget': 0,
        'Gross box Office Revenue': 0}
    rows.append(movie_dict)

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows)
I can't find a way to obtain the missing information. If anybody here has experience with this kind of task and could share their thoughts, it would help a lot of people. I think the task is not new and has been solved hundreds of times, but IMDb has since changed the classes and the structure of its HTML.
Thanks in advance.
Answer:
BeautifulSoup has many functions for searching elements, so it is worth reading the whole documentation.
You can build more complex lookups by chaining several .find() calls with .parent, etc.
soup.find(text='Language').parent.parent.find('a').text
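A chain like this raises AttributeError as soon as one step returns None, for example when a page has no "Language" row. Below is a minimal sketch of a guarded version. The helper name value_by_label and the inline HTML are made up for illustration; the HTML only imitates the label/value nesting assumed above and is not IMDb's real markup. It uses string=, which is the newer name for the text= argument.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that only imitates a label/value row; not IMDb's actual markup.
SAMPLE_HTML = """
<ul>
  <li><span>Language</span><div><ul><li><a href="#">English</a></li></ul></div></li>
  <li><span>Country of origin</span><div><ul><li><a href="#">United States</a></li></ul></div></li>
</ul>
"""


def value_by_label(soup, label):
    """Return the text of the first link in the row labelled `label`, or None."""
    label_node = soup.find(string=label)
    if label_node is None:
        return None
    # .parent is the tag holding the label text; its parent is the whole row here.
    row = label_node.parent.parent
    if row is None:
        return None
    link = row.find('a')
    return link.get_text(strip=True) if link else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(value_by_label(soup, 'Language'))  # English
print(value_by_label(soup, 'Budget'))    # None, because the sample has no such row
```

Returning None instead of raising lets the scraping loop store a missing value and carry on with the next movie.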
For some elements you can also use data-testid="...."
soup.find('li', {'data-testid': 'title-details-languages'}).find('a').text
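Note that .find('a') returns only the first link, while a row may hold several values (a film with more than one language, for instance). The sketch below collects all of them. The helper name values_by_testid and the inline HTML are assumptions for illustration; only the two data-testid values are taken from this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML reusing the data-testid values from this answer; not IMDb's actual markup.
SAMPLE_HTML = """
<ul>
  <li data-testid="title-details-origin"><span>Country of origin</span>
    <a href="#">United States</a></li>
  <li data-testid="title-details-languages"><span>Languages</span>
    <a href="#">English</a><a href="#">French</a></li>
</ul>
"""


def values_by_testid(soup, testid):
    """Return the text of every link inside the <li> with this data-testid."""
    row = soup.find('li', {'data-testid': testid})
    if row is None:
        return []  # row missing on this page
    return [a.get_text(strip=True) for a in row.find_all('a')]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(values_by_testid(soup, 'title-details-languages'))  # ['English', 'French']
print(values_by_testid(soup, 'title-details-origin'))     # ['United States']
```

A list can be joined with ', '.join(...) before it goes into the dataframe if a single string per cell is preferred.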
Minimal working code (for The Shawshank Redemption):
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1a264172-ae11-42e4-8ef7-7fed1973bb8f&pf_rd_r=A453PT2BTBPG41Y0HKM8&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'

response = requests.get(url)
soup = BS(response.text, 'html.parser')

# Variant 1: find the label text, climb up to the row, take its first link
# (newer BeautifulSoup versions call this argument string= instead of text=)
print('Language:', soup.find(text='Language').parent.parent.find('a').get_text(strip=True))
print('Country of origin:', soup.find(text='Country of origin').parent.parent.find('a').get_text(strip=True))

# The same lookup in a loop
for name in ('Language', 'Country of origin'):
    value = soup.find(text=name).parent.parent.find('a').get_text(strip=True)
    print(name, ':', value)

# Variant 2: select the row directly by its data-testid attribute
print('Language:', soup.find('li', {'data-testid':'title-details-languages'}).find('a').get_text(strip=True))
print('Country of origin:', soup.find('li', {'data-testid':'title-details-origin'}).find('a').get_text(strip=True))

# The same lookup in a loop
for name, testid in ( ('Language', 'title-details-languages'), ('Country of origin', 'title-details-origin')):
    value = soup.find('li', {'data-testid':testid}).find('a').get_text(strip=True)
    print(name, ':', value)
Result:
Language: English
Country of origin: United States
Language : English
Country of origin : United States
Language: English
Country of origin: United States
Language : English
Country of origin : United States
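Some of the requested fields need cleanup rather than another selector: the IMDb id is already part of each title URL, and values such as 2.6M or $25,000,000 have to be converted to numbers. The following is only a sketch using the standard library. The helper names are hypothetical, and the formats are assumed to match the examples in the question.

```python
import re


def imdb_id_from_url(url):
    """Return the digits after 'tt' in a title URL, or None if there are none."""
    match = re.search(r'/title/tt(\d+)', url)
    return match.group(1) if match else None


def parse_count(text):
    """Convert an abbreviated count such as '2.6M' or '11K' to an int."""
    multipliers = {'K': 1_000, 'M': 1_000_000}
    text = text.strip().upper()
    if text and text[-1] in multipliers:
        # round() avoids off-by-one results from floating-point multiplication
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text.replace(',', ''))


def parse_money(text):
    """Convert a string such as '$25,000,000' to an int by keeping only digits."""
    digits = re.sub(r'\D', '', text)
    return int(digits) if digits else None


print(imdb_id_from_url('https://www.imdb.com/title/tt0111161/'))  # 0111161
print(parse_count('2.6M'))          # 2600000
print(parse_money('$25,000,000'))   # 25000000
```

The id is kept as a string so that the leading zero survives. parse_money drops the currency symbol, so the currency would have to be stored separately if it matters.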