I am trying to get all the reviews of a movie from this page: https://www.rottentomatoes.com/m/interstellar_2014/reviews. As you can see, the web page only shows about 19 reviews at a time, so my code below prints only the first 19 reviews and I cannot get the rest.
# First we import the modules needed to open URLs (basically websites) and parse HTML
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

def scrapUrl(URL):
    """Scrape data from a URL, given as a parameter."""
    page = urlopen(URL)
    html_bytes = page.read()
    html = html_bytes.decode("utf-8")
    # print(html)
    soup = BeautifulSoup(html, "html.parser")
    return soup

def findReviews(soup):
    """Collect the text of every div whose class list contains 'the_review'."""
    reviews = []
    for element in soup.find_all("div"):
        classes = element.get("class")  # None when the div has no class attribute
        if classes is not None and 'the_review' in classes:
            reviews.append(element.text)
    dfrev = pd.DataFrame(reviews, columns=['reviews'])
    return dfrev

url = "https://www.rottentomatoes.com/m/interstellar_2014/reviews"
sc = scrapUrl(url)
t = findReviews(sc)
print(t)
CodePudding user response:
You can do this without BeautifulSoup, as Rotten Tomatoes retrieves the reviews from an API. You could first extract the movie ID from the page source with a regex, then loop over API requests until the last page and load the data with pandas:
import pandas as pd
import requests
import re
import time  # needed for time.sleep() in the paging loop

headers = {
    'Referer': 'https://www.rottentomatoes.com/m/notebook/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
s = requests.Session()

def get_reviews(url):
    # Fetch the movie page and pull the movie ID out of its embedded JSON
    r = requests.get(url)
    movie_id = re.findall(r'(?<=movieId":")(.*?)(?=","type)', r.text)[0]
    api_url = f"https://www.rottentomatoes.com/napi/movie/{movie_id}/criticsReviews/all"  # use reviews/user for user reviews
    payload = {
        'direction': 'next',
        'endCursor': '',
        'startCursor': '',
    }
    review_data = []
    while True:
        r = s.get(api_url, headers=headers, params=payload)
        data = r.json()
        # Store this page before checking for the end, so the last page is not dropped
        review_data.extend(data['reviews'])
        if not data['pageInfo']['hasNextPage']:
            break
        # Move the cursors forward to request the next page
        payload['endCursor'] = data['pageInfo']['endCursor']
        payload['startCursor'] = data['pageInfo'].get('startCursor') or ''
        time.sleep(1)  # pause between requests
    return review_data

data = get_reviews('https://www.rottentomatoes.com/m/interstellar_2014/reviews')
df = pd.json_normalize(data)

| | creationDate | isFresh | isRotten | isRtUrl | isTop | reviewUrl | quote | reviewId | scoreOri | scoreSentiment | critic.name | critic.criticPictureUrl | critic.vanity | publication.id | publication.name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Oct 9, 2021 | True | False | False | False | https://www.nerdophiles.com/2014/11/05/interstellar-delivers-beauty-and-complexity-in-typical-nolan-fashion/ | The inherent message of the film brings hope, but it can definitely get waterlogged by intellectual speak and long-winded scenes. | 2830324 | 3/5 | POSITIVE | Therese Lacson | http://resizing.flixster.com/gGcp41zlZQ3sYdSbQoS8AATHp8Y=/128x128/v1.YzszODg1O2o7MTg5OTA7MjA0ODszMDA7MzAw | therese-lacson | 3888 | Nerdophiles |
| 1 | Aug 10, 2021 | True | False | False | False | https://www.centraltrack.com/space-oddity/ | The film is indeed a sight to behold -- and one that demands to be seen on the biggest possible screen. | 2812665 | B | POSITIVE | Kip Mooney | http://resizing.flixster.com/hoYjdO_o-Ip21XnJaWr0C27-nbc=/128x128/v1.YzszOTk2O2o7MTg5OTA7MjA0ODs0MDA7NDAw | kip-mooney | 2577 | Central Track |
| 2 | Feb 2, 2021 | True | False | False | False | http://www.richardcrouse.ca/interstellar-3-stars-one-for-each-hour-of-the-movie-sentimental-sic-fi/ | Nolan reaches for the stars with beautifully composed shots and some mind-bending special effects, but the dime store philosophy of the story never achieves lift off. | 2763105 | 3/5 | POSITIVE | Richard Crouse | http://resizing.flixster.com/Ep5q7RwWq9Ud5KBhnha2sPnsRD0=/128x128/v1.YzszODgxO2o7MTg5OTA7MjA0ODszMDA7MzAw | richard-crouse | 3900 | Nerdophiles |
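
If you want to check the regex and the cursor paging without hitting the site, something like the following offline sketch may help. Everything in it is made up for illustration: `SAMPLE_HTML`, `FAKE_PAGES`, `fake_fetch`, `extract_movie_id` and `collect_all` are hypothetical names, and the fake pages only assume the response has the same `reviews` / `pageInfo` keys used in the code above. It is not a description of the real API.

```python
import re

# Hypothetical fragment of page source; assumes the real page embeds similar JSON.
SAMPLE_HTML = '{"movieId":"abc-123","type":"movie"}'

def extract_movie_id(html):
    """Return the text between movieId":" and ","type, or None if it is absent."""
    match = re.search(r'(?<=movieId":")(.*?)(?=","type)', html)
    return match.group(1) if match else None

# Hypothetical stand-in for the API: each page is keyed by the cursor that requests it.
FAKE_PAGES = {
    '': {'reviews': [{'reviewId': 1}, {'reviewId': 2}],
         'pageInfo': {'hasNextPage': True, 'endCursor': 'c1'}},
    'c1': {'reviews': [{'reviewId': 3}, {'reviewId': 4}],
           'pageInfo': {'hasNextPage': True, 'endCursor': 'c2'}},
    'c2': {'reviews': [{'reviewId': 5}],
           'pageInfo': {'hasNextPage': False, 'endCursor': 'c2'}},
}

def fake_fetch(payload):
    """Stub replacing the HTTP request: look the page up by its endCursor."""
    return FAKE_PAGES[payload['endCursor']]

def collect_all(fetch):
    """Follow endCursor until hasNextPage is False, keeping every page."""
    payload = {'direction': 'next', 'endCursor': '', 'startCursor': ''}
    collected = []
    while True:
        data = fetch(payload)
        collected.extend(data['reviews'])  # keep the page before testing for the end
        if not data['pageInfo']['hasNextPage']:
            break
        payload['endCursor'] = data['pageInfo']['endCursor']
    return collected

movie_id = extract_movie_id(SAMPLE_HTML)
reviews = collect_all(fake_fetch)
print(movie_id, len(reviews))
```

Passing the fetch function in as an argument keeps the paging logic separate from the network call, so the same loop can be pointed at a real request function once the stub behaves as expected.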