Scraping all reviews of a movie from Rotten Tomatoes using BeautifulSoup


I am trying to get all reviews of a movie from this page: https://www.rottentomatoes.com/m/interstellar_2014/reviews. However, the web page only shows about 19 reviews at a time, so my code below retrieves only the first 19 reviews instead of all of them.

# Import the modules needed to open URLs and parse HTML
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

def scrapUrl(URL):
  """Fetch the page at URL and return it as a BeautifulSoup object."""
  page = urlopen(URL)
  html_bytes = page.read()
  html = html_bytes.decode("utf-8")
  soup = BeautifulSoup(html, "html.parser")
  return soup

def findReviews(soup):
  """Collect the text of every <div> whose class list contains 'the_review'."""
  reviews = []
  for element in soup.find_all("div"):
    i = element.get("class")
    if i is not None:
      if 'the_review' in i:
        # The loop body was missing from the post; appending the review text is the assumed intent
        reviews.append(element.get_text(strip=True))
  dfrev = pd.DataFrame(reviews, columns=['reviews'])
  return dfrev

url = "https://www.rottentomatoes.com/m/interstellar_2014/reviews"
sc = scrapUrl(url)
t = findReviews(sc)
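
As a side note, the class check in findReviews can probably be written more directly with BeautifulSoup's `class_` filter. The sketch below runs against a small hypothetical HTML snippet (the markup and the name `extract_reviews` are made up for illustration, not taken from the real page), so it only demonstrates the selection step, not the pagination problem:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a review page; not copied from the real site
SAMPLE_HTML = """
<div class="review_table">
  <div class="the_review">A sight to behold.</div>
  <div class="review_date">Oct 9, 2021</div>
  <div class="the_review small">Never achieves lift off.</div>
</div>
"""

def extract_reviews(html):
    """Return the stripped text of every <div> carrying the 'the_review' class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any element whose class list includes the given name
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_="the_review")]

reviews = extract_reviews(SAMPLE_HTML)
print(reviews)
```

This still only sees whatever reviews are present in the HTML that was downloaded, so it does not by itself get past the first page.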

CodePudding user response:

You can do this without BeautifulSoup, because Rotten Tomatoes loads the reviews from an API. You can first extract the movie id from the page source with a regex, then loop over API requests until the last page and load the collected data with pandas:

import pandas as pd
import requests
import re

headers = {
    'Referer': 'https://www.rottentomatoes.com/m/notebook/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

s = requests.Session()

def get_reviews(url):
    # Fetch the review page and pull the movie id out of the embedded JSON
    r = requests.get(url)
    movie_id = re.findall(r'(?<=movieId":")(.*)(?=","type)',r.text)[0]

    api_url = f"https://www.rottentomatoes.com/napi/movie/{movie_id}/criticsReviews/all" # use reviews/user for user reviews
    payload = {
        'direction': 'next',
        'endCursor': '',
        'startCursor': '',
    }

    review_data = []
    while True:
        r = s.get(api_url, headers=headers, params=payload)
        data = r.json()

        # Assumption: the review list sits under the 'reviews' key; the collecting line was missing from the post
        review_data.extend(data['reviews'])

        # Stop once the API reports that there are no further pages
        if not data['pageInfo']['hasNextPage']:
            break

        # Move the cursors forward so the next request returns the following page
        payload['endCursor'] = data['pageInfo']['endCursor']
        payload['startCursor'] = data['pageInfo']['startCursor'] if data['pageInfo'].get('startCursor') else ''

    return review_data

data = get_reviews('https://www.rottentomatoes.com/m/interstellar_2014/reviews')
df = pd.json_normalize(data)
| | creationDate | isFresh | isRotten | isRtUrl | isTop | reviewUrl | quote | reviewId | scoreOri | scoreSentiment | critic.name | critic.criticPictureUrl | critic.vanity | publication.id | publication.name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Oct 9, 2021 | True | False | False | False | https://www.nerdophiles.com/2014/11/05/interstellar-delivers-beauty-and-complexity-in-typical-nolan-fashion/ | The inherent message of the film brings hope, but it can definitely get waterlogged by intellectual speak and long-winded scenes. | 2830324 | 3/5 | POSITIVE | Therese Lacson | http://resizing.flixster.com/gGcp41zlZQ3sYdSbQoS8AATHp8Y=/128x128/v1.YzszODg1O2o7MTg5OTA7MjA0ODszMDA7MzAw | therese-lacson | 3888 | Nerdophiles |
| 1 | Aug 10, 2021 | True | False | False | False | https://www.centraltrack.com/space-oddity/ | The film is indeed a sight to behold -- and one that demands to be seen on the biggest possible screen. | 2812665 | B | POSITIVE | Kip Mooney | http://resizing.flixster.com/hoYjdO_o-Ip21XnJaWr0C27-nbc=/128x128/v1.YzszOTk2O2o7MTg5OTA7MjA0ODs0MDA7NDAw | kip-mooney | 2577 | Central Track |
| 2 | Feb 2, 2021 | True | False | False | False | http://www.richardcrouse.ca/interstellar-3-stars-one-for-each-hour-of-the-movie-sentimental-sic-fi/ | Nolan reaches for the stars with beautifully composed shots and some mind-bending special effects, but the dime store philosophy of the story never achieves lift off. | 2763105 | 3/5 | POSITIVE | Richard Crouse | http://resizing.flixster.com/Ep5q7RwWq9Ud5KBhnha2sPnsRD0=/128x128/v1.YzszODgxO2o7MTg5OTA7MjA0ODszMDA7MzAw | richard-crouse | 3900 | Richard Crouse |
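
To see the two ideas from the answer (regex id extraction and cursor-based paging) without hitting the network, here is a minimal offline sketch. Everything in it is a stand-in: `FAKE_PAGES`, the sample id `abc-123`, the helper names, and the `'reviews'` key are assumptions for illustration rather than the real API's responses. The regex is a non-greedy variant of the one above, which should stop at the first `","type` if the page source happens to contain several.

```python
import re

def extract_movie_id(page_source):
    """Pull the id out of source text that embeds '"movieId":"<id>","type'."""
    # Non-greedy (.*?) so the match ends at the first '","type' that follows
    match = re.search(r'(?<=movieId":")(.*?)(?=","type)', page_source)
    return match.group(1) if match else None

def collect_all(fetch_page):
    """Follow endCursor values until hasNextPage is false, keeping every page's items."""
    items, cursor = [], ''
    while True:
        data = fetch_page(cursor)
        items.extend(data['reviews'])  # 'reviews' is an assumed key name
        if not data['pageInfo']['hasNextPage']:
            break
        cursor = data['pageInfo']['endCursor']
    return items

# Hypothetical stand-in for the API: three pages keyed by the cursor that requests them
FAKE_PAGES = {
    '': {'reviews': [{'quote': 'a'}, {'quote': 'b'}],
         'pageInfo': {'hasNextPage': True, 'endCursor': 'c1'}},
    'c1': {'reviews': [{'quote': 'c'}],
           'pageInfo': {'hasNextPage': True, 'endCursor': 'c2'}},
    'c2': {'reviews': [{'quote': 'd'}],
           'pageInfo': {'hasNextPage': False}},
}

movie_id = extract_movie_id('{"movieId":"abc-123","type":"movie"}')
all_reviews = collect_all(lambda cursor: FAKE_PAGES[cursor])
print(movie_id, len(all_reviews))
```

Collecting the items before checking `hasNextPage` means the final page is kept as well; swapping the lambda for a function that calls the real endpoint would be the only change needed, assuming the response has the shape sketched here.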