Scraping all reviews of a movie from Rotten Tomatoes using BeautifulSoup

I am trying to get all reviews of a movie from here: https://www.rottentomatoes.com/m/interstellar_2014/reviews. As you can see, the web page only shows about 19 reviews at a time, so my code below prints only the first 19 reviews instead of all of them.

# Import the modules needed to open URLs and parse HTML
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

def scrapUrl(URL):
  """Fetch the page at URL and return it as a BeautifulSoup object."""
  page = urlopen(URL)
  html_bytes = page.read()
  html = html_bytes.decode("utf-8")
  soup = BeautifulSoup(html, "html.parser")
  return soup

def findReviews(soup):
  """Collect the text of every div whose class list includes 'the_review'."""
  reviews = []
  for element in soup.find_all("div"):
    classes = element.get("class")
    # element.get returns None when the div has no class attribute
    if classes is not None and 'the_review' in classes:
      reviews.append(element.text)
  dfrev = pd.DataFrame(reviews, columns=['reviews'])
  return dfrev

url = "https://www.rottentomatoes.com/m/interstellar_2014/reviews"
sc = scrapUrl(url)
t = findReviews(sc)
print(t)

CodePudding user response:

You can do this without BeautifulSoup, because Rotten Tomatoes loads the reviews from an API. First extract the movie ID from the page source with a regex, then loop over API requests until the last page and load the data with pandas:

import pandas as pd
import requests
import re
import time  # needed for time.sleep() in the request loop

headers = {
    'Referer': 'https://www.rottentomatoes.com/m/notebook/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

s = requests.Session()
        
def get_reviews(url):
    r = requests.get(url)
    movie_id = re.findall(r'(?<=movieId":")(.*)(?=","type)',r.text)[0]

    api_url = f"https://www.rottentomatoes.com/napi/movie/{movie_id}/criticsReviews/all"  # use reviews/user for user reviews
    
    payload = {
        'direction': 'next',
        'endCursor': '',
        'startCursor': '',
    }
    
    review_data = []
    
    while True:
        r = s.get(api_url, headers=headers, params=payload)
        data = r.json()

        # store this page's reviews before checking for more pages,
        # otherwise the reviews on the last page would be dropped
        review_data.extend(data['reviews'])

        if not data['pageInfo']['hasNextPage']:
            break

        payload['endCursor'] = data['pageInfo']['endCursor']
        payload['startCursor'] = data['pageInfo'].get('startCursor') or ''

        time.sleep(1)  # pause between requests to go easy on the server
    
    return review_data

data = get_reviews('https://www.rottentomatoes.com/m/interstellar_2014/reviews')
df = pd.json_normalize(data)

	creationDate	isFresh	isRotten	isRtUrl	isTop	reviewUrl	quote	reviewId	scoreOri	scoreSentiment	critic.name	critic.criticPictureUrl	critic.vanity	publication.id	publication.name
0	Oct 9, 2021	True	False	False	False	https://www.nerdophiles.com/2014/11/05/interstellar-delivers-beauty-and-complexity-in-typical-nolan-fashion/	The inherent message of the film brings hope, but it can definitely get waterlogged by intellectual speak and long-winded scenes.	2830324	3/5	POSITIVE	Therese Lacson	http://resizing.flixster.com/gGcp41zlZQ3sYdSbQoS8AATHp8Y=/128x128/v1.YzszODg1O2o7MTg5OTA7MjA0ODszMDA7MzAw	therese-lacson	3888	Nerdophiles
1	Aug 10, 2021	True	False	False	False	https://www.centraltrack.com/space-oddity/	The film is indeed a sight to behold -- and one that demands to be seen on the biggest possible screen.	2812665	B	POSITIVE	Kip Mooney	http://resizing.flixster.com/hoYjdO_o-Ip21XnJaWr0C27-nbc=/128x128/v1.YzszOTk2O2o7MTg5OTA7MjA0ODs0MDA7NDAw	kip-mooney	2577	Central Track
2	Feb 2, 2021	True	False	False	False	http://www.richardcrouse.ca/interstellar-3-stars-one-for-each-hour-of-the-movie-sentimental-sic-fi/	Nolan reaches for the stars with beautifully composed shots and some mind-bending special effects, but the dime store philosophy of the story never achieves lift off.	2763105	3/5	POSITIVE	Richard Crouse	http://resizing.flixster.com/Ep5q7RwWq9Ud5KBhnha2sPnsRD0=/128x128/v1.YzszODgxO2o7MTg5OTA7MjA0ODszMDA7MzAw	richard-crouse	3900	Richard Crouse
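
To try the ID extraction and the cursor loop without hitting the site, here is a minimal offline sketch. It assumes the page source contains a `"movieId":"..."` fragment and that each API response carries `reviews` plus `pageInfo` with `hasNextPage` and `endCursor`, as in the answer above. The names `extract_movie_id`, `collect_reviews`, `FAKE_PAGES` and `sample_html` are hypothetical, and the sample data is made up; the real site may change its markup or response shape at any time.

```python
import re


def extract_movie_id(html):
    """Return the first movieId value found in the page source, or None."""
    # Non-greedy capture so the match stops at the first closing quote.
    match = re.search(r'"movieId":"(.*?)"', html)
    return match.group(1) if match else None


def collect_reviews(fetch_page):
    """Follow endCursor values until hasNextPage is false.

    fetch_page is a hypothetical callable that takes a cursor string and
    returns a dict shaped like the API responses described above.
    """
    reviews = []
    cursor = ''
    while True:
        data = fetch_page(cursor)
        # Keep this page's reviews before deciding whether to stop,
        # so the final page is not lost.
        reviews.extend(data['reviews'])
        if not data['pageInfo']['hasNextPage']:
            break
        cursor = data['pageInfo']['endCursor']
    return reviews


# Made-up pages standing in for three consecutive API responses,
# keyed by the cursor that would request them.
FAKE_PAGES = {
    '': {'reviews': [{'quote': 'a'}, {'quote': 'b'}],
         'pageInfo': {'hasNextPage': True, 'endCursor': 'c1'}},
    'c1': {'reviews': [{'quote': 'c'}],
           'pageInfo': {'hasNextPage': True, 'endCursor': 'c2'}},
    'c2': {'reviews': [{'quote': 'd'}],
           'pageInfo': {'hasNextPage': False, 'endCursor': None}},
}

# Made-up page fragment; the real page source is much larger.
sample_html = '<script>{"movieId":"abc-123","type":"movie"}</script>'

movie_id = extract_movie_id(sample_html)
all_reviews = collect_reviews(FAKE_PAGES.__getitem__)
print(movie_id, [r['quote'] for r in all_reviews])
```

Passing the fetch step in as a callable keeps the loop testable: in real use, `fetch_page` could wrap the `s.get(api_url, headers=headers, params=payload)` call and return `r.json()`.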