Scrape IMDB reviews and rating using Selenium

Time:04-27

I'm trying to scrape reviews and rating information for specific movies on IMDB. Here is my code for scraping rating:

 try:
     rating = review.find_element_by_css_selector('[class = "rating-other-user-rating"]')
     star_rating.append(rating.text)
 except:
     rating = None

Here is the HTML:

<span class="rating-other-user-rating">
        <svg  xmlns="http://www.w3.org/2000/svg" fill="#000000" height="24" viewBox="0 0 24 24" width="24">
            <path d="M0 0h24v24H0z" fill="none"></path>
            <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"></path>
            <path d="M0 0h24v24H0z" fill="none"></path>
        </svg>
            <span>7</span><span >/10</span>
        </span>

Questions:

  1. I need to retrieve "7" from the HTML above. What am I missing in my code? I think the problem is that the rating sits in a span tag with no class or ID, and I can't figure out how to select it. I would really appreciate the help. Thanks.

  2. How can I scrape a certain number of reviews from IMDB? For example, if I want to scrape just 50 reviews. I tried using the code below but that doesn't work. The program continues executing and doesn't stop at 50:

      nextbutton = WebDriverWait(driver,5).until(EC.presence_of_element_located((By.CLASS_NAME,'ipl-load-more__button')))
    
      if len(movie_title) == 50: # movie_title is the number of reviews titles scraped so far. 50 is ideal
         break
    
      nextbutton.click()
    

CodePudding user response:

Try:

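try:
    # On Selenium 4.3+, use review.find_element(By.CSS_SELECTOR, ...) instead
    rating = review.find_element_by_css_selector('.rating-other-user-rating span')
    # Selenium WebElements expose .text; .contents belongs to BeautifulSoup
    star_rating.append(rating.text)
except Exception:
    rating = None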

Proof that the selector works, using BeautifulSoup on the same HTML:

html = '''
<span class="rating-other-user-rating">
 <svg  fill="#000000" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg">
  <path d="M0 0h24v24H0z" fill="none">
  </path>
  <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z">
  </path>
  <path d="M0 0h24v24H0z" fill="none">
  </path>
 </svg>
 <span>
  7
 </span>
 <span >
  /10
 </span>
</span>


'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'html.parser')

rating=soup.select_one('.rating-other-user-rating span')
print(rating.contents[0].strip())

Output:

7
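
If the outer span is easier to locate than the inner one, another option is to read its whole text and pull the score out afterwards. The sketch below assumes that text looks like "7/10", possibly with extra whitespace. parse_rating is a hypothetical helper name, not part of Selenium or BeautifulSoup.

```python
import re
from typing import Optional


def parse_rating(text: str) -> Optional[int]:
    """Return the score from text such as '7/10', or None if no score is found."""
    # Assumption: the score is an integer followed by "/10", with optional whitespace
    match = re.search(r"(\d+)\s*/\s*10", text)
    return int(match.group(1)) if match else None


print(parse_rating("7/10"))
print(parse_rating("\n 7\n /10"))
print(parse_rating("no rating here"))
```

Returning None for a review without a rating lets you append a placeholder to star_rating, so the list stays aligned with the other per-review lists.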

CodePudding user response:

You were close. The rating score of 7 is inside a <span> that is the second child of its parent <span>, right after the <svg>:

<span class="rating-other-user-rating">
    <svg  xmlns="http://www.w3.org/2000/svg" fill="#000000" height="24" viewBox="0 0 24 24" width="24">
        <path d="M0 0h24v24H0z" fill="none"></path>
        <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"></path>
        <path d="M0 0h24v24H0z" fill="none"></path>
    </svg>
    <span>7</span>
    <span >/10</span>
</span>

Solution

To extract the text 7, ideally wait for the element to become visible using WebDriverWait with visibility_of_element_located(). You can use either of the following locator strategies:

  • Using CSS_SELECTOR and text attribute:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.rating-other-user-rating span:first-of-type"))).text)
    
  • Using XPATH and get_attribute("innerHTML"):

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@class='rating-other-user-rating']//span[not(@class)]"))).get_attribute("innerHTML"))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in "How to retrieve the text of a WebElement using Selenium - Python".
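
On the second question (stopping at 50 reviews): the full loop is not shown, so this is only a guess, but one likely cause is that each click on the load-more button adds a whole batch of reviews, so the count can jump past 50 without ever being exactly 50. Below is a minimal sketch of a loop that stops at a limit. collect_until, read_items and load_more are hypothetical names, and the two callables stand in for your own Selenium code.

```python
from typing import Callable, List, Sequence


def collect_until(limit: int,
                  read_items: Callable[[], Sequence[str]],
                  load_more: Callable[[], bool]) -> List[str]:
    """Return at most `limit` items, loading more only while needed."""
    previous_count = -1
    while True:
        items = list(read_items())
        # Compare with >= instead of ==, then trim to the exact limit
        if len(items) >= limit:
            return items[:limit]
        # Stop if the last load added nothing, or if nothing more can be loaded
        if len(items) == previous_count or not load_more():
            return items
        previous_count = len(items)


# Stand-in for a page that shows 25 reviews per batch, 75 in total
page = {"shown": 25, "total": 75}


def read_fake_titles() -> List[str]:
    return ["review %d" % i for i in range(page["shown"])]


def click_fake_load_more() -> bool:
    if page["shown"] >= page["total"]:
        return False
    page["shown"] += 25
    return True


titles = collect_until(50, read_fake_titles, click_fake_load_more)
print(len(titles))
```

With Selenium, read_items would return the review titles currently on the page, and load_more would click the ipl-load-more__button element, wait for the new reviews to render, and return False once the button can no longer be found.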
