I'm trying to scrape Rotten Tomatoes with bs4.
My aim is to collect the href of every a tag in the table, but I can't get it to work. Can you help me?
https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/
My code is:
from urllib import request
from bs4 import BeautifulSoup as BS
import re
import pandas as pd
url = 'https://www.rottentomatoes.com/top/bestofrt'
html = request.urlopen(url)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class':'articleLink unstyled'})[7:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
########################################### links ############################################################################
webpages = []
for link in reversed(links):
    print(link)
    html = request.urlopen(link)
    bs = BS(html.read(), 'html.parser')
    # [43:] is a stopgap to skip the links that are not movies
    tags = bs.find_all('a', {'class':'unstyled articleLink'})[43:]
    links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
    webpages.extend(links)
print(webpages)
I slice the results from index 43 onward to skip the links that aren't movies, but that's only a short-term fix and doesn't really help. I need a proper way to scrape just the links in the table, without picking up irrelevant ones.
Thanks.
CodePudding user response:
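Try this (the attribute dict has to be passed as attrs=, since a positional argument can't follow a keyword argument):
tags = bs.find_all(name='a', attrs={'class':'unstyled articleLink'})[43:]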
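One thing worth knowing about the class filter: when you pass a string containing several classes, Beautiful Soup compares it against the exact value of the class attribute, so 'unstyled articleLink' and 'articleLink unstyled' are not interchangeable. A CSS selector matches the classes in any order. Here is a small sketch on made-up HTML (the sample markup is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only to illustrate the matching rules.
SAMPLE_HTML = """
<a class="unstyled articleLink" href="/m/one">One</a>
<a class="articleLink unstyled" href="/m/two">Two</a>
<a class="articleLink" href="/m/three">Three</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-class string is matched against the exact attribute value.
exact = [a["href"] for a in soup.find_all("a", attrs={"class": "unstyled articleLink"})]

# A CSS selector requires both classes but ignores their order.
any_order = [a["href"] for a in soup.select("a.unstyled.articleLink")]

print(exact)
print(any_order)
```

The exact-string search only picks up the first link, while the selector picks up the first two.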
CodePudding user response:
Just grab the main table and then extract all the <a> tags from it.
For example:
import requests
from bs4 import BeautifulSoup

rotten_tomatoes_url = 'https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/'
action_and_adventure = [
    f"https://www.rottentomatoes.com{link.get('href')}"
    for link in
    BeautifulSoup(
        requests.get(rotten_tomatoes_url).text,
        "lxml",
    )
    .find("table", class_="table")
    .find_all("a")
]
print(len(action_and_adventure))
print("\n".join(action_and_adventure[:10]))
Output (100 movie links found, first 10 printed):
100
https://www.rottentomatoes.com/m/black_panther_2018
https://www.rottentomatoes.com/m/avengers_endgame
https://www.rottentomatoes.com/m/mission_impossible_fallout
https://www.rottentomatoes.com/m/mad_max_fury_road
https://www.rottentomatoes.com/m/spider_man_into_the_spider_verse
https://www.rottentomatoes.com/m/wonder_woman_2017
https://www.rottentomatoes.com/m/logan_2017
https://www.rottentomatoes.com/m/coco_2017
https://www.rottentomatoes.com/m/dunkirk_2017
https://www.rottentomatoes.com/m/star_wars_the_last_jedi
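If you want to reuse this for the other category pages, the same idea can be wrapped in a function. This is only a sketch: the function name extract_table_links and the sample HTML are made up for illustration, and the table lookup is the one from the answer above, so it may need adjusting if the page layout changes. It uses the built-in html.parser so lxml isn't required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.rottentomatoes.com"


def extract_table_links(html, base_url=BASE_URL):
    """Return absolute URLs for every <a href> inside the first <table class="table">."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table")
    if table is None:
        # No such table on the page (layout changed, request blocked, ...).
        return []
    # href=True skips anchors that have no href attribute at all.
    return [urljoin(base_url, a["href"]) for a in table.find_all("a", href=True)]


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<a class="unstyled articleLink" href="/news/">News (outside the table)</a>
<table class="table">
  <tr><td><a class="unstyled articleLink" href="/m/black_panther_2018">Black Panther</a></td></tr>
  <tr><td><a class="unstyled articleLink" href="/m/avengers_endgame">Avengers: Endgame</a></td></tr>
</table>
"""

movie_links = extract_table_links(SAMPLE_HTML)
print(movie_links)
```

For a real page you would pass requests.get(url).text instead of the sample string. Because the search starts from the table, links elsewhere on the page are never collected, so no slicing by index is needed.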