find all a href from table

Time:05-11

I'm trying to scrape Rotten Tomatoes with bs4.

My aim is to find all the a hrefs in the table, but I can't get it to work. Can you help me?

https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/

My code is:

from urllib import request
from bs4 import BeautifulSoup as BS
import re
import pandas as pd

url = 'https://www.rottentomatoes.com/top/bestofrt'
html = request.urlopen(url)
bs = BS(html.read(), 'html.parser')


tags = bs.find_all('a', {'class':'articleLink unstyled'})[7:]

links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]

########################################### links ############################################################################

webpages = []

for link in reversed(links):
    
    print(link)
    html = request.urlopen(link)
    bs = BS(html.read(), 'html.parser')
    tags = bs.find_all('a', {'class':'unstyled articleLink'})[43:]
    links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]

    webpages.extend(links)

print(webpages)

I slice the results from index 43 to skip the links that are not movies, but that is only a short-term fix and does not really help. I need a reliable way to scrape just the links in the table, without picking up irrelevant ones.

thanks

CodePudding user response:

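Try this (the attribute filter has to be passed as attrs=, because a positional argument cannot follow a keyword argument):

  tags = bs.find_all(name='a', attrs={'class':'unstyled articleLink'})[43:]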
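
One more thing that may matter with the class filter in the line above: when the value contains a space, Beautiful Soup compares it against the exact string of the class attribute, so the order of the two class names counts. A CSS selector matches both classes in any order. This is only a sketch on made-up markup (SAMPLE_HTML and its links are hypothetical, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, written in a different order.
SAMPLE_HTML = """
<a class="unstyled articleLink" href="/m/first_movie">First</a>
<a class="articleLink unstyled" href="/m/second_movie">Second</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-word class string is matched against the exact attribute value,
# so only the anchor whose classes are written in this order is found.
exact_matches = soup.find_all("a", attrs={"class": "unstyled articleLink"})

# A CSS selector requires both classes but does not care about their order.
css_matches = soup.select("a.unstyled.articleLink")

print(len(exact_matches), len(css_matches))
```

If the two pages you scrape write the classes in a different order, the selector form avoids having to keep two different filter strings.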

CodePudding user response:

Just grab the main table and then extract all the <a> tags.

For example:

import requests
from bs4 import BeautifulSoup

rotten_tomatoes_url = 'https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/'
action_and_adventure = [
    f"https://www.rottentomatoes.com{link.get('href')}"
    for link in
    BeautifulSoup(
        requests.get(rotten_tomatoes_url).text,
        "lxml",
    )
    .find("table", class_="table")
    .find_all("a")
]

print(len(action_and_adventure))
print("\n".join(action_and_adventure[:10]))

Output (all 100 links to movies):

100
https://www.rottentomatoes.com/m/black_panther_2018
https://www.rottentomatoes.com/m/avengers_endgame
https://www.rottentomatoes.com/m/mission_impossible_fallout
https://www.rottentomatoes.com/m/mad_max_fury_road
https://www.rottentomatoes.com/m/spider_man_into_the_spider_verse
https://www.rottentomatoes.com/m/wonder_woman_2017
https://www.rottentomatoes.com/m/logan_2017
https://www.rottentomatoes.com/m/coco_2017
https://www.rottentomatoes.com/m/dunkirk_2017
https://www.rottentomatoes.com/m/star_wars_the_last_jedi
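
If you want to try the same idea without hitting the site, here is a minimal sketch that runs on an inline snippet. SAMPLE_HTML and the table_links helper are made up for illustration, and the live page's markup may differ, so treat the "table" class name as an assumption to check against the real HTML. It also skips anchors that have no href and returns an empty list when the table is missing, instead of raising:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.rottentomatoes.com"

# Hypothetical stand-in for the downloaded page: one navigation link
# outside the table and a few anchors inside it.
SAMPLE_HTML = """
<nav><a class="unstyled articleLink" href="/about">About</a></nav>
<table class="table">
  <tr><td><a class="unstyled articleLink" href="/m/black_panther_2018">Black Panther</a></td></tr>
  <tr><td><a class="unstyled articleLink" href="/m/avengers_endgame">Avengers: Endgame</a></td></tr>
  <tr><td><a>anchor without href</a></td></tr>
</table>
"""


def table_links(html, base_url=BASE_URL):
    """Return absolute URLs for every <a href> inside the first table.table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table")
    if table is None:
        # No matching table on the page: nothing to collect.
        return []
    # href=True keeps only anchors that actually carry an href attribute.
    return [urljoin(base_url, a["href"]) for a in table.find_all("a", href=True)]


links = table_links(SAMPLE_HTML)
print("\n".join(links))
```

Because the search starts from the table element, the navigation link is never seen, so there is no need for a slice like [43:].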