How do I remove the <a href... tags from my web scraper

Time:03-11

I'm trying to scrape a table from rottentomatoes.com, but every time I run the code it just prints <a href tags. For now, all I want are the movie titles, numbered.

This is my code so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}

titles = []
year_released = []

def get_requests():
  try:
    result = requests.get(url=url)

    soup = BeautifulSoup(result.text, 'html.parser')
    table = soup.find('table', class_='table')

    for name in table:
      td = soup.find_all('a', class_='unstyled articleLink')
      titles.append(td)
      print(titles)
      break
  except:
    print("The result could not get fetched")

get_requests()

And this is my output:

[[Opening This Week, Top Box Office, Coming Soon to Theaters, Weekend Earnings, Certified Fresh Movies, On Dvd & Streaming, VUDU, Netflix Streaming, iTunes, Amazon and Amazon Prime, Top DVD & Streaming, New Releases, Coming Soon to DVD, Certified Fresh Movies, Browse All, Top Movies, Trailers, Forums, View All , View All , Top TV Shows, Certified Fresh TV, 24 Frames, All-Time Lists, Binge Guide, Comics on TV, Countdown, Critics Consensus, Five Favorite Films, Now Streaming, Parental Guidance, Red Carpet Roundup, Scorecards, Sub-Cult, Total Recall, Video Interviews, Weekend Box Office, Weekly Ketchup, What to Watch, The Zeros, View All, View All, View All, It Happened One Night (1934), Citizen Kane (1941), The Wizard of Oz (1939), Modern Times (1936), Black Panther (2018), Parasite (Gisaengchung) (2019), Avengers: Endgame (2019), Casablanca (1942), Knives Out (2019), Us (2019), Toy Story 4 (2019), Lady Bird (2017), Mission: Impossible - Fallout (2018), BlacKkKlansman (2018), Get Out (2017), The Irishman (2019), The Godfather (1972), Mad Max: Fury Road (2015), Spider-Man: Into the Spider-Verse (2018), Moonlight (2016), Sunset Boulevard (1950), All About Eve (1950), The Cabinet of Dr. Caligari (Das Cabinet des Dr. Caligari) (1920), The Philadelphia Story (1940), Roma (2018), Wonder Woman (2017), A Star Is Born (2018), Inside Out (2015), A Quiet Place (2018), One Night in Miami (2020), Eighth Grade (2018), Rebecca (1940), Booksmart (2019), Logan (2017), His Girl Friday (1940), Portrait of a Lady on Fire (Portrait de la jeune fille en feu) (2020), Coco (2017), Dunkirk (2017), Star Wars: The Last Jedi (2017), A Night at the Opera (1935), The Shape of Water (2017), Thor: Ragnarok (2017), Spotlight (2015), The Farewell (2019), Selma (2014), The Third Man (1949), Rear Window (1954), E.T. 
The Extra-Terrestrial (1982), Seven Samurai (Shichinin no Samurai) (1956), La Grande illusion (Grand Illusion) (1938), Arrival (2016), Singin' in the Rain (1952), The Favourite (2018), Double Indemnity (1944), All Quiet on the Western Front (1930), Snow White and the Seven Dwarfs (1937), Marriage Story (2019), The Big Sick (2017), On the Waterfront (1954), Star Wars: Episode VII - The Force Awakens (2015), An American in Paris (1951), The Best Years of Our Lives (1946), Metropolis (1927), Boyhood (2014), Gravity (2013), Leave No Trace (2018), The Maltese Falcon (1941), The Invisible Man (2020), 12 Years a Slave (2013), Once Upon a Time In Hollywood (2019), Argo (2012), Soul (2020), Ma Rainey's Black Bottom (2020), The Kid (1921), Manchester by the Sea (2016), Nosferatu, a Symphony of Horror (Nosferatu, eine Symphonie des Grauens) (Nosferatu the Vampire) (1922), The Adventures of Robin Hood (1938), La La Land (2016), North by Northwest (1959), Laura (1944), Spider-Man: Far From Home (2019), Incredibles 2 (2018), Zootopia (2016), Alien (1979), King Kong (1933), Shadow of a Doubt (1943), Call Me by Your Name (2018), Psycho (1960), 1917 (2020), L.A. Confidential (1997), The Florida Project (2017), War for the Planet of the Apes (2017), Paddington 2 (2018), A Hard Day's Night (1964), Widows (2018), Never Rarely Sometimes Always (2020), Baby Driver (2017), Spider-Man: Homecoming (2017), The Godfather, Part II (1974), The Battle of Algiers (La Battaglia di Algeri) (1967), View All, View All]]

CodePudding user response:

You can use pandas.read_html() to read the table data directly:

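from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.rottentomatoes.com/top/bestofrt/'
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
# Pick the first element with the CSS class "table"
table = soup.select_one('.table')
# read_html returns a list of DataFrames; wrapping the markup in StringIO
# avoids the deprecation warning newer pandas versions raise for raw strings
table_data = pd.read_html(StringIO(str(table)))[0]
print(table_data)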

Output:

     Rank  ... No. of Reviews
0     1.0  ...             98
1     2.0  ...            121
2     3.0  ...            160
3     4.0  ...            109
4     5.0  ...            525
..    ...  ...            ...
95   96.0  ...            236
96   97.0  ...            396
97   98.0  ...            395
98   99.0  ...            114
99  100.0  ...             89

[100 rows x 4 columns]
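
If only the numbered titles are needed, they can be pulled out of the DataFrame afterwards. The sketch below uses a small hand-built frame as a stand-in for table_data. The column name "Title" is an assumption, because the printed output elides the middle columns, so check table_data.columns on the real data first.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table. "Rank" and "No. of Reviews"
# appear in the printed output; "Title" is an assumed column name.
table_data = pd.DataFrame({
    "Rank": [1.0, 2.0, 3.0],
    "Title": [
        "It Happened One Night (1934)",
        "Citizen Kane (1941)",
        "The Wizard of Oz (1939)",
    ],
    "No. of Reviews": [98, 121, 160],
})

# Inspect the actual column names before relying on any of them.
print(table_data.columns.tolist())

# The output shows Rank as floats (1.0, 2.0, ...); cast to int for display.
# This assumes no rank is missing, since NaN cannot be cast to int.
table_data["Rank"] = table_data["Rank"].astype(int)

# Build "rank. title" strings from the two columns.
numbered_titles = [
    f"{rank}. {title}"
    for rank, title in zip(table_data["Rank"], table_data["Title"])
]
print("\n".join(numbered_titles))
```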

CodePudding user response:

Reading the table via pandas.read_html(), as suggested by @F.Hoque, is probably the leaner approach, but you can also get your results with BeautifulSoup alone.

Iterate over all <tr> elements of the <table>, extract the information from the tags via .text / .get_text(), and store it in a structured way as a list of dicts:

data = []

for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        'title': row.a.text.split(' (')[0].strip(),
        'releaseYear': row.a.text.split(' (')[1][:-1]
    })
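
As for why the question's code printed tags: soup.find_all() returns Tag objects, and it was called on the whole soup rather than on the table, which is why navigation links such as "Top Movies" show up as well. A minimal sketch of the difference, run against hand-written markup that only imitates the real page (the HTML below is an assumption, not the site's actual source):

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup standing in for the real page.
html = """
<table class="table">
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1.</td><td><a class="unstyled articleLink" href="/m/a">
      It Happened One Night (1934)</a></td></tr>
  <tr><td>2.</td><td><a class="unstyled articleLink" href="/m/b">
      Citizen Kane (1941)</a></td></tr>
</table>
<a class="unstyled articleLink" href="/nav">Top Movies</a>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table")

# Search inside the table, not the whole soup, so the navigation link is skipped.
links = table.find_all("a", class_="unstyled articleLink")

# Each item is a Tag; get_text(strip=True) returns only its visible text.
titles = [link.get_text(strip=True) for link in links]

# Print the titles numbered from 1.
for number, title in enumerate(titles, start=1):
    print(f"{number}. {title}")
```

Appending the text of each tag, instead of the list returned by find_all(), is what keeps the markup out of the result.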

Example

import requests
from bs4 import BeautifulSoup

url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}

result = requests.get(url=url, headers=headers)
soup = BeautifulSoup(result.text, 'html.parser')

data = []

for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        'title': row.a.text.split(' (')[0].strip(),
        'releaseYear': row.a.text.split(' (')[1][:-1]
    })

data

Output

[{'rank': '1.', 'title': 'It Happened One Night', 'releaseYear': '1934'},
 {'rank': '2.', 'title': 'Citizen Kane', 'releaseYear': '1941'},
 {'rank': '3.', 'title': 'The Wizard of Oz', 'releaseYear': '1939'},
 {'rank': '4.', 'title': 'Modern Times', 'releaseYear': '1936'},
 {'rank': '5.', 'title': 'Black Panther', 'releaseYear': '2018'},...]
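
One caveat: split(' (')[1] takes the first parenthesised group, so for a title such as "Parasite (Gisaengchung) (2019)", which appears in the question's output, it would yield 'Gisaengchung' rather than the year. A possible workaround is a regular expression anchored at the end of the string. The helper name split_title_year is hypothetical, and the sketch assumes the year is always four digits in trailing parentheses:

```python
import re

# "<title> (<4-digit year>)" with the year anchored at the end of the string.
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)$")


def split_title_year(text):
    """Return (title, year) for 'Title (Year)'; year is None if no match."""
    text = text.strip()
    match = TITLE_YEAR.match(text)
    if match is None:
        # No trailing "(YYYY)": hand the text back unchanged.
        return text, None
    return match.group("title"), match.group("year")


print(split_title_year("Parasite (Gisaengchung) (2019)"))
print(split_title_year("It Happened One Night (1934)"))
```

Inside the loop this could replace the two split() calls, for example title, year = split_title_year(row.a.text).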