I'm trying to scrape a table from rottentomatoes.com, but every time I run the code it just prints the <a href> tags. For now, all I want are the movie titles, numbered.
This is my code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
titles = []
year_released = []

def get_requests():
    try:
        result = requests.get(url=url)
        soup = BeautifulSoup(result.text, 'html.parser')
        table = soup.find('table', class_='table')
        for name in table:
            td = soup.find_all('a', class_='unstyled articleLink')
            titles.append(td)
            print(titles)
            break
    except:
        print("The result could not get fetched")
And this is my output:
[[Opening This Week, Top Box Office, Coming Soon to Theaters, Weekend Earnings, Certified Fresh Movies, On Dvd & Streaming, VUDU, Netflix Streaming, iTunes, Amazon and Amazon Prime, Top DVD & Streaming, New Releases, Coming Soon to DVD, Certified Fresh Movies, Browse All, Top Movies, Trailers, Forums, View All , View All , Top TV Shows, Certified Fresh TV, 24 Frames, All-Time Lists, Binge Guide, Comics on TV, Countdown, Critics Consensus, Five Favorite Films, Now Streaming, Parental Guidance, Red Carpet Roundup, Scorecards, Sub-Cult, Total Recall, Video Interviews, Weekend Box Office, Weekly Ketchup, What to Watch, The Zeros, View All, View All, View All, It Happened One Night (1934), Citizen Kane (1941), The Wizard of Oz (1939), Modern Times (1936), Black Panther (2018), Parasite (Gisaengchung) (2019), Avengers: Endgame (2019), Casablanca (1942), Knives Out (2019), Us (2019), Toy Story 4 (2019), Lady Bird (2017), Mission: Impossible - Fallout (2018), BlacKkKlansman (2018), Get Out (2017), The Irishman (2019), The Godfather (1972), Mad Max: Fury Road (2015), Spider-Man: Into the Spider-Verse (2018), Moonlight (2016), Sunset Boulevard (1950), All About Eve (1950), The Cabinet of Dr. Caligari (Das Cabinet des Dr. Caligari) (1920), The Philadelphia Story (1940), Roma (2018), Wonder Woman (2017), A Star Is Born (2018), Inside Out (2015), A Quiet Place (2018), One Night in Miami (2020), Eighth Grade (2018), Rebecca (1940), Booksmart (2019), Logan (2017), His Girl Friday (1940), Portrait of a Lady on Fire (Portrait de la jeune fille en feu) (2020), Coco (2017), Dunkirk (2017), Star Wars: The Last Jedi (2017), A Night at the Opera (1935), The Shape of Water (2017), Thor: Ragnarok (2017), Spotlight (2015), The Farewell (2019), Selma (2014), The Third Man (1949), Rear Window (1954), E.T. The Extra-Terrestrial (1982), Seven Samurai (Shichinin no Samurai) (1956), La Grande illusion (Grand Illusion) (1938), Arrival (2016), Singin' in the Rain (1952), The Favourite (2018), Double Indemnity (1944), All Quiet on the Western Front (1930), Snow White and the Seven Dwarfs (1937), Marriage Story (2019), The Big Sick (2017), On the Waterfront (1954), Star Wars: Episode VII - The Force Awakens (2015), An American in Paris (1951), The Best Years of Our Lives (1946), Metropolis (1927), Boyhood (2014), Gravity (2013), Leave No Trace (2018), The Maltese Falcon (1941), The Invisible Man (2020), 12 Years a Slave (2013), Once Upon a Time In Hollywood (2019), Argo (2012), Soul (2020), Ma Rainey's Black Bottom (2020), The Kid (1921), Manchester by the Sea (2016), Nosferatu, a Symphony of Horror (Nosferatu, eine Symphonie des Grauens) (Nosferatu the Vampire) (1922), The Adventures of Robin Hood (1938), La La Land (2016), North by Northwest (1959), Laura (1944), Spider-Man: Far From Home (2019), Incredibles 2 (2018), Zootopia (2016), Alien (1979), King Kong (1933), Shadow of a Doubt (1943), Call Me by Your Name (2018), Psycho (1960), 1917 (2020), L.A. Confidential (1997), The Florida Project (2017), War for the Planet of the Apes (2017), Paddington 2 (2018), A Hard Day's Night (1964), Widows (2018), Never Rarely Sometimes Always (2020), Baby Driver (2017), Spider-Man: Homecoming (2017), The Godfather, Part II (1974), The Battle of Algiers (La Battaglia di Algeri) (1967), View All, View All]]
CodePudding user response:
You can use pandas to read the table data directly:
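from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.rottentomatoes.com/top/bestofrt/'
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')  # requires the lxml package
table = soup.select_one('.table')  # first element with class "table"
# Recent pandas versions expect a file-like object rather than a raw HTML string
table_data = pd.read_html(StringIO(str(table)))[0]
print(table_data)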
Output:
     Rank  ...  No. of Reviews
0     1.0  ...              98
1     2.0  ...             121
2     3.0  ...             160
3     4.0  ...             109
4     5.0  ...             525
..    ...  ...             ...
95   96.0  ...             236
96   97.0  ...             396
97   98.0  ...             395
98   99.0  ...             114
99  100.0  ...              89

[100 rows x 4 columns]
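As a side note on why the code in the question printed every link on the page: inside the loop it calls soup.find_all(...), which searches the whole document, and it appends the tags themselves rather than their text. The following is a minimal sketch of the scoped version. It runs on a tiny inline HTML sample whose markup is an assumption (not the real page), and the helper name numbered_titles is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one navigation link outside the table
# and two ranked rows inside it (this markup is an assumption, not the real page).
SAMPLE_HTML = """
<a class="unstyled articleLink" href="/browse">Top Box Office</a>
<table class="table">
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1.</td><td><a class="unstyled articleLink" href="/m/1">
      It Happened One Night (1934)</a></td></tr>
  <tr><td>2.</td><td><a class="unstyled articleLink" href="/m/2">
      Citizen Kane (1941)</a></td></tr>
</table>
"""


def numbered_titles(html):
    """Return 'N. Title' strings for the links found inside the table only."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table")
    # Search from `table`, not `soup`, so links elsewhere on the page are skipped.
    links = table.find_all("a", class_="unstyled articleLink")
    # get_text(strip=True) returns the link text instead of the whole <a> tag.
    return [f"{i}. {a.get_text(strip=True)}" for i, a in enumerate(links, start=1)]


for line in numbered_titles(SAMPLE_HTML):
    print(line)
```

On the live page the same function could be fed result.text from requests, provided the table and link classes still match.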
CodePudding user response:
Reading the table via pandas.read_html(), as suggested by @F.Hoque, is probably the leaner approach, but you can also get your results with BeautifulSoup only.
Iterate over all <tr> of the <table>, pick the information from the tags via .text / .get_text(), and store it in a structured list of dicts:
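data = []
# [1:] skips the header row
for row in soup.select('table.table tr')[1:]:
    text = row.a.text.strip()
    data.append({
        'rank': row.td.text,
        'title': text.split(' (')[0],
        # take the last parenthesised part, so an alternate title is not mistaken for the year
        'releaseYear': text.split(' (')[-1][:-1]
    })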
Example
import requests
from bs4 import BeautifulSoup

url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = requests.get(url=url, headers=headers)
soup = BeautifulSoup(result.text, 'html.parser')

data = []
# [1:] skips the header row
for row in soup.select('table.table tr')[1:]:
    text = row.a.text.strip()
    data.append({
        'rank': row.td.text,
        'title': text.split(' (')[0],
        # take the last parenthesised part, so an alternate title is not mistaken for the year
        'releaseYear': text.split(' (')[-1][:-1]
    })

data  # displays the list in a notebook; use print(data) in a script
Output
[{'rank': '1.', 'title': 'It Happened One Night', 'releaseYear': '1934'},
{'rank': '2.', 'title': 'Citizen Kane', 'releaseYear': '1941'},
{'rank': '3.', 'title': 'The Wizard of Oz', 'releaseYear': '1939'},
{'rank': '4.', 'title': 'Modern Times', 'releaseYear': '1936'},
{'rank': '5.', 'title': 'Black Panther', 'releaseYear': '2018'},...]
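Some entries in the question's output, such as "Parasite (Gisaengchung) (2019)", contain more than one parenthesised part, so taking the second piece of a split on ' (' would return the alternate title rather than the year. Below is a minimal sketch of a stricter parser that only treats a trailing four-digit group as the year. The names YEAR_SUFFIX and split_title_year are hypothetical, and unlike the snippet above it keeps the alternate title as part of the title.

```python
import re

# Matches text ending in a four-digit year in parentheses, e.g. "Title (Alt) (2019)".
YEAR_SUFFIX = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)$")


def split_title_year(text):
    """Split 'Title (YYYY)' into (title, year); year is None if no year is found."""
    text = text.strip()
    match = YEAR_SUFFIX.match(text)
    if match is None:
        return text, None
    return match.group("title"), match.group("year")


print(split_title_year("It Happened One Night (1934)"))
print(split_title_year("Parasite (Gisaengchung) (2019)"))
```

In the loop above, this could replace the two split calls, for example title, year = split_title_year(row.a.text).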