I am trying to tabulate data into three columns (title, release date and continuity) using pandas. I am trying to build my dataset by scraping the Released films section of this Wikipedia page, and I tried following the steps from this YouTube video.
Here is my code:
import requests as r
from bs4 import BeautifulSoup
import pandas as pd
response = r.get("https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies")
wiki_text = response.text
soup = BeautifulSoup(wiki_text, "html.parser")
table_soup = soup.find_all("table")
filtered_table_soup = [table for table in table_soup if table.th is not None]
required_table = None
for table in filtered_table_soup:
    if str(table.th.string).strip() == "Release date":
        required_table = table
        break
print(required_table)
Whenever I run the code, it always prints None instead of the table with the Release date header.
I am new to web scraping, by the way, so please go easy on me.
Thank you.
CodePudding user response:
Unless BS4 is a requirement, you can just use pandas to fetch all HTML tables on that page. It builds a DataFrame from each table and returns them in a list. You can then loop through the list to find the table of interest.
import pandas as pd
url = r"https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
tables = pd.read_html(url) # Returns list of all tables on page
required_table = None
for tab in tables:
    if "Release date" in tab.columns:
        required_table = tab
        break  # stop at the first matching table
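As a small offline check of the same idea, the sketch below runs pd.read_html on a made-up two-table HTML string instead of the live page. The html string and its contents are illustrative assumptions, not the real Wikipedia markup, and the sketch assumes an HTML parser that pandas can use (for example lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: two tables, and only the
# second one has a "Release date" column.
html = """
<table>
  <tr><th>Year</th><th>Films</th></tr>
  <tr><td>2007</td><td>1</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Release date</th><th>Continuity</th></tr>
  <tr><td>Superman: Doomsday</td><td>September 21, 2007</td><td>Standalone</td></tr>
  <tr><td>Wonder Woman</td><td>March 3, 2009</td><td>Standalone</td></tr>
</table>
"""

# One DataFrame per <table>; StringIO wraps the literal markup.
tables = pd.read_html(StringIO(html))

required_table = None
for tab in tables:
    if "Release date" in tab.columns:
        required_table = tab
        break  # stop at the first matching table

# Keep only the three columns the question asks for.
print(required_table[["Title", "Release date", "Continuity"]])
```

On the real page the same loop should work as long as the table's header row is parsed into flat column names.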
CodePudding user response:
It's actually really simple:
The table is the second <table> on the page, so index into the list that pd.read_html returns to get the correct table:
import pandas as pd
URL = "https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
df = pd.read_html(URL, header=0)[1]
print(df.to_string())
Prints (truncated):
                                  Title        Release date                   Continuity                             Adapted from
0                    Superman: Doomsday  September 21, 2007                   Standalone                  "The Death of Superman"
1      Justice League: The New Frontier   February 26, 2008                   Standalone                     DC: The New Frontier
2                 Batman: Gotham Knight        July 8, 2008   Nolanverse (unofficial)[2]        Batman: "The Batman Nobody Knows"
3                          Wonder Woman       March 3, 2009                   Standalone         Wonder Woman: "Gods and Mortals"
4           Green Lantern: First Flight       July 28, 2009                   Standalone                                      NaN
5       Superman/Batman: Public Enemies  September 29, 2009           Superman/Batman[3]        Superman/Batman: "Public Enemies"
6  Justice League: Crisis on Two Earths   February 23, 2010  Crisis on Two Earths / Doom  "Crisis on Earth-Three!" / JLA: Earth 2
7            Batman: Under the Red Hood        July 7, 2010                   Standalone                 Batman: "Under the Hood"
...
Or, if you want to specifically use BeautifulSoup, you can use a CSS selector to select the second table:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup
URL = "https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
# find the second table
table = soup.select_one("table:nth-of-type(2)")
# wrap the markup in StringIO: newer pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(str(table)))[0]
print(df.to_string())
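On why the loop in the question ends with None: table.th is only the first <th> inside a table, and judging by the output above that cell is "Title" in the Released films table, so the comparison with "Release date" never succeeds. A minimal sketch of checking every header cell instead, run on a made-up HTML string rather than the live page (the find_table_with_header helper and the sample markup are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page is much larger.
html = """
<table><tr><td>no header cells here</td></tr></table>
<table>
  <tr><th>Title</th><th>Release date</th><th>Continuity</th></tr>
  <tr><td>Wonder Woman</td><td>March 3, 2009</td><td>Standalone</td></tr>
</table>
"""


def find_table_with_header(soup, wanted):
    """Return the first <table> that has a <th> whose text equals `wanted`."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if wanted in headers:
            return table
    return None


soup = BeautifulSoup(html, "html.parser")

# The first <th> of the second table is "Title", which is why comparing
# table.th with "Release date" never matches.
first_th = soup.find_all("table")[1].th.get_text(strip=True)
print(first_th)

required_table = find_table_with_header(soup, "Release date")
print(required_table is not None)
```

Using get_text(strip=True) rather than .string also avoids getting None back when a header cell contains nested tags such as links.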
CodePudding user response:
Try:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# select the table that directly follows the "Released films" heading
table = soup.select_one('h2:has(#Released_films) + table')
header = [th.text.strip() for th in table.select('th')]
data = []
for row in table.select('tr:has(td)'):
    tds = [td.text.strip() for td in row.select('td')]
    data.append(tds)
print(('{:<45}'*4).format(*header))
print('-' * (45*4))
for row in data:
    print(('{:<45}'*len(row)).format(*row))
Prints:
Title                                        Release date                                 Continuity                                   Adapted from
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Superman: Doomsday                           September 21, 2007                           Standalone                                   "The Death of Superman"
Justice League: The New Frontier             February 26, 2008                            Standalone                                   DC: The New Frontier
Batman: Gotham Knight                        July 8, 2008                                 Nolanverse (unofficial)[2]                   Batman: "The Batman Nobody Knows"
Wonder Woman                                 March 3, 2009                                Standalone                                   Wonder Woman: "Gods and Mortals"
Green Lantern: First Flight                  July 28, 2009                                Standalone
Superman/Batman: Public Enemies              September 29, 2009                           Superman/Batman[3]                           Superman/Batman: "Public Enemies"
Justice League: Crisis on Two Earths         February 23, 2010                            Crisis on Two Earths / Doom                  "Crisis on Earth-Three!" / JLA: Earth 2
Batman: Under the Red Hood                   July 7, 2010                                 Standalone                                   Batman: "Under the Hood"
Superman/Batman: Apocalypse                  September 28, 2010                           Superman/Batman[3]                           Superman/Batman: "The Supergirl from Krypton"
All-Star Superman                            February 22, 2011                            Standalone                                   All-Star Superman
Green Lantern: Emerald Knights               July 7, 2011                                 Standalone                                   "New Blood" / "What Price Honor?" / "Mogo Doesn't Socialize" / "Tygers"
Batman: Year One                             October 18, 2011                             Year One / Dark Knight Returns[4][5]         Batman: Year One
Justice League: Doom                         February 28, 2012                            Crisis on Two Earths / Doom                  JLA: "Tower of Babel"
Superman vs. The Elite                       June 12, 2012                                Standalone                                   "What's So Funny About Truth, Justice & the American Way?"
...and so on.
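Since the question asks for a table with only the title, release date and continuity, the header and data lists can be loaded into pandas afterwards. A minimal sketch, using two rows copied from the output above in place of the scraped lists and assuming every row has exactly four cells. Stripping the [2]-style footnote markers with a regex and parsing the dates are my own additions, not something the code above does:

```python
import pandas as pd

# Stand-ins for the `header` and `data` lists built by the scraping loop.
header = ["Title", "Release date", "Continuity", "Adapted from"]
data = [
    ["Superman: Doomsday", "September 21, 2007", "Standalone",
     '"The Death of Superman"'],
    ["Batman: Gotham Knight", "July 8, 2008", "Nolanverse (unofficial)[2]",
     'Batman: "The Batman Nobody Knows"'],
]

df = pd.DataFrame(data, columns=header)

# Keep only the three columns the question asks for.
df = df[["Title", "Release date", "Continuity"]].copy()

# Remove Wikipedia footnote markers such as "[2]" from the Continuity column.
df["Continuity"] = df["Continuity"].str.replace(r"\[\d+\]", "", regex=True)

# Optionally parse the dates; the format matches strings like "September 21, 2007".
df["Release date"] = pd.to_datetime(df["Release date"], format="%B %d, %Y")

print(df.to_string(index=False))
```

If some rows turn out to have fewer cells than the header, the DataFrame constructor will raise an error, so those rows would need padding first.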