Create a table with pandas using a dataset from a Wikipedia table

Time:01-05

I am trying to tabulate data into three columns (title, release date, and continuity) using pandas. I am trying to fetch my dataset by scraping the Released films section of this Wikipedia page, and I tried following the steps from this YouTube video.

Here is my code:

import requests as r
from bs4 import BeautifulSoup
import pandas as pd

response = r.get("https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies")

wiki_text = response.text
soup = BeautifulSoup(wiki_text, "html.parser")
table_soup = soup.find_all("table")
filtered_table_soup = [table for table in table_soup if table.th is not None]

required_table = None
for table in filtered_table_soup:
    if str(table.th.string).strip() == "Release date":
        required_table = table
        break
print(required_table)   

Whenever I run the code, it always prints None instead of the table that has the "Release date" header.

I am new to web scraping, by the way, so please go easy on me.

Thank you.

CodePudding user response:

Unless BS4 is a requirement, you can just use pandas to fetch all HTML tables on that page. pd.read_html builds a DataFrame from each table and returns them in a list. You can then loop through the list to find the table of interest.

import pandas as pd

url = "https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
tables = pd.read_html(url)  # Returns a list with one DataFrame per table on the page

required_table = None
for tab in tables:
    if "Release date" in tab.columns:
        required_table = tab
        break  # Stop at the first table that has a "Release date" column
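
As a possible shortcut, pd.read_html also accepts a match argument that keeps only the tables containing the given text, so the loop may not be needed. The sketch below runs on a small made-up HTML string (SAMPLE_HTML is hypothetical, not the real page), so it works without network access; with the real URL the result depends on the page's current markup.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the Wikipedia page: two tiny tables.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Count</th></tr>
  <tr><td>2007</td><td>1</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Release date</th><th>Continuity</th></tr>
  <tr><td>Superman: Doomsday</td><td>September 21, 2007</td><td>Standalone</td></tr>
</table>
"""

# match= keeps only tables whose text matches the string (or regex).
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Release date")
required_table = tables[0]

print(required_table.columns.tolist())
```

If no table matches, read_html raises a ValueError rather than returning an empty list, so a try/except may be worth adding when the page layout is not under your control.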

CodePudding user response:

It's actually really simple: the table is the second <table> on the page, so index into the list returned by pd.read_html to get the correct table:

import pandas as pd


URL = "https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
df = pd.read_html(URL, header=0)[1]

print(df.to_string())

Prints (truncated):

                                        Title        Release date                            Continuity                                                                                    Adapted from
0                              Superman: Doomsday  September 21, 2007                            Standalone                                                                         "The Death of Superman"
1                Justice League: The New Frontier   February 26, 2008                            Standalone                                                                            DC: The New Frontier
2                           Batman: Gotham Knight        July 8, 2008            Nolanverse (unofficial)[2]                                                               Batman: "The Batman Nobody Knows"
3                                    Wonder Woman       March 3, 2009                            Standalone                                                                Wonder Woman: "Gods and Mortals"
4                     Green Lantern: First Flight       July 28, 2009                            Standalone                                                                                             NaN
5                 Superman/Batman: Public Enemies  September 29, 2009                    Superman/Batman[3]                                                               Superman/Batman: "Public Enemies"
6            Justice League: Crisis on Two Earths   February 23, 2010           Crisis on Two Earths / Doom                                                         "Crisis on Earth-Three!" / JLA: Earth 2
7                      Batman: Under the Red Hood        July 7, 2010                            Standalone                                                                        Batman: "Under the Hood"
8                 

Or, if you want to specifically use BeautifulSoup, you can use a CSS selector to select the second table:

from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"

soup = BeautifulSoup(requests.get(URL).text, "html.parser")

# select the first <table> that is the second table among its siblings
table = soup.select_one("table:nth-of-type(2)")

# wrap the HTML in StringIO; newer pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(str(table)))[0]
print(df.to_string())
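
For what it's worth, a likely reason the loop in the question ends with None: table.th is only the first <th> inside a table, and judging by the output above the first header of the wanted table is "Title", not "Release date". A minimal sketch of checking every header cell instead, again on a made-up HTML string (SAMPLE_HTML and find_table_with_header are hypothetical names used only for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: two tiny tables.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Count</th></tr>
  <tr><td>2007</td><td>1</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Release date</th><th>Continuity</th></tr>
  <tr><td>Superman: Doomsday</td><td>September 21, 2007</td><td>Standalone</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def find_table_with_header(soup, wanted):
    """Return the first <table> with a header cell equal to `wanted`, else None."""
    for table in soup.find_all("table"):
        # Look at every <th>, not just the first one.
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if wanted in headers:
            return table
    return None


required_table = find_table_with_header(soup, "Release date")

# table.th is the first header cell only, which is why comparing it
# with "Release date" never succeeds for this table.
first_header = required_table.th.get_text(strip=True)
print(first_header)
```

Using get_text(strip=True) instead of .string also avoids getting None back when a header cell contains nested tags such as links.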

CodePudding user response:

Try:

import requests
from bs4 import BeautifulSoup


url = 'https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('h2:has(#Released_films) + table')

header = [th.text.strip() for th in table.select('th')]
data = []
for row in table.select('tr:has(td)'):
    tds = [td.text.strip() for td in row.select('td')]
    data.append(tds)

print(('{:<45}'*4).format(*header))
print('-' * (45*4))
for row in data:
    print(('{:<45}'*len(row)).format(*row))

Prints:

Title                                        Release date                                 Continuity                                   Adapted from                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Superman: Doomsday                           September 21, 2007                           Standalone                                   "The Death of Superman"                      
Justice League: The New Frontier             February 26, 2008                            Standalone                                   DC: The New Frontier                         
Batman: Gotham Knight                        July 8, 2008                                 Nolanverse (unofficial)[2]                   Batman: "The Batman Nobody Knows"            
Wonder Woman                                 March 3, 2009                                Standalone                                   Wonder Woman: "Gods and Mortals"             
Green Lantern: First Flight                  July 28, 2009                                Standalone                                                                                
Superman/Batman: Public Enemies              September 29, 2009                           Superman/Batman[3]                           Superman/Batman: "Public Enemies"            
Justice League: Crisis on Two Earths         February 23, 2010                            Crisis on Two Earths / Doom                  "Crisis on Earth-Three!" / JLA: Earth 2      
Batman: Under the Red Hood                   July 7, 2010                                 Standalone                                   Batman: "Under the Hood"                     
Superman/Batman: Apocalypse                  September 28, 2010                           Superman/Batman[3]                           Superman/Batman: "The Supergirl from Krypton"
All-Star Superman                            February 22, 2011                            Standalone                                   All-Star Superman                            
Green Lantern: Emerald Knights               July 7, 2011                                 Standalone                                   "New Blood" / "What Price Honor?" /  "Mogo Doesn't Socialize" / "Tygers"
Batman: Year One                             October 18, 2011                             Year One / Dark Knight Returns[4][5]         Batman: Year One                             
Justice League: Doom                         February 28, 2012                            Crisis on Two Earths / Doom                  JLA: "Tower of Babel"                        
Superman vs. The Elite                       June 12, 2012                                Standalone                                   "What's So Funny About Truth,  Justice & the American Way?"

...and so on.
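
Since the question asks for a pandas table with three columns, the header and data lists from this approach could be fed into a DataFrame. This is only a sketch: the two rows below are copied from the output above as sample input, and it assumes every row has the same number of cells as the header (rows with merged cells would need extra handling).

```python
import pandas as pd

# Sample values copied from the printed output above; in the real script
# `header` and `data` would come from the scraping loop.
header = ["Title", "Release date", "Continuity", "Adapted from"]
data = [
    ["Superman: Doomsday", "September 21, 2007", "Standalone",
     '"The Death of Superman"'],
    ["Justice League: The New Frontier", "February 26, 2008", "Standalone",
     "DC: The New Frontier"],
]

df = pd.DataFrame(data, columns=header)

# Keep only the three columns asked for in the question.
wanted = df[["Title", "Release date", "Continuity"]]
print(wanted.to_string(index=False))
```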