Failed to extract html table data using Beautiful Soup in Python


I am trying to replicate this code to make some graphs, but I failed to get the CSV file. I ran the exact same code, but to no avail, as it prints an empty dataframe.

The code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import geopandas as gpd
from prettytable import PrettyTable

url = 'https://www.mohfw.gov.in/'
# make a GET request to fetch the raw HTML content
web_content = requests.get(url).content

# parse the html content
soup = BeautifulSoup(web_content, "html.parser")

# remove newlines and trim surrounding whitespace from each cell
extract_contents = lambda row: [x.text.replace('\n', '').strip() for x in row]

# find all table rows and data cells within
stats = [] 
all_rows = soup.find_all('tr')
for row in all_rows:
    stat = extract_contents(row.find_all('td')) 
    # the rows we want have exactly 5 data cells
    if len(stat) == 5:
        stats.append(stat)

#now convert the data into a pandas dataframe for further processing
new_cols = ["Sr.No", "States/UT","Confirmed","Recovered","Deceased"]
state_data = pd.DataFrame(data = stats, columns = new_cols)
state_data.head()

Any help is appreciated.

CodePudding user response:

The table data is loaded dynamically by JavaScript: everything from tbody down to the tr elements is generated at runtime, and only the table tag with its class statetable table table-striped exists in the static HTML DOM. That is why requests plus bs4 alone returns nothing. To render the table you can drive a real browser with an automation tool such as Selenium and then parse the page with bs4. You will need to pip install selenium and webdriver-manager (which provides ChromeDriverManager).

from bs4 import BeautifulSoup
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('https://www.mohfw.gov.in/')
driver.maximize_window()
time.sleep(3)

soup = BeautifulSoup(driver.page_source,'lxml')

for tr in soup.select('.statetable.table.table-striped tbody tr')[0:37]:
    tr=list(tr.stripped_strings)
    print(tr)

Output:

['1', 'Andaman and Nicobar Islands', '1', '9905', '129']
['2', 'Andhra Pradesh', '16', '2', '2304929', '3', '14730']
['3', 'Arunachal Pradesh', '0', '64199', '296']
['4', 'Assam', '9', '716215', '7986']
['5', 'Bihar', '32', '2', '818261', '4', '12256']
['6', 'Chandigarh', '65', '5', '90811', '6', '1165']
['7', 'Chhattisgarh', '30', '2', '1138195', '4', '14034']
['8', 'Dadra and Nagar Haveli and Daman and Diu', '0', '11437', '4']
['9', 'Delhi', '5250', '418', '1848526', '1070', '26172', '2', '2']
['10', 'Goa', '43', '4', '241540', '2', '3832']
['11', 'Gujarat', '99', '6', '1213263', '20', '10943']
['12', 'Haryana', '2238', '217', '978537', '363', '10619']
['13', 'Himachal Pradesh', '62', '4', '280596', '12', '4134']
['14', 'Jammu and Kashmir', '65', '7', '449212', '4751']
['15', 'Jharkhand', '28', '429876', '1', '5317']
['16', 'Karnataka****', '1751', '4', '3905513', '116', '40099', '42', '42']
['17', 'Kerala***', '2770', '33', '6468929', '300', '68966', '14', '14']
['18', 'Ladakh', '3', '28014', '228']
['19', 'Lakshadweep', '0', '11350', '52']
['20', 'Madhya Pradesh', '95', '12', '1030550', '17', '10735']
['21', 'Maharashtra', '961', '6', '7728628', '157', '147840', '2', '2']
['22', 'Manipur', '17', '135083', '3', '2120']
['23', 'Meghalaya', '5', '2', '92199', '1593']
['24', 'Mizoram', '744', '14', '225896', '85', '696']
['25', 'Nagaland', '0', '34728', '760']
['26', 'Odisha', '133', '1278767', '7', '9124']
['27', 'Puducherry', '8', '1', '163815', '1962']
['28', 'Punjab', '178', '6', '741612', '20', '17748']
['29', 'Rajasthan', '255', '13', '1273699', '22', '9552']
['30', 'Sikkim', '3', '1', '38696', '452']
['31', 'Tamil Nadu', '488', '41', '3415316', '32', '38025']
['32', 'Telangana', '296', '20', '787539', '20', '4111']
['33', 'Tripura', '1', '99957', '922']
['34', 'Uttarakhand', '460', '10', '429299', '3', '7693']
['35', 'Uttar Pradesh', '1394', '10', '2048879', '208', '23506']
['36', 'West Bengal', '301', '13', '1996651', '21', '21201']
['Total#', '17801', '821', '42530622', '2496', '523753', '4', '56', '60']
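To recover the five-column DataFrame the question was building, the scraped rows can be passed through the same length-5 filter used in the question (rows that carry extra day-over-day change numbers, such as Andhra Pradesh above, would need additional handling). A minimal sketch using a few of the rows above as sample input; the column labels reflect the site's Active/Cured/Deaths layout and are an assumption:

```python
import pandas as pd

# sample rows as produced by the selenium loop above
rows = [
    ['1', 'Andaman and Nicobar Islands', '1', '9905', '129'],
    ['2', 'Andhra Pradesh', '16', '2', '2304929', '3', '14730'],
    ['3', 'Arunachal Pradesh', '0', '64199', '296'],
]

# keep only rows without the extra change-indicator numbers
clean = [r for r in rows if len(r) == 5]

# column names are assumed from the site's table layout
cols = ['Sr.No', 'States/UT', 'Active', 'Cured', 'Deaths']
df = pd.DataFrame(clean, columns=cols)
print(df)
```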

CodePudding user response:

You can get all that data from the site's JSON endpoint. You will need to map some column names and then derive the changes since yesterday from the returned columns; columns prefixed with new_ hold today's values.

import pandas as pd
import requests

r = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(r)
df
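The column mapping and the "change since yesterday" calculation mentioned above can be sketched as follows. The field names here (state_name, active, new_active, and so on) are assumptions about the shape of datanew.json at the time of writing; inspect the actual response before relying on them:

```python
import pandas as pd

# sample records mimicking the assumed shape of datanew.json
# (field names are assumptions; check the real response first)
records = [
    {'state_name': 'Kerala', 'active': 2770, 'new_active': 2803,
     'cured': 6468929, 'new_cured': 6469229,
     'death': 68966, 'new_death': 68980},
    {'state_name': 'Delhi', 'active': 5250, 'new_active': 5668,
     'cured': 1848526, 'new_cured': 1849596,
     'death': 26172, 'new_death': 26174},
]
df = pd.DataFrame(records)

# columns prefixed new_ hold today's figures,
# so the difference gives the change since yesterday
for col in ('active', 'cured', 'death'):
    df[f'{col}_change'] = df[f'new_{col}'] - df[col]

print(df[['state_name', 'active_change', 'cured_change', 'death_change']])
```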