Web scraping a table without a class or id

Time:09-18

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
body_class = "research-covid-client"
url = 'https://www.multistate.us/research/covid/public'
response = requests.get(body_class)
soup = BeautifulSoup(response.text, 'html.parser')

tiff_table = requests.get(url, attrs={'class': body_class})
df = pd.read_html(str(tiff_table))
print(df)

I tried the code above, but I don't think it will work. I believe the solution is to iterate over each tr and put the rows into a DataFrame. I tried that as well and got an error about the size, and I'm not sure how to find the table size. I would really appreciate any help here.

CodePudding user response:

You can do that using only pandas.

Code:

import pandas as pd
dfs = pd.read_html('https://www.multistate.us/research/covid/public')
df = dfs[0]

print(df)

Output:

Jurisdiction  ...                                   Vaccine Mandates
0          Alabama  ...  Alabama prohibits government entities from iss...
1           Alaska  ...                                                NaN
2          Arizona  ...  Gov. Doug Ducey issued an executive order prev...
3         Arkansas  ...  Prohibits Arkansas state agencies or entities,...
4       California  ...  California requires that state employees and h...
5         Colorado  ...  All state employees must be vaccinated against...
6      Connecticut  ...  Connecticut mandates that all nursing home wor...
7         Delaware  ...  On August 12, 2021, Gov. Carney mandated that,...
8          Florida  ...  Florida prohibits governmental agencies, gover...
9          Georgia  ...  Georgia prohibits county or local governments ...
10          Hawaii  ...  All Hawaii state employees (state and county) ...
11           Idaho  ...  Gov. Brad Little signed an executive order pro...
12        Illinois  ...  Illinois implements a coronavirus vaccine or r...
13         Indiana  ...  Indiana prohibits the state government, any of...
14            Iowa  ...  On May 20, 2021, Governor Reynolds signed HF 8...
15          Kansas  ...  Kansas prohibits governmental agencies, buildi...
16        Kentucky  ...                                                NaN
17       Louisiana  ...                                                NaN
18           Maine  ...  Health care workers are required to be vaccina...
19        Maryland  ...  Maryland requires nursing home staff and healt...
20   Massachusetts  ...                                                NaN
21        Michigan  ...                                                NaN
22       Minnesota  ...  State employees must be fully vaccinated again...
23     Mississippi  ...  Mississippi does not currently have a vaccine ...
24        Missouri  ...  Missouri prohibits local, publicly funded enti...
25         Montana  ...  Gov. Gianforte issued an executive order prohi...
26        Nebraska  ...                                                NaN
27          Nevada  ...  Beginning August 15, 2021, state employees mus...
28   New Hampshire  ...  Prohibits local governments from mandating vac...
29      New Jersey  ...  New Jersey mandates that all public school tea...
30      New Mexico  ...  New Mexico requires all workers in healthcare ...
31        New York  ...  New York requires all healthcare workers in Ne...
32  North Carolina  ...  Beginning September 1, 2021, all Cabinet Agenc...
33    North Dakota  ...  In North Dakota, no government agency or busin...
34            Ohio  ...                                                NaN
35        Oklahoma  ...                                                NaN
36          Oregon  ...  Governor Brown announced that all executive br...
37    Pennsylvania  ...                                                NaN
38    Rhode Island  ...  On August 18, 2021, Gov. McKee mandated that a...
39  South Carolina  ...  South Carolina prohibits any agency, departmen...
40    South Dakota  ...  South Dakota prohibits state agencies, state b...
41       Tennessee  ...  Tennessee prohibits a state or local governmen...
42           Texas  ...  Texas prohibits any governmental entity from r...
43            Utah  ...                                                NaN
44         Vermont  ...  On September 8, 2021, Gov. Scott announced tha...
45        Virginia  ...  On August 5, 2021, Gov. Ralph Northan mandated...
46      Washington  ...  Governor Inslee ordered most state workers and...
47   West Virginia  ...                                                NaN
48       Wisconsin  ...                                                NaN
49         Wyoming  ...                                                NaN

[50 rows x 7 columns]
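
pd.read_html returns a list with one DataFrame per table on the page, so dfs[0] only works when the table you want happens to be first. A minimal sketch of picking the table by one of its column names instead; the helper name pick_table is hypothetical, and "Jurisdiction" is taken from the output above:

```python
import pandas as pd


def pick_table(tables, required_column):
    """Return the first DataFrame in `tables` that has `required_column`."""
    for table in tables:
        # Column labels can be non-strings (e.g. integers when a table has no header row)
        if required_column in [str(label) for label in table.columns]:
            return table
    raise ValueError(f"no table with a {required_column!r} column")


# Usage (needs network access and an HTML parser that pandas supports):
#   dfs = pd.read_html('https://www.multistate.us/research/covid/public')
#   df = pick_table(dfs, "Jurisdiction")
```

This avoids relying on the table's position, which can change if the page adds another table.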

CodePudding user response:

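There is an error in your code: you should call requests.get(url), not requests.get(body_class). Also, requests.get does not accept an attrs argument; that keyword belongs to BeautifulSoup's find and find_all.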

That being said, a general solution to your problem (e.g., extracting specific nodes from an HTML page) is to use XPath.

My take on this using Parsel:

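from io import StringIO

import requests
import pandas as pd
from parsel import Selector

url = "https://www.multistate.us/research/covid/public?level=state"
html = requests.get(url).text

# Select the first <table> element; no class or id is needed
selector = Selector(html)
table_html = selector.xpath('//table').get()

# pandas 2.1+ deprecates passing literal HTML, so wrap the string in StringIO
df = pd.read_html(StringIO(table_html))[0]
print(df)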

By playing with the URL and the XPath selector, you'll get where you need :)
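
For the row-by-row approach mentioned in the question, here is a minimal sketch with BeautifulSoup. It assumes the first table on the page is the one you want and that its first row holds the headers. The function name table_to_dataframe is hypothetical. Rows are padded or trimmed to the header width, which is one way to avoid the size mismatch error when some rows have fewer cells than others. It does not handle colspan or rowspan.

```python
from bs4 import BeautifulSoup
import pandas as pd


def table_to_dataframe(html):
    """Build a DataFrame from the first <table> in `html`, one <tr> at a time."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # no class or id needed: take the first table
    if table is None:
        raise ValueError("no <table> element found")

    rows = []
    for tr in table.find_all("tr"):
        # Collect the text of every header or data cell in this row
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    if not rows:
        raise ValueError("table has no rows")

    header, body = rows[0], rows[1:]
    width = len(header)
    # Pad short rows with None and trim long ones so every row matches the header
    body = [(row + [None] * width)[:width] for row in body]
    return pd.DataFrame(body, columns=header)


# Usage (needs network access):
#   import requests
#   html = requests.get("https://www.multistate.us/research/covid/public").text
#   df = table_to_dataframe(html)
```

The table size never has to be known up front: the number of columns comes from the header row and the number of rows from however many tr elements are found.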
