from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
body_class = "research-covid-client"
url = 'https://www.multistate.us/research/covid/public'
response = requests.get(body_class)
soup = BeautifulSoup(response.text, 'html.parser')
tiff_table = requests.get(url, attrs={'class': body_class})
df = pd.read_html(str(tiff_table))
print(df)
I tried the code above, but I don't think it will work. I believe the solution is to iterate over each tr and put the rows into a DataFrame. I tried that as well and got an error about the size, and I'm not sure how to find the table size. I would really appreciate any help.
CodePudding user response:
You can do that using only pandas.
Code:
import pandas as pd
dfs = pd.read_html('https://www.multistate.us/research/covid/public')
df = dfs[0]
print(df)
Output:
Jurisdiction ... Vaccine Mandates
0 Alabama ... Alabama prohibits government entities from iss...
1 Alaska ... NaN
2 Arizona ... Gov. Doug Ducey issued an executive order prev...
3 Arkansas ... Prohibits Arkansas state agencies or entities,...
4 California ... California requires that state employees and h...
5 Colorado ... All state employees must be vaccinated against...
6 Connecticut ... Connecticut mandates that all nursing home wor...
7 Delaware ... On August 12, 2021, Gov. Carney mandated that,...
8 Florida ... Florida prohibits governmental agencies, gover...
9 Georgia ... Georgia prohibits county or local governments ...
10 Hawaii ... All Hawaii state employees (state and county) ...
11 Idaho ... Gov. Brad Little signed an executive order pro...
12 Illinois ... Illinois implements a coronavirus vaccine or r...
13 Indiana ... Indiana prohibits the state government, any of...
14 Iowa ... On May 20, 2021, Governor Reynolds signed HF 8...
15 Kansas ... Kansas prohibits governmental agencies, buildi...
16 Kentucky ... NaN
17 Louisiana ... NaN
18 Maine ... Health care workers are required to be vaccina...
19 Maryland ... Maryland requires nursing home staff and healt...
20 Massachusetts ... NaN
21 Michigan ... NaN
22 Minnesota ... State employees must be fully vaccinated again...
23 Mississippi ... Mississippi does not currently have a vaccine ...
24 Missouri ... Missouri prohibits local, publicly funded enti...
25 Montana ... Gov. Gianforte issued an executive order prohi...
26 Nebraska ... NaN
27 Nevada ... Beginning August 15, 2021, state employees mus...
28 New Hampshire ... Prohibits local governments from mandating vac...
29 New Jersey ... New Jersey mandates that all public school tea...
30 New Mexico ... New Mexico requires all workers in healthcare ...
31 New York ... New York requires all healthcare workers in Ne...
32 North Carolina ... Beginning September 1, 2021, all Cabinet Agenc...
33 North Dakota ... In North Dakota, no government agency or busin...
34 Ohio ... NaN
35 Oklahoma ... NaN
36 Oregon ... Governor Brown announced that all executive br...
37 Pennsylvania ... NaN
38 Rhode Island ... On August 18, 2021, Gov. McKee mandated that a...
39 South Carolina ... South Carolina prohibits any agency, departmen...
40 South Dakota ... South Dakota prohibits state agencies, state b...
41 Tennessee ... Tennessee prohibits a state or local governmen...
42 Texas ... Texas prohibits any governmental entity from r...
43 Utah ... NaN
44 Vermont ... On September 8, 2021, Gov. Scott announced tha...
45 Virginia ... On August 5, 2021, Gov. Ralph Northan mandated...
46 Washington ... Governor Inslee ordered most state workers and...
47 West Virginia ... NaN
48 Wisconsin ... NaN
49 Wyoming ... NaN
[50 rows x 7 columns]
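The `...` in the printed output is only pandas shortening the display; the frame itself still holds all 7 columns. Here is a small sketch of how to check the table size and show every column. It uses a made-up stand-in frame (the cell values are placeholders, not data from the page), so it runs without a network connection:

```python
import pandas as pd

# Stand-in frame with placeholder values; with the real page, df = dfs[0].
df = pd.DataFrame(
    {
        "Jurisdiction": ["Alabama", "Alaska"],
        "Vaccine Mandates": ["placeholder text", None],
    }
)

n_rows, n_cols = df.shape  # table size as (rows, columns)
print(n_rows, n_cols)
print(list(df.columns))  # full list of column names

# Show all columns and the full cell text instead of the truncated "..." view.
with pd.option_context("display.max_columns", None, "display.max_colwidth", None):
    print(df)
```

The same `df.shape` call on the frame from the page should report the 50 rows and 7 columns shown in the summary line above.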
CodePudding user response:
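There is an error in your code: you should call requests.get(url), not requests.get(body_class).
Also, requests.get does not accept an attrs argument; that kind of filter belongs to BeautifulSoup, e.g. soup.find('table', attrs={'class': body_class}).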
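For reference, the approach described in the question (walk each tr and collect the cells) might be repaired along these lines. This is only a sketch under assumptions: it takes the first table in the HTML instead of filtering by class, since it is not confirmed that research-covid-client is a class on the table itself; rows_to_frame is a hypothetical helper name; and the sample HTML is a stand-in, not the real page. Padding short rows to the header width is one way to avoid a size-mismatch error when rows have different numbers of cells.

```python
import pandas as pd
from bs4 import BeautifulSoup


def rows_to_frame(html):
    """Build a DataFrame from the first <table> by iterating over its <tr> rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the first table is the one wanted
    if table is None:
        raise ValueError("no <table> found in the HTML")

    # One list of cell strings per row; header cells (<th>) are included.
    rows = [
        [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    rows = [r for r in rows if r]  # drop rows that contain no cells
    if not rows:
        raise ValueError("table has no rows")

    header, body = rows[0], rows[1:]
    width = len(header)
    # Pad or trim each row to the header width so every row has the same size.
    body = [(r + [None] * width)[:width] for r in body]
    return pd.DataFrame(body, columns=header)


# Stand-in HTML with placeholder values; the second row is one cell short.
sample_html = """
<table>
  <tr><th>Jurisdiction</th><th>Vaccine Mandates</th></tr>
  <tr><td>Alabama</td><td>placeholder text</td></tr>
  <tr><td>Alaska</td></tr>
</table>
"""

df = rows_to_frame(sample_html)
print(df.shape)
```

With the live page, the input would be requests.get(url).text, where url is the page address and not the class name.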
That said, a general way to extract specific nodes from an HTML page is to use XPath.
Here is my take on this using Parsel:
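import requests
import pandas as pd
from io import StringIO
from parsel import Selector
url = "https://www.multistate.us/research/covid/public?level=state"
html = requests.get(url).text
selector = Selector(text=html)
# Serialize the first <table> element on the page back to an HTML string
table_html = selector.xpath('//table').get()
# Recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(table_html))[0]
print(df)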
By adjusting the URL and the XPath selector, you can get to the data you need :)