I'm a web-scraping beginner and am trying to scrape this webpage: https://profiles.doe.mass.edu/statereport/ap.aspx
I'd like to be able to put in some settings at the top (like District, 2020-2021, Computer Science A, Female) and then download the resulting data for those settings.
Here's the code I'm currently using:
import requests
from bs4 import BeautifulSoup
url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get('https://profiles.doe.mass.edu/statereport/ap.aspx')
    soup = BeautifulSoup(r.text,"lxml")
    data = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    data["ctl00$ContentPlaceHolder1$ddReportType"]="DISTRICT",
    data["ctl00$ContentPlaceHolder1$ddYear"]="2021",
    data["ctl00$ContentPlaceHolder1$ddSubject"]="COMSCA",
    data["ctl00$ContentPlaceHolder1$ddStudentGroup"]="F",
    p = s.post(url,data=data)
When I print out p.text, I get a page with title '\t404 - Page Not Found\r\n' and message
<h2>We are unable to locate information at: <br /><br '
'/>http://profiles.doe.mass.edu:80/statereport/ap.aspxp?ASP.NET_SessionId=bxfgao54wru50zl5tkmfml00</h2>\r\n'
Here's what data looks like before I modify it:
{'__EVENTVALIDATION': '/wEdAFXz4796FFICjJ1Xc5ZOd9SwSHUlrrW 2y3gXxnnQf/b23Vhtt4oQyaVxTPpLLu5SKjKYgCipfSrKpW6jkHllWSEpW6/zTHqyc3IGH3Y0p/oA6xdsl0Dt4O8D2I0RxEvXEWFWVOnvCipZArmSoAj/6Nog6zUh Jhjqd1LNep6GtJczTu236xw2xaJFSzyG xo1ygDunu7BCYVmh LuKcW56TG5L0jGOqySgRaEMolHMgR0Wo68k/uWImXPWE YrUtgDXkgqzsktuw0QHVZv7mSDJ31NaBb64Fs9ARJ5Argo FxJW/LIaGGeAYoDphL88oao07IP77wrmH6t1R4d88C8ImDHG9DY3sCDemvzhV wJcnU4a5qVvRziPyzqDWnj3tqRclGoSw0VvVK9w C3/577Gx5gqF21UsZuYzfP4emcqvJ7ckTiBk7CpZkjUjM6Z9XchlxNjWi1LkzyZ8QMP0MaNCP4CVYJfndopwFzJC7kI3W106YIA/xglzXrSdmq6/MDUCczeqIsmRQGyTOkQFH724RllsbZyHoPHYvoSAJilrMQf6BUERVN4ojysx3fz5qZhZE7DWaJAC882mXz4mEtcevFrLwuVPD7iB2v2mlWoK0S5Chw4WavlmHC 9BRhT36jtBzSPRROlXuc6P9YehFJOmpQXqlVil7C9OylT4Kz5tYzrX9JVWEpeWULgo9Evm ipJZOKY2YnC41xTK/MbZFxsIxqwHA3IuS10Q5laFojoB e FDCqazV9MvcHllsPv2TK3N1oNHA8ODKnEABoLdRgumrTLDF8Lh k Y4EROoHhBaO3aMppAI52v3ajRcCFET22jbEm/5 P2TG2dhPhYgtZ8M/e/AoXht29ixVQ1ReO/6bhLIM i48RTmcl76n1mNjfimB8r3irXQGYIEqCkXlUHZ/SNlRYyx3obJ6E/eljlPveWNidFHOaj FznOh264qDkMm7fF78WBO2v0x or1WGijWDdQtRy9WRKXchYxUchmBlYm15YbBfMrIB7 77NJV M6uIVVnCyiDRGj oPXcTYxqSUCLrOMQyzYKJeu8/hWD0gOdKeoYUdUUJq4idIk bLYy76sI/N2aK aXZo/JPQ 23gTHzIlyi4Io7O6kXaULPs8rfo8hpkH1qXyKb/rP2VJBNWgyp8jOMx9px m4/e2Iecd86E4eN4Rk6OIiwqGp dMdgntXu5ruRHb1awPlVmDw92dL1P0b0XxJW7EGfMzyssMDhs1VT6K6iMUTHbuXkNGaEG1dP1h4ktnCwGqDLVutU6UuzT6i4nfqnvFjGK9 7Ze8qWIl8SYyhmvzmgpLjdMuF9CYMQ2Aa79HXLKFACsSSm0dyiU1/ZGyII2Fvga9o nVV1jZam3LkcAPaXEKwEyJXfN/DA7P4nFAaQ QP 2bSgrcw /dw 86OhPyG88qyJwqZODEXE1WB5zSOUywGb1/Xed7wq9WoRs6v8rAK5c/2iH7YLiJ4mUVDo 7WCKrzO5 Hsyah3frMKbheY1acRmSVUzRgCnTx7jvcLGR9Jbt6TredqZaWZBrDFcntdg7EHd7imK5PqjUld3iCVjdyO yLKUkMKiFD85G3vEferg/Q/TtfVBqeTU0ohP9d CsKOmV/dxVYWEtBcfa9KiN6j4N8pP7 3iUOhajojZ8jV98kxT0zPZlzkpqI4SwR6Ys8d2RjIi5K oQul4pL5u zZvX0lsLP9Jl7FeVTfBvST67T6ohz8dl9gBfmmbwnT23SyuFSUGd6ZGaKE 9kKYmuImW7w3ePs7C70yDWHpIpxP/IJ4GHb36LWto2g3Ld3goCQ4fXPu7C4iTiN6b5WUSlJJsWGF4eQkJue8=',
'__VIEWSTATE': '/wEPDwUKLTM0NzY4OTQ4NmRkDwwPzTpuna yxVhQxpRF4n2 zYKQtotwRPqzuCkRvyU=',
'__VIEWSTATEGENERATOR': '2B6F8D71',
'ctl00$ContentPlaceHolder1$btnViewReport': 'View Report',
'ctl00$ContentPlaceHolder1$hfExport': 'ViewReport',
'leftNavId': '11241',
'quickSearchValue': '',
'runQuickSearch': 'Y',
'searchType': 'QUICK',
'searchtext': ''}
Following suggestions from similar questions, I've tried playing around with the parameters, editing data in various ways (to emulate the POST request that I see in my browser when I navigate the site myself), and specifying an ASP.NET_SessionId, but to no avail.
How can I access the information from this website?
CodePudding user response:
This should be what you are looking for. I used bs4 to parse the HTML and find the table, then collected the rows into a dictionary keyed by column header to make the data easier to work with.
import requests
from bs4 import BeautifulSoup
url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # The first <table> on the page holds the report.
    table = soup.find_all('table')
    rows = table[0].find_all('tr')
    data = {}
    for row in rows:
        if row.find_all('th'):
            # Header row: start one empty list per column name.
            keys = row.find_all('th')
            for key in keys:
                data[key.text] = []
        else:
            # Data row: pair cells with headers by position (zip), which stays
            # correct even when two cells in a row hold the same text.
            values = row.find_all('td')
            for key, value in zip(keys, values):
                data[key.text].append(value.text)
    # Show the first ten values of every column.
    for key in data:
        print(key, data[key][:10])
    print('\n')
The output:
District Name ['Abington', 'Academy Of the Pacific Rim Charter Public (District)', 'Acton-Boxborough', 'Advanced Math and Science Academy Charter (District)', 'Agawam', 'Amesbury', 'Amherst-Pelham', 'Andover', 'Arlington', 'Ashburnham-Westminster']
District Code ['00010000', '04120000', '06000000', '04300000', '00050000', '00070000', '06050000', '00090000', '00100000', '06100000']
Tests Taken [' 100', ' 109', ' 1,070', ' 504', ' 209', ' 126', ' 178', ' 986', ' 893', ' 97']
Score=1 [' 16', ' 81', ' 12', ' 29', ' 27', ' 18', ' 5', ' 70', ' 72', ' 4']
Score=2 [' 31', ' 20', ' 55', ' 74', ' 65', ' 34', ' 22', ' 182', ' 149', ' 23']
Score=3 [' 37', ' 4', ' 158', ' 142', ' 55', ' 46', ' 37', ' 272', ' 242', ' 32']
Score=4 [' 15', ' 3', ' 344', ' 127', ' 39', ' 19', ' 65', ' 289', ' 270', ' 22']
Score=5 [' 1', ' 1', ' 501', ' 132', ' 23', ' 9', ' 49', ' 173', ' 160', ' 16']
% Score 1-2 [' 47.0', ' 92.7', ' 6.3', ' 20.4', ' 44.0', ' 41.3', ' 15.2', ' 25.6', ' 24.7', ' 27.8']
% Score 3-5 [' 53.0', ' 7.3', ' 93.7', ' 79.6', ' 56.0', ' 58.7', ' 84.8', ' 74.4', ' 75.3', ' 72.2']
Process finished with exit code 0
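The values printed above are still strings, with leading spaces and thousands separators (for example ' 1,070'). Here is a minimal sketch of converting them to numbers, assuming every cell is either an integer or a decimal; the helper name to_number is made up for this illustration, and an empty cell would raise ValueError.

```python
def to_number(text):
    """Convert a scraped cell such as ' 1,070' or ' 47.0' to int or float."""
    # Drop surrounding whitespace and the thousands separator.
    cleaned = text.strip().replace(",", "")
    # Cells with a decimal point become floats, everything else ints.
    return float(cleaned) if "." in cleaned else int(cleaned)

# Sample cells copied from the output above.
tests_taken = [' 100', ' 109', ' 1,070', ' 504']
pct_score_1_2 = [' 47.0', ' 92.7', ' 6.3']

print([to_number(v) for v in tests_taken])    # [100, 109, 1070, 504]
print([to_number(v) for v in pct_score_1_2])  # [47.0, 92.7, 6.3]
```

The District Code column is better left as text, since converting '00010000' to an integer would drop its leading zeros.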
CodePudding user response:
I was able to get this working by adapting the code from here. I'm not sure why editing the payload in this way made the difference, so I'd be grateful for any insights!
Here's my working code, using Pandas to parse out the tables:
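from io import StringIO
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    response = s.get(url)
    soup = BeautifulSoup(response.content, 'html5lib')
    # Form controls (names starting with ctl00) that already carry a value.
    data = {tag['name']: tag['value']
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')}
    # ASP.NET state fields such as __VIEWSTATE and __EVENTVALIDATION.
    state = {tag['name']: tag['value']
             for tag in soup.select('input[name^=__]')}
    payload = data.copy()
    payload.update(state)
    # Dropdown selections: District, 2020-2021, Computer Science A, Female.
    # Plain strings; a trailing comma here would turn each value into a 1-tuple.
    payload["ctl00$ContentPlaceHolder1$ddReportType"] = "DISTRICT"
    payload["ctl00$ContentPlaceHolder1$ddYear"] = "2021"
    payload["ctl00$ContentPlaceHolder1$ddSubject"] = "COMSCA"
    payload["ctl00$ContentPlaceHolder1$ddStudentGroup"] = "F"
    p = s.post(url, data=payload)
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string.
df = pd.read_html(StringIO(p.text))[0]
# read_html parses the codes as integers, so restore the leading zeros.
df["District Code"] = df["District Code"].astype(str).str.zfill(8)
# display() is provided by Jupyter/IPython; use print(df) in a plain script.
display(df)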
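On why the payload matters: one difference that can be read directly off the two snippets is that the question's selector, input[name], collects every input on the page, while the working code keeps only names starting with ctl00 or __. The sketch below applies that same prefix rule to the field names printed in the question, to show which fields get left out. The function name split_payload and the placeholder state values are made up for this illustration, and whether the dropped fields are what actually triggers the 404 is an assumption I have not confirmed against the server.

```python
# Field names copied from the payload printed in the question; the long
# state values are replaced by short placeholders for readability.
scraped = {
    "__EVENTVALIDATION": "<placeholder>",
    "__VIEWSTATE": "<placeholder>",
    "__VIEWSTATEGENERATOR": "2B6F8D71",
    "ctl00$ContentPlaceHolder1$btnViewReport": "View Report",
    "ctl00$ContentPlaceHolder1$hfExport": "ViewReport",
    "leftNavId": "11241",
    "quickSearchValue": "",
    "runQuickSearch": "Y",
    "searchType": "QUICK",
    "searchtext": "",
}

def split_payload(fields):
    """Split fields into (kept, dropped) using the prefixes from the working code."""
    kept, dropped = {}, {}
    for name, value in fields.items():
        # Mirrors input[name^=ctl00] (only when a value is present)
        # and input[name^=__] from the working code.
        is_control = name.startswith("ctl00") and bool(value)
        is_state = name.startswith("__")
        target = kept if (is_control or is_state) else dropped
        target[name] = value
    return kept, dropped

kept, dropped = split_payload(scraped)
print(sorted(dropped))
# ['leftNavId', 'quickSearchValue', 'runQuickSearch', 'searchType', 'searchtext']
```

The five dropped names all look like they belong to the site's quick-search box rather than the report form, so a plausible (but unverified) reading is that posting them alongside the report fields sends the request down a different path on the server.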