Can't Scrape Dynamically Loaded HTML Table in an Aspx Website

Time:06-23

I am trying to scrape some data from the Arizona Medical Board. I search for Anesthesiology in the specialty dropdown list and find that the table (with the links to the profiles I want to scrape) is dynamically loaded into the page. When I hit the 'specialty search' button, a POST request is made to the server, and the HTML table is returned in the response. I have tried simulating this POST request to see if I receive the HTML table, so that I can then parse it with bs4. Is this possible, and if so, am I even on the right track?

I have tried to include the form data I found in the network tab of the developer tools, but I am not sure if this is the right data, or if I am missing some data here or in the headers.

Please let me know if I need to clarify, I understand this may not be worded the best. Thank you!

import requests
# import re
import formdata  # local module holding the form values copied from the network tab

session = requests.Session()

url = "https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/public/WebVerificationSearch.aspx?q=azmd&t=20220622123512"

headers = {'User-Agent': 'My-Agent-Placeholder'}

res = session.get(url, headers=headers)

print("Response: {}".format(res))

payload = {
  "__VIEWSTATE": formdata.state,
  "__VIEWSTATEGENERATOR": formdata.generator,
  "__EVENTVALIDATION" : formdata.validation,
  "ctl00$ContentPlaceHolder1$Name": "rbName1",
  "ctl00$ContentPlaceHolder1$txtLastName" : '', 
  "ctl00$ContentPlaceHolder1$txtFirstName" : '',
  "ctl00$ContentPlaceHolder1$License": "rbLicense1",
  "ctl00$ContentPlaceHolder1$txtLicNum": '',
  "ctl00$ContentPlaceHolder1$Specialty": "rbSpecialty1",
  "ctl00$ContentPlaceHolder1$ddlSpecialty": '12155',
  "ctl00$ContentPlaceHolder1$ddlCounty": '15910',
  "ctl00$ContentPlaceHolder1$txtCity": '',
  "__EVENTTARGET": "ctl00$ContentPlaceHolder1$btnSpecial",
  "__EVENTARGUMENT": ''  
}

# params = {"q": "azmd",
# "t": "20220622123512"}

# #url = "https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/Public/Results.aspx"

res = session.post(url, data=payload, headers=headers)
print("Post response: {}".format(res))
print(res.text)

# res = requests.get('https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/Public/Results.aspx', headers=headers)

CodePudding user response:

Try:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
}

url = "https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/public/WebVerificationSearch.aspx?q=azmd&t=20220622082816"

with requests.Session() as s:
    soup = BeautifulSoup(s.get(url, headers=headers).content, "html.parser")

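    # Collect every named form field from the page, including the hidden
    # ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...)
    data = {}
    for inp in soup.select("input[name]"):
        data[inp["name"]] = inp.get("value", "")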

    data["ctl00$ContentPlaceHolder1$Name"] = "rbName1"
    data["ctl00$ContentPlaceHolder1$License"] = "rbLicense1"
    data["ctl00$ContentPlaceHolder1$Specialty"] = "rbSpecialty1"
    data["ctl00$ContentPlaceHolder1$ddlSpecialty"] = "12155"
    data["ctl00$ContentPlaceHolder1$ddlCounty"] = "15910"

    data["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$btnSpecial"
    data["__EVENTARGUMENT"] = ""

    soup = BeautifulSoup(
        s.post(url, data=data, headers=headers).content, "html.parser"
    )
    for row in soup.select("tr:has(a)"):
        name = row.select("td")[-1].text
        link = row.a["href"]

        print("{:<35} {}".format(name, link))

Prints:

Abad-Pelsang, Elma A.               https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1620623&licID=121089&licType=1
Abadi, Bilal Ibrahim                https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1755530&licID=525771&licType=1
Abbasian, Mohammad                  https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1635449&licID=492537&licType=1
Abdel-Al, Naglaa Z.                 https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1637612&licID=175204&licType=1
Abedi, Babak                        https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1641219&licID=169009&licType=1
Abel, Martin D.                     https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1624271&licID=510929&licType=1
Abenstein, John P.                  https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1622930&licID=502482&licType=1

...and so on.
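
The part that most likely makes the difference compared to the code in the question is that the hidden fields (__VIEWSTATE, __EVENTVALIDATION and so on) are read from the page returned by the GET request in the same session, instead of being copied once from the browser. Here is a rough sketch of just that step, run against a made-up HTML snippet so it works offline. It uses the standard library's html.parser instead of BeautifulSoup, and FormInputCollector, build_payload and SAMPLE_HTML are names I made up for illustration, not anything from the site or from requests:

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect name/value pairs from every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:  # inputs without a name are not submitted with the form
            self.fields[name] = attr.get("value") or ""


def build_payload(page_html, overrides):
    """Start from the fields found in the page, then apply the search choices."""
    collector = FormInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload.update(overrides)
    return payload


# Made-up stand-in for the search page; the real page has many more fields.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtCity" />
  <input type="button" value="no name" />
</form>
"""

payload = build_payload(
    SAMPLE_HTML,
    {
        "ctl00$ContentPlaceHolder1$ddlSpecialty": "12155",
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$btnSpecial",
    },
)

for key, value in payload.items():
    print(key, "=", repr(value))
```

With the real site you would feed the text of the GET response into build_payload and pass the result as data= to the POST, as in the code above.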