I am new to Python and am trying to parse a table from the given website into a pandas DataFrame.
I am using the requests-html, requests, and BeautifulSoup modules.
Here is the website I would like to gather the table from:
CodePudding user response:
The data you see on the page is embedded inside a <script> tag in the form of JavaScript. You can use selenium, or parse the data manually from the page. I'm using the js2py module to decode the data:
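import re

import js2py
import pandas as pd
import requests

url = "https://www.aamc.org/data-reports/workforce/interactive-data/active-physicians-largest-specialties-2019"
html_doc = requests.get(url).text

# (?s) lets "." match newlines; the non-greedy group captures the JavaScript
# array assigned to $scope.schools, up to the first semicolon.
data = re.search(r"(?s)\$scope\.schools = (.*?);", html_doc).group(1)

# Evaluate the JavaScript literal and strip whitespace from every value.
data = [{k: v.strip() for k, v in d.items()} for d in js2py.eval_js(data)]

# Map the keys used in the page's script to readable column names.
columns = {
    "specialty": "Specialty",
    "one": "Total Active Physicians",
    "two": "Patient Care",
    "three": "Teaching",
    "four": "Research",
    "five": "Other",
}

df = pd.DataFrame(data).rename(columns=columns)
# to_markdown() needs the optional "tabulate" package.
print(df[list(columns.values())].to_markdown(index=False))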
Prints:
Specialty | Total Active Physicians | Patient Care | Teaching | Research | Other |
---|---|---|---|---|---|
All Specialties | 938,980 | 816,922 | 12,475 | 12,632 | 96,951 |
Allergy and Immunology | 4,900 | 4,221 | 54 | 268 | 357 |
Anatomic/Clinical Pathology | 12,643 | 8,711 | 385 | 520 | 3,027 |
Anesthesiology | 42,267 | 39,377 | 540 | 180 | 2,170 |
Cardiovascular Disease | 22,521 | 20,430 | 299 | 573 | 1,219 |
Child and Adolescent Psychiatry | 9,787 | 8,670 | 134 | 109 | 874 |
Critical Care Medicine | 13,093 | 11,146 | 178 | 111 | 1,658 |
Dermatology | 12,516 | 11,747 | 100 | 98 | 571 |
Emergency Medicine | 45,202 | 41,466 | 469 | 94 | 3,173 |
Endocrinology, Diabetes, and Metabolism | 7,994 | 6,439 | 155 | 533 | 867 |
Family Medicine/General Practice | 118,198 | 108,984 | 1,614 | 251 | 7,349 |
Gastroenterology | 15,469 | 14,007 | 186 | 289 | 987 |
General Surgery | 25,564 | 21,949 | 259 | 137 | 3,219 |
Geriatric Medicine | 5,974 | 5,029 | 105 | 106 | 734 |
Hematology and Oncology | 16,274 | 13,506 | 250 | 871 | 1,647 |
Infectious Disease | 9,687 | 7,448 | 287 | 701 | 1,251 |
Internal Medicine | 120,171 | 105,736 | 1,409 | 1,447 | 11,579 |
Internal Medicine/Pediatrics | 5,509 | 4,924 | 74 | 28 | 483 |
Interventional Cardiology | 4,407 | 3,956 | 22 | 6 | 423 |
Neonatal-Perinatal Medicine | 5,919 | 5,008 | 135 | 175 | 601 |
Nephrology | 11,407 | 9,964 | 140 | 316 | 987 |
Neurological Surgery | 5,748 | 5,246 | 52 | 32 | 418 |
Neurology | 14,146 | 11,896 | 245 | 629 | 1,376 |
Neuroradiology | 4,089 | 3,496 | 63 | 7 | 523 |
Obstetrics and Gynecology | 42,720 | 39,825 | 499 | 195 | 2,201 |
Ophthalmology | 19,312 | 17,859 | 147 | 126 | 1,180 |
Orthopedic Surgery | 19,069 | 18,097 | 120 | 57 | 795 |
Otolaryngology | 9,777 | 9,140 | 90 | 23 | 524 |
Pain Medicine and Pain Management | 5,871 | 5,459 | 38 | 9 | 365 |
Pediatric Anesthesiology (Anesthesiology) | 2,571 | 2,127 | 47 | 4 | 393 |
Pediatric Cardiology | 2,966 | 2,414 | 74 | 64 | 414 |
Pediatric Critical Care Medicine | 2,639 | 2,118 | 78 | 20 | 423 |
Pediatric Hematology/Oncology | 3,079 | 2,251 | 77 | 210 | 541 |
Pediatrics | 60,618 | 54,764 | 844 | 663 | 4,347 |
Physical Medicine and Rehabilitation | 9,767 | 8,920 | 69 | 38 | 740 |
Plastic Surgery | 7,317 | 6,938 | 55 | 20 | 304 |
Preventive Medicine | 6,675 | 4,218 | 146 | 457 | 1,854 |
Psychiatry | 38,792 | 33,776 | 562 | 735 | 3,719 |
Pulmonary Disease | 5,106 | 4,490 | 138 | 296 | 182 |
Radiation Oncology | 5,306 | 4,854 | 56 | 33 | 363 |
Radiology and Diagnostic Radiology | 28,025 | 24,748 | 423 | 153 | 2,701 |
Rheumatology | 6,265 | 5,333 | 108 | 255 | 569 |
Sports Medicine | 2,897 | 2,624 | 20 | 4 | 249 |
Sports Medicine (Orthopedic Surgery) | 2,903 | 2,737 | 9 | 157 | |
Thoracic Surgery | 4,479 | 4,105 | 45 | 40 | 289 |
Urology | 10,201 | 9,593 | 76 | 39 | 493 |
Vascular and Interventional Radiology | 3,877 | 3,425 | 27 | 3 | 422 |
Vascular Surgery | 3,943 | 3,586 | 48 | 13 | 296 |
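
Note that every value in the DataFrame above is still a string (for example "4,900"), so sums and sorting will not behave numerically. Below is a minimal sketch of one way to convert the count columns, assuming the column names from the answer. The three sample rows are copied from the printed table so the snippet runs without fetching the page, and the name numeric_cols is only illustrative.

```python
import pandas as pd

# Sample rows copied from the printed table; all values are strings,
# as they are after scraping. The last row has a blank cell.
df = pd.DataFrame(
    [
        {"Specialty": "Allergy and Immunology", "Total Active Physicians": "4,900",
         "Patient Care": "4,221", "Teaching": "54", "Research": "268", "Other": "357"},
        {"Specialty": "Anesthesiology", "Total Active Physicians": "42,267",
         "Patient Care": "39,377", "Teaching": "540", "Research": "180", "Other": "2,170"},
        {"Specialty": "Sports Medicine (Orthopedic Surgery)", "Total Active Physicians": "2,903",
         "Patient Care": "2,737", "Teaching": "9", "Research": "157", "Other": ""},
    ]
)

# Every column except the specialty name holds a count.
numeric_cols = [c for c in df.columns if c != "Specialty"]

for col in numeric_cols:
    # Drop the thousands separators, then parse the remaining text.
    cleaned = df[col].str.replace(",", "", regex=False)
    # errors="coerce" turns blank cells into missing values; the nullable
    # "Int64" dtype keeps whole numbers alongside those missing values.
    df[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")

print(df.dtypes)
print(df["Total Active Physicians"].sum())
```

After this step the count columns can be summed, sorted, or compared as numbers, and blank cells show up as missing values rather than empty strings.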