I would like to use Python to scrape a website whose page source contains no usable HTML content.
I have tried Selenium, but I would like to do it without Selenium because I have had difficulty installing it on my Raspberry Pi.
This is the page I am interested in scraping, but I can't seem to do it effectively. I have tried bs4 and requests, but the response has no HTML content for me to work with, and I can't find another library that can do this without Selenium.
import requests
r = requests.get('https://www.cea.gov.sg/aceas/public-register/sales/1?page=1&pageSize=10&sortAscFlag=true&sort=name&registrationNumber=R')
print(r.text)
This is a simplified version of what I have tried before.
D:\Codes\venv\Scripts\python.exe D:/Codes/requests_test.py
<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta name=description content="The Council for Estate Agencies is the government agency that regulates Singapore's real estate agency industry."><link rel=icon href=/aceas/assets/common/favicon.ico><title>ACEAS</title><script src=https://assets.wogaa.sg/scripts/wogaa.js></script><script>(function(w, d, s, l, i) {
w[l] = w[l] || [];
w[l].push({ "gtm.start": new Date().getTime(), event: "gtm.js" });
var f = d.getElementsByTagName(s)[0],
j = d.createElement(s),
dl = l != "dataLayer" ? "&l=" + l : "";
j.async = true;
j.src = "https://www.googletagmanager.com/gtm.js?id=" + i + dl;
f.parentNode.insertBefore(j, f);
})(window, document, "script", "dataLayer", "GTM-53ZNG4N");</script><link rel=preload as=style href=/aceas/assets/comp/vendor-style.css><link rel=stylesheet href=/aceas/assets/comp/vendor-style.css><link rel=preload as=style href=/aceas/assets/comp/index.css><link rel=stylesheet href=/aceas/assets/comp/index.css><link rel=preload as=style href=/aceas/assets/comp/formBase-minified.css><link rel=stylesheet href=/aceas/assets/comp/formBase-minified.css><link rel=preload as=style href=/aceas/assets/comp/rteComp.css><link rel=stylesheet href=/aceas/assets/comp/rteComp.css><link href=/aceas/assets/common/css/ErrorPage.css rel=prefetch><link href=/aceas/assets/common/css/Login.css rel=prefetch><link href=/aceas/assets/common/css/MaintenancePage.css rel=prefetch><link href=/aceas/assets/common/css/UserProfile.css rel=prefetch><link href=/aceas/assets/common/css/Workspace.css rel=prefetch><link href=/aceas/assets/common/js/ErrorPage.js rel=prefetch><link href=/aceas/assets/common/js/Login.js rel=prefetch><link href=/aceas/assets/common/js/LogoutCallback.js rel=prefetch><link href=/aceas/assets/common/js/MaintenancePage.js rel=prefetch><link href=/aceas/assets/common/js/MicroAppsContainer.js rel=prefetch><link href=/aceas/assets/common/js/OidcCallback.js rel=prefetch><link href=/aceas/assets/common/js/SilentRenewCallback.js rel=prefetch><link href=/aceas/assets/common/js/Survey.js rel=prefetch><link href=/aceas/assets/common/js/UserProfile.js rel=prefetch><link href=/aceas/assets/common/js/Workspace.js rel=prefetch><link href=/aceas/assets/common/css/index.css rel=preload as=style><link href=/aceas/assets/common/js/chunk-vendors.js rel=preload as=script><link href=/aceas/assets/common/js/index.js rel=preload as=script><link href=/aceas/assets/common/css/index.css rel=stylesheet></head><body class=xb-body><div id=common></div><noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-53ZNG4N" height=0 width=0 
style=display:none;visibility:hidden></iframe></noscript><script src=/aceas/assets/common/vendor/[email protected]/vue.min.js></script><script src=/aceas/assets/common/vendor/[email protected]/vue-router.min.js></script><script src=/aceas/assets/common/vendor/[email protected]/vuex.js></script><script src=/aceas/assets/common/vendor/regenerator-runtime/runtime.js></script><script src=/aceas/assets/comp/index.js></script><script src=/aceas/assets/comp/formBase-minified.js></script><script src=/aceas/assets/comp/rteComp.js></script><script src=/aceas/assets/comp/multiStepFormWrapper.js></script><script src=/aceas/assets/common/js/chunk-vendors.js></script><script src=/aceas/assets/common/js/index.js></script></body></html>
Process finished with exit code 0
Answer:
The data is hydrated into the page dynamically, via XHR calls to an API, which is why the HTML returned by requests is only an empty shell. You can see those calls in the Network tab of your browser's Dev tools. In the hope that your next question(s) will contain a minimal reproducible example, here is one way to scrape that information by calling the API directly:
import requests
import pandas as pd
from tqdm import tqdm

# Headers mirroring the request the browser sends to the API
headers = {
    'Origin': 'https://www.cea.gov.sg',
    'Content-Type': 'application/json;charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)
url = 'https://www.cea.gov.sg/aceas/api/internet/profile/v2/public-register/filter'

frames = []
for x in tqdm(range(1, 10)):  # pages 1 to 9, 100 records per page
    # JSON request body; only the page number changes between requests
    payload = '{"page":' + str(x) + ',"pageSize":100,"sortAscFlag":"true","registrationNumber":"R","sort":"name","profileType":2}'
    r = s.post(url, data=payload)
    frames.append(pd.json_normalize(r.json()['data']))

# Combine the per-page frames into a single dataframe
big_df = pd.concat(frames, axis=0, ignore_index=True)
print(big_df)
This will display the dataframe in the terminal:
id name businessName licenseNumber validityDateStart validityDateEnd awards disciplinaryActions registrationNumber photoUrl currentEa
0 0679f4a5-ca99-4c6c-a4c4-12528ece6294 'AFFAN BIN ASHAK HARI 'AFFAN A.H. L3002382K 2019-02-27T00:00:00+08:00 2022-12-31T23:59:59.99+08:00 None None R060832J None ERA REALTY NETWORK PTE LTD
1 c55b6be6-15fa-490c-a688-745a91839596 AARON BAN QI WEI AARON BAN L3002382K 2019-08-30T00:00:00+08:00 2022-12-31T23:59:59.99+08:00 None None R061593I None ERA REALTY NETWORK PTE LTD
2 6525dc35-5d0b-467f-8fc1-68894884e3fb AARON GOH JIN HAO None L3008022J 2019-09-16T00:00:00+08:00 2022-12-31T19:14:00+08:00 None None R052117I None PROPNEX REALTY PTE. LTD.
3 38d64cfc-5cb2-4027-9add-f01ee4ed8769 AARON HUAN SHEN LI AARON HUAN L3008022J 2019-01-01T00:00:00+08:00 2022-12-31T14:57:00+08:00 None None R041988I None PROPNEX REALTY PTE. LTD.
4 15aa903c-2402-4bed-87d2-ef6fce88a502 AARON LEONG JIA SHENG None L3008022J 2021-01-01T00:00:00+08:00 2022-12-31T23:59:59.99+08:00 None None R062835F None PROPNEX REALTY PTE. LTD.
... ... ... ... ... ... ... ... ... ... ... ...
895 d845c8a7-0713-40ba-b0d9-7f06933997ee ANG YAM NEE, AGNES JAEL JAEL ANG L3002382K 2017-08-28T00:00:00+08:00 2022-12-31T20:14:00+08:00 None None R058715C None ERA REALTY NETWORK PTE LTD
896 705a51a0-2024-4fc3-9a84-27530a579681 ANG YAN BRYAN ANG L3008022J 2018-02-27T00:00:00+08:00 2022-12-31T23:59:59.99+08:00 None None R009088G None PROPNEX REALTY PTE. LTD.
897 968a17a8-b0bb-455f-a95b-0281084d1da5 ANG YANG MING ANG YM L3008022J 2011-01-01T00:00:00+08:00 2022-12-31T20:24:00+08:00 None None R009471H None PROPNEX REALTY PTE. LTD.
898 98fbdf7a-107a-4980-a7b4-82560123769e ANG YAP CHOW Ang Yap Chow L3008899K 2021-04-30T00:00:00+08:00 2022-12-31T00:21:00+08:00 None None R063534D None HUTTONS ASIA PTE. LTD.
899 848efbcf-3050-4a1d-9394-bf4e8051f4a0 ANG YAP HWEE SUNNY L3010497H 2018-01-01T00:00:00+08:00 2022-12-31T11:44:00+08:00 None None R040155F None ASSET PROPERTY PRIVATE LIMITED
There are 341 pages, and you can go through all of them. The example above pulls only the first 9 (range(1, 10) stops before 10), which is why the dataframe has 900 rows.
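If you do want every page, one option is to keep requesting pages until one comes back empty instead of hard-coding the page count. The following is only a rough sketch: `build_payload`, `fetch_all` and `make_http_fetcher` are hypothetical helper names, and the stop condition assumes the API returns an empty 'data' list once you go past the last page, which I have not verified. The `max_pages` cap and the short delay between requests are there as a safety net and to be polite to the server.

```python
import json
import time
from typing import Callable, Dict, List

API_URL = 'https://www.cea.gov.sg/aceas/api/internet/profile/v2/public-register/filter'


def build_payload(page: int, page_size: int = 100) -> str:
    """Serialize the same filter body as above, for the given page number."""
    return json.dumps({
        'page': page,
        'pageSize': page_size,
        'sortAscFlag': 'true',
        'registrationNumber': 'R',
        'sort': 'name',
        'profileType': 2,
    })


def fetch_all(fetch_page: Callable[[int], List[Dict]], max_pages: int = 1000) -> List[Dict]:
    """Collect records page by page until a page is empty or max_pages is reached."""
    rows: List[Dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # ASSUMPTION: an empty page means there is no more data
            break
        rows.extend(batch)
    return rows


def make_http_fetcher(session, delay: float = 0.5) -> Callable[[int], List[Dict]]:
    """Wrap a requests.Session (with the headers set as above) as a page fetcher."""
    def fetch_page(page: int) -> List[Dict]:
        r = session.post(API_URL, data=build_payload(page), timeout=30)
        r.raise_for_status()   # fail loudly on HTTP errors
        time.sleep(delay)      # small pause between requests
        return r.json()['data']
    return fetch_page
```

With the session s from the example above, rows = fetch_all(make_http_fetcher(s)) followed by pd.json_normalize(rows) should give one dataframe for the whole register. Passing the fetcher in as a function also lets you try the loop with fake data before hitting the real API.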
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html
Requests documentation: https://requests.readthedocs.io/en/latest/