How can I scrape a website that does not show any HTML codes in the source using Python without Sele

Time:09-07

I would like to scrape a website whose page source does not contain any of the HTML content I need, using Python.

I have tried doing it with Selenium, but I would like to do it without Selenium, as I am having difficulty importing it on my Raspberry Pi.

https://www.cea.gov.sg/aceas/public-register/sales/1?page=1&pageSize=10&sortAscFlag=true&sort=name&registrationNumber=R

This is the page I am interested in scraping, but I can't seem to do it effectively. I have tried using bs4 and requests, but there is no HTML content for me to work with, and I can't find other libraries that can do it without Selenium.

import requests

r = requests.get('https://www.cea.gov.sg/aceas/public-register/sales/1?page=1&pageSize=10&sortAscFlag=true&sort=name&registrationNumber=R')
print(r.text)

This is a simplified version of what I have tried before.

D:\Codes\venv\Scripts\python.exe D:/Codes/requests_test.py 
<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta name=description content="The Council for Estate Agencies is the government agency that regulates Singapore’s real estate agency industry."><link rel=icon href=/aceas/assets/common/favicon.ico><title>ACEAS</title><script src=https://assets.wogaa.sg/scripts/wogaa.js></script><script>(function(w, d, s, l, i) {
        w[l] = w[l] || [];
        w[l].push({ "gtm.start": new Date().getTime(), event: "gtm.js" });
        var f = d.getElementsByTagName(s)[0],
          j = d.createElement(s),
          dl = l != "dataLayer" ? "&l=" + l : "";
        j.async = true;
        j.src = "https://www.googletagmanager.com/gtm.js?id=" + i + dl;
        f.parentNode.insertBefore(j, f);
      })(window, document, "script", "dataLayer", "GTM-53ZNG4N");</script><link rel=preload as=style href=/aceas/assets/comp/vendor-style.css><link rel=stylesheet href=/aceas/assets/comp/vendor-style.css><link rel=preload as=style href=/aceas/assets/comp/index.css><link rel=stylesheet href=/aceas/assets/comp/index.css><link rel=preload as=style href=/aceas/assets/comp/formBase-minified.css><link rel=stylesheet href=/aceas/assets/comp/formBase-minified.css><link rel=preload as=style href=/aceas/assets/comp/rteComp.css><link rel=stylesheet href=/aceas/assets/comp/rteComp.css><link href=/aceas/assets/common/css/ErrorPage.css rel=prefetch><link href=/aceas/assets/common/css/Login.css rel=prefetch><link href=/aceas/assets/common/css/MaintenancePage.css rel=prefetch><link href=/aceas/assets/common/css/UserProfile.css rel=prefetch><link href=/aceas/assets/common/css/Workspace.css rel=prefetch><link href=/aceas/assets/common/js/ErrorPage.js rel=prefetch><link href=/aceas/assets/common/js/Login.js rel=prefetch><link href=/aceas/assets/common/js/LogoutCallback.js rel=prefetch><link href=/aceas/assets/common/js/MaintenancePage.js rel=prefetch><link href=/aceas/assets/common/js/MicroAppsContainer.js rel=prefetch><link href=/aceas/assets/common/js/OidcCallback.js rel=prefetch><link href=/aceas/assets/common/js/SilentRenewCallback.js rel=prefetch><link href=/aceas/assets/common/js/Survey.js rel=prefetch><link href=/aceas/assets/common/js/UserProfile.js rel=prefetch><link href=/aceas/assets/common/js/Workspace.js rel=prefetch><link href=/aceas/assets/common/css/index.css rel=preload as=style><link href=/aceas/assets/common/js/chunk-vendors.js rel=preload as=script><link href=/aceas/assets/common/js/index.js rel=preload as=script><link href=/aceas/assets/common/css/index.css rel=stylesheet></head><body class=xb-body><div id=common></div><noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-53ZNG4N" height=0 width=0 
style=display:none;visibility:hidden></iframe></noscript><script src=/aceas/assets/common/vendor/[email protected]/vue.min.js></script><script src=/aceas/assets/common/vendor/[email protected]/vue-router.min.js></script><script src=/aceas/assets/common/vendor/[email protected]/vuex.js></script><script src=/aceas/assets/common/vendor/regenerator-runtime/runtime.js></script><script src=/aceas/assets/comp/index.js></script><script src=/aceas/assets/comp/formBase-minified.js></script><script src=/aceas/assets/comp/rteComp.js></script><script src=/aceas/assets/comp/multiStepFormWrapper.js></script><script src=/aceas/assets/common/js/chunk-vendors.js></script><script src=/aceas/assets/common/js/index.js></script></body></html>

Process finished with exit code 0

CodePudding user response:

In the hope that your next question(s) will contain a minimal reproducible example, here is one way to scrape that information. The data is loaded into the page dynamically, via XHR calls to an API, which is why requests only returns an empty page shell. You can see those calls in the Network tab of your browser's Dev tools. The code below sends the same request to that API directly:

import json

import pandas as pd
import requests
from tqdm import tqdm

# Headers sent with every request in the session.
headers = {
    'Origin': 'https://www.cea.gov.sg',
    'Content-Type': 'application/json;charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)
url = 'https://www.cea.gov.sg/aceas/api/internet/profile/v2/public-register/filter'

frames = []
for x in tqdm(range(1, 10)):  # pages 1 to 9, 100 records each
    # JSON request body; only the page number changes between requests.
    payload = json.dumps({"page": x, "pageSize": 100, "sortAscFlag": "true", "registrationNumber": "R", "sort": "name", "profileType": 2})
    r = s.post(url, data=payload)
    r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    frames.append(pd.json_normalize(r.json()['data']))

# Concatenate once at the end rather than growing the dataframe inside the loop.
big_df = pd.concat(frames, axis=0, ignore_index=True)
print(big_df)

This will display the dataframe in the terminal:

id  name    businessName    licenseNumber   validityDateStart   validityDateEnd awards  disciplinaryActions registrationNumber  photoUrl    currentEa
0   0679f4a5-ca99-4c6c-a4c4-12528ece6294    'AFFAN BIN ASHAK HARI   'AFFAN A.H. L3002382K   2019-02-27T00:00:00+08:00   2022-12-31T23:59:59.99+08:00    None    None    R060832J    None    ERA REALTY NETWORK PTE LTD
1   c55b6be6-15fa-490c-a688-745a91839596    AARON BAN QI WEI    AARON BAN   L3002382K   2019-08-30T00:00:00+08:00   2022-12-31T23:59:59.99+08:00    None    None    R061593I    None    ERA REALTY NETWORK PTE LTD
2   6525dc35-5d0b-467f-8fc1-68894884e3fb    AARON GOH JIN HAO   None    L3008022J   2019-09-16T00:00:00+08:00   2022-12-31T19:14:00+08:00   None    None    R052117I    None    PROPNEX REALTY PTE. LTD.
3   38d64cfc-5cb2-4027-9add-f01ee4ed8769    AARON HUAN SHEN LI  AARON HUAN  L3008022J   2019-01-01T00:00:00+08:00   2022-12-31T14:57:00+08:00   None    None    R041988I    None    PROPNEX REALTY PTE. LTD.
4   15aa903c-2402-4bed-87d2-ef6fce88a502    AARON LEONG JIA SHENG   None    L3008022J   2021-01-01T00:00:00+08:00   2022-12-31T23:59:59.99+08:00    None    None    R062835F    None    PROPNEX REALTY PTE. LTD.
... ... ... ... ... ... ... ... ... ... ... ...
895 d845c8a7-0713-40ba-b0d9-7f06933997ee    ANG YAM NEE, AGNES JAEL JAEL ANG    L3002382K   2017-08-28T00:00:00+08:00   2022-12-31T20:14:00+08:00   None    None    R058715C    None    ERA REALTY NETWORK PTE LTD
896 705a51a0-2024-4fc3-9a84-27530a579681    ANG YAN BRYAN ANG   L3008022J   2018-02-27T00:00:00+08:00   2022-12-31T23:59:59.99+08:00    None    None    R009088G    None    PROPNEX REALTY PTE. LTD.
897 968a17a8-b0bb-455f-a95b-0281084d1da5    ANG YANG MING   ANG YM  L3008022J   2011-01-01T00:00:00+08:00   2022-12-31T20:24:00+08:00   None    None    R009471H    None    PROPNEX REALTY PTE. LTD.
898 98fbdf7a-107a-4980-a7b4-82560123769e    ANG YAP CHOW    Ang Yap Chow    L3008899K   2021-04-30T00:00:00+08:00   2022-12-31T00:21:00+08:00   None    None    R063534D    None    HUTTONS ASIA PTE. LTD.
899 848efbcf-3050-4a1d-9394-bf4e8051f4a0    ANG YAP HWEE    SUNNY   L3010497H   2018-01-01T00:00:00+08:00   2022-12-31T11:44:00+08:00   None    None    R040155F    None    ASSET PROPERTY PRIVATE LIMITED

There are 341 pages in total, and you can go through all of them. The example above only pulls pages 1 to 9 (range(1, 10) stops before 10), which gives the 900 rows shown.
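
If you would rather not hard-code the page count, one option is to keep requesting pages until one comes back empty. The following is only a sketch under assumptions: it assumes the API returns an empty 'data' list once you go past the last page (not verified here), and the helper names build_payload, iter_records and scrape_all, as well as the max_pages cap and the 30-second timeout, are made up for illustration. The URL, headers, payload fields and the 'data' key are the same as in the code above.

```python
import json
from typing import Callable, Dict, Iterator, List

API_URL = 'https://www.cea.gov.sg/aceas/api/internet/profile/v2/public-register/filter'


def build_payload(page: int, page_size: int = 100) -> str:
    """Return the JSON request body for one page (same fields as the code above)."""
    return json.dumps({
        'page': page,
        'pageSize': page_size,
        'sortAscFlag': 'true',
        'registrationNumber': 'R',
        'sort': 'name',
        'profileType': 2,
    })


def iter_records(fetch_page: Callable[[int], List[Dict]],
                 max_pages: int = 1000) -> Iterator[Dict]:
    """Yield records page by page, stopping at the first empty page.

    max_pages is a hypothetical safety cap so the loop always terminates.
    """
    for page in range(1, max_pages + 1):
        rows = fetch_page(page)
        if not rows:  # ASSUMPTION: an empty 'data' list marks the end
            break
        yield from rows


def scrape_all() -> List[Dict]:
    """Fetch every page from the live API (needs the requests package)."""
    import requests  # imported here so the helpers above work without it

    session = requests.Session()
    session.headers.update({
        'Origin': 'https://www.cea.gov.sg',
        'Content-Type': 'application/json;charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36',
    })

    def fetch_page(page: int) -> List[Dict]:
        r = session.post(API_URL, data=build_payload(page), timeout=30)
        r.raise_for_status()
        return r.json()['data']

    return list(iter_records(fetch_page))
```

Calling records = scrape_all() and then pd.json_normalize(records) should give the same kind of dataframe as before. Passing the page-fetching function into iter_records keeps the stopping logic separate from the network code, so it can be tried with a fake function before hitting the real site.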

Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html

Requests documentation: https://requests.readthedocs.io/en/latest/
