How can I web scrape a government website with Python? I cannot do it properly; the table just cannot be found


I am trying to scrape the data from this government website: https://www.itf.gov.hk/en/project-search/search-result/index.html?isAdvSearch=1&Programmes=TVP. However, after reading a lot about web scraping and following YouTube videos, I still can't do it. Can someone please help?

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.itf.gov.hk/en/project-search/project-profile/index.html?ReferenceNo=TVP/2122/22'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find('table',{'class':'colorTbl projDetailTbl'})
headers=[]
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)
 
df = pd.DataFrame(columns=headers)
 
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data 

The table didn't show up at all; the result is None. Please help me.

CodePudding user response:

The table is rendered through JavaScript, and the data is returned through an API. You need to get the data from that source instead.

Code:

import pandas as pd
import requests


# the Search API requires a verification token, fetched from this endpoint first
tokenUrl = 'https://www.itf.gov.hk/API/Token/Get'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }

token = requests.get(tokenUrl, headers=headers).text

url = 'https://www.itf.gov.hk/API/Project/Search'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    'verificationtoken': token}


page = 1
rows = []
while True:
    payload = {
        'Page': str(page),
        'Programmes[]': 'TVP'}

    jsonData = requests.post(url, headers=headers, data=payload).json()

    rows += jsonData['Records']  # accumulate this page's records
    page += 1
    print(f"{len(rows)} of {jsonData['Total']}")
    if len(rows) == jsonData['Total']:
        print('Complete')
        break

df = pd.DataFrame(rows)

Sample output after the first 10 pages:

print(df)
      Reference  ... ApprovedAmount
0   TVP/2122/22  ...       161550.0
1   TVP/2120/22  ...       152100.0
2   TVP/2107/22  ...       225750.0
3   TVP/2105/22  ...       183750.0
4   TVP/2103/22  ...       241875.0
..          ...  ...            ...
95  TVP/1826/22  ...       262500.0
96  TVP/1825/22  ...       189600.0
97  TVP/1820/22  ...       152250.0
98  TVP/1819/22  ...       180187.5
99  TVP/1818/22  ...       225750.0

[100 rows x 11 columns]
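As a variant (a sketch, not part of the answer above), the token request and the paginated search requests can share a single requests.Session, so the headers are set once and the underlying connection is reused:

import pandas as pd
import requests

with requests.Session() as s:
    # headers set once on the session apply to every request
    s.headers.update({'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'})
    s.headers['verificationtoken'] = s.get('https://www.itf.gov.hk/API/Token/Get').text

    # same endpoint and payload fields as in the answer above
    rows, page = [], 1
    while True:
        jsonData = s.post('https://www.itf.gov.hk/API/Project/Search',
                          data={'Page': str(page), 'Programmes[]': 'TVP'}).json()
        rows += jsonData['Records']
        page += 1
        if len(rows) >= jsonData['Total']:
            break

df = pd.DataFrame(rows)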

CodePudding user response:

After inspecting the website, I see no such class as projDetailTbl. However, you can use the following code to process the data from the website.

table = soup.find('table', class_='colorTbl')

Or you can search by id, in case there are multiple tables with the same class:

table = soup.find('table', id='searchResultOutput')
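Note that if the table is injected client-side (as the other answers point out), both selectors will still return None with plain requests. A quick sanity check, before writing any parsing code, is to look for the id or class in the raw HTML; here is a minimal sketch using the URL from the question:

import requests

url = 'https://www.itf.gov.hk/en/project-search/project-profile/index.html?ReferenceNo=TVP/2122/22'
html = requests.get(url).text

# If these print False, the table you see in DevTools is rendered by
# JavaScript and will never appear in the static response.
print('projDetailTbl' in html)
print('searchResultOutput' in html)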

CodePudding user response:

I'm afraid it cannot be done this way, as the site generates the table with JavaScript. The body of the HTML that you are getting in your soup contains no data:

<body>
<div id="content"><script src="/filemanager/template/common/js/search.js" type="text/javascript"></script>
<h1>Project Profile</h1>
<div id="projectProfile">
<table>
<tbody></tbody>
</table>
<div id="techAreaRemark" style="display: none;">* The primary technology area as indicated by the project coordinator is placed first.</div>
</div></div>
</body>

That is (part of) the soup you are getting with:

url = 'https://www.itf.gov.hk/en/project-search/project-profile/index.html?ReferenceNo=TVP/2122/22'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

The content of the page is filled in afterwards by:

<div id="content"><script src="/filemanager/template/common/js/search.js" type="text/javascript"></script> ...
...
</div>

Your code is probably OK, but the page you are scraping is not yet filled with data at the time of your request. Maybe you could try to get it using Selenium.
Regards.

CodePudding user response:

I used selenium and webdriver_manager to handle the JavaScript execution.

To install Selenium, run pip install selenium; to load the driver automatically, install webdriver_manager with pip install webdriver-manager.

Here is my code (worked for me):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager  # automatic webdriver for Chrome (change to match your browser)
import time
import pandas as pd

URL = 'https://www.itf.gov.hk/en/project-search/project-profile/index.html?ReferenceNo=TVP/2122/22'

# opening the page so the JavaScript can render the table
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get(URL)
time.sleep(4)  # wait 4 seconds so the table has time to load

# collecting each row's header (th) and its content (td)
data = {}
index_elem = 1
while True:
    # walk the rows until we reach one that does not exist
    try:
        columns_name = driver.find_element(
            By.XPATH, f'//*[@id="projectProfile"]/table/tbody/tr[{index_elem}]/th').text
        columns_content = driver.find_element(
            By.XPATH, f'//*[@id="projectProfile"]/table/tbody/tr[{index_elem}]/td').text
        data[columns_name] = [columns_content]
        index_elem += 1
    except NoSuchElementException:
        break

df = pd.DataFrame(data)
print(df)

Output:

  Project Reference  ...    Technological Solution(s)
0       TVP/2122/22  ...  Point-of-Sales (POS) System
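As a variant (a sketch, not from the original answer), the fixed time.sleep(4) can be replaced with one of Selenium's explicit waits, which blocks only until the first table row actually exists; this assumes the same driver and table structure as the code above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds for the first row of the profile table to render
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, '//*[@id="projectProfile"]/table/tbody/tr[1]/th')))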