I am trying to scrape the data from this government website: https://www.itf.gov.hk/en/project-search/search-result/index.html?isAdvSearch=1&Programmes=TVP However, after reading a lot about web scraping and following YouTube videos, I still can't do it. Can someone please help?
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.itf.gov.hk/en/project-search/project-profile/index.html?ReferenceNo=TVP/2122/22'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
soup

table = soup.find('table', {'class': 'colorTbl projDetailTbl'})

headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

df = pd.DataFrame(columns=headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
The table doesn't show up at all; soup.find() returns None. Please help me.
CodePudding user response:
The table is rendered through JavaScript, and the data is returned through an API. You need to get the data from that source instead.
Code:
import pandas as pd
import requests

tokenUrl = 'https://www.itf.gov.hk/API/Token/Get'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
token = requests.get(tokenUrl, headers=headers).text

url = 'https://www.itf.gov.hk/API/Project/Search'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    'verificationtoken': token}

page = 1
rows = []
while True:
    payload = {
        'Page': '%s' % page,
        'Programmes[]': 'TVP'}

    jsonData = requests.post(url, headers=headers, data=payload).json()
    rows += jsonData['Records']
    page += 1

    print(f"{len(rows)} of {jsonData['Total']}")
    if len(rows) == jsonData['Total']:
        print('Complete')
        break

df = pd.DataFrame(rows)
Output sample of first 10 pages:
print(df)
Reference ... ApprovedAmount
0 TVP/2122/22 ... 161550.0
1 TVP/2120/22 ... 152100.0
2 TVP/2107/22 ... 225750.0
3 TVP/2105/22 ... 183750.0
4 TVP/2103/22 ... 241875.0
.. ... ... ...
95 TVP/1826/22 ... 262500.0
96 TVP/1825/22 ... 189600.0
97 TVP/1820/22 ... 152250.0
98 TVP/1819/22 ... 180187.5
99 TVP/1818/22 ... 225750.0
[100 rows x 11 columns]
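Once the loop finishes, you may want to persist the frame to disk. A minimal sketch of the round-trip; the filename and the sample rows below are made-up placeholders, not real API output:

```python
import pandas as pd

# stand-in frame with the same kind of columns as the scraped result
df = pd.DataFrame({
    'Reference': ['TVP/2122/22', 'TVP/2120/22'],
    'ApprovedAmount': [161550.0, 152100.0],
})

# write without the index so the CSV round-trips cleanly
df.to_csv('tvp_projects.csv', index=False)

# reading the file back is a cheap sanity check that nothing was lost
check = pd.read_csv('tvp_projects.csv')
```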
CodePudding user response:
After inspecting the website, I see no such class projDetailTbl. However, you can use the following code to process the data from the website.
table = soup.find('table', class_='colorTbl')
Or you can search by id, in case there are multiple tables with the same class name:
table = soup.find('table', id='searchResultOutput')
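To illustrate the difference between the two lookups on a toy fragment (the HTML below is invented for the example, not taken from the live page):

```python
from bs4 import BeautifulSoup

html = """
<table class="colorTbl"><tr><th>first</th></tr></table>
<table class="colorTbl" id="searchResultOutput"><tr><th>second</th></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# a class lookup returns the first match, which may not be the table you want
by_class = soup.find('table', class_='colorTbl')

# an id lookup is unambiguous, since an id is unique within a document
by_id = soup.find('table', id='searchResultOutput')
```

Bear in mind that, as the other answers point out, on this particular site the table body is filled in by JavaScript, so neither lookup will find data in the raw HTML.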
CodePudding user response:
I'm afraid it cannot be done this way, as the site generates the table with JavaScript. The body of the HTML that you are getting in your soup contains no data:
<body>
<div id="content"><script src="/filemanager/template/common/js/search.js" type="text/javascript"></script>
<h1>Project Profile</h1>
<div id="projectProfile">
<table>
<tbody></tbody>
</table>
<div id="techAreaRemark" style="display: none;">* The primary technology area as indicated by the project coordinator is placed first.</div>
</div></div>
</body>
That is the soup (part of it) that you are getting with:
url = 'https://www.itf.gov.hk/en/project-search/project-profile/index.html?ReferenceNo=TVP/2122/22'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
The content of the page is filled in afterwards by the script referenced in:
<div id="content"><script src="/filemanager/template/common/js/search.js" type="text/javascript"></script> ...
...
</div>
Your code is probably OK, but the page you are scraping is not yet filled with data at the moment you make the request.
Maybe you could try to get it using Selenium.....
Regards...
CodePudding user response:
I used Selenium and webdriver_manager to handle the JavaScript execution.
To install Selenium, run pip install selenium
and to load the drivers automatically, install webdriver_manager with pip install webdriver-manager
Here is my code (worked for me):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.common.exceptions import NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager  # automatic webdriver for the Chrome browser (can change to your browser)
import pandas as pd

URL = 'https://www.itf.gov.hk/en/project-search/project-profile/index.html?ReferenceNo=TVP/2122/22'

# opening the page
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options, executable_path=ChromeDriverManager().install())
driver.get(URL)
time.sleep(4)  # wait 4 seconds so the table has time to load on the site

# getting the name of each row and its content
data = {}
index_elem = 1
while True:
    # walk through the rows until we reach a non-existent one
    try:
        columns_name = driver.find_element(
            By.XPATH, f'//*[@id="projectProfile"]/table/tbody/tr[{index_elem}]/th').text
        columns_content = driver.find_element(
            By.XPATH, f'//*[@id="projectProfile"]/table/tbody/tr[{index_elem}]/td').text
        data[columns_name] = [columns_content]
        index_elem += 1
    except NoSuchElementException:
        break

df = pd.DataFrame(data)
print(df)
Output:
Project Reference ... Technological Solution(s)
0 TVP/2122/22 ... Point-of-Sales (POS) System