I am using IronPython 2.7.9.0 on Grasshopper and Rhino to web scrape data from a specific widget on this link: https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en
The code I am using is as follows
import urllib
import os
web = urllib.urlopen(url)
html = web.read()
web.close()
The html output contains all the html code from this link except for the parts I need. When I inspect it on chrome it has a "flex" button next to it such as the following image.
image that summarizes the issue I am facing
Anything that is rooted under the line with a "flex" button does not appear in the scraping result and comes as a blank line.
This is the output html I get:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Central Library - Duhig North & Link</title>
<meta charset="utf-8">
<meta name="google" content="notranslate">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="csrf-token" content="">
<link rel="stylesheet" href="/build/app.css?id=2fefc4f9faa59eebcb4b">
<link rel="stylesheet" href="https://vemcount.app/fonts/hamburg_serial/stylesheet.css">
<style>
#embed, #main {
height: 100vh;
}
.vue-grid-item {
margin-bottom: 0px !important;
}
.powered_by {
position: absolute;
bottom: 0px;
right: 0px;
background-color: rgba(0, 0, 0, 0.18);
color: #fff;
padding: 2px 5px;
font-size: 9px;
}
.powered_by:hover, .powered_by:link, .powered_by:visited {
text-decoration: none;
display: none;
}
.dashboard-widget .relative {
overflow: hidden !important;
}
</style>
<script>
window.App = {"socketAppKey":"eJSkWUHWpwolvjVcT2ZxUJZXnDpxtRljdZl74fKr","socketCluster":null,"socketHost":"websocket.vemcount.com","socketPort":443,"socketSecurePort":443,"socketDisableStats":true,"socketEncrypted":true,"locale":"en","settings":[{"name":"type","value":"{\"count_in\":\"column\"}"},{"name":"period","value":"[\"yesterday\"]"},{"name":"period_step","value":"hour"},{"name":"hide_datalabel","value":"0"},{"name":"currency","value":"AUD"},{"name":"show_days","value":"[0,1,2,3,4,5,6]"},{"name":"show_months","value":"[1,2,3,4,5,6,7,8,9,10,11,12]"},{"name":"show_hours_from","value":"00:00"},{"name":"show_hours_to","value":"23:45"},{"name":"data_heatmap","value":"blue"},{"name":"weather_metrics","value":"0"},{"name":"first_day_of_week","value":"1"},{"name":"time_format24","value":"time_format24"},{"name":"date_time_format","value":"2"},{"name":"number_grouping","value":","},{"name":"number_decimal","value":"."},{"name":"opening_hours_overlap","value":"0"},{"name":"data_output","value":"count_in"}],"sound":null};
</script>
<script src="/build/lang/en.js?v=2022.04.4"></script>
</head>
<body >
<main id="main">
<div id="embed" >
<div style="position: absolute;">
<live-inside :embedded="true" :widget="{"id":81438,"pane_id":4005,"title":"Central Library - Duhig North & Link","description":"Live occupancy \/ Seating capacity","x":0,"y":0,"w":2,"h":1,"bg_color":"red","text_color":"black","type":"live-inside","secret":"uOCRuLPangWo5fT","internal":"VRg4JTIRrtJ7Pwg","embeddable":1,"content":{"target":1100,"bidirectional":true,"target_enable":true,"prettify":false,"target_type":"donut","target_donut_hide_metric":false,"target_donut_target_hide_label":false,"target_visual_inside_text":null,"target_visual_available_text":null,"target_screen_ok_title":null,"target_screen_ok_text":null,"target_screen_ok_color":"#38A169","target_screen_ok_image":-1,"target_screen_warning_title":null,"target_screen_warning_pe</live-inside>
</div>
</div>
</main>
<a title=" Vemco Group A/S " target="_blank"
href="http://vemcount.com">Powered by
<b>vemcount.com</b>
</a>
<script src="/build/manifest.js?id=7f2e9aa3431c681a4683"></script>
<script src="/build/vendor.js?id=19867aae3b960cda7d79"></script>
<script src="/build/embed.js?id=2ff0173dd78c5c1f99c6"></script>
</body>
</html>
As you can see it is missing some lines, which are the lines that have a flex button next to them. (btw I have shortended the code that is in so I dont reach the 30000 character limit).
I am interested in the number 311 which changes every 2 seconds in the live link and it can be found in the html code between
<span>311</span>
Is there a way I can get this value, as well as any other value, using IronPython?
P.S. I am a noob in actual coding, that's why I might have issues with terminologies, but have a fair background in visual scripting. Your help is much appreciated. Thanks.
CodePudding user response:
Just in case you had the same query or were struggling with dynamic web scraping. You have to use CPython and install a webscraper such as Playwright or BS Selenium
I used playwright which is far more straightforward and has a very much appreciated inner_html()
function which reads straight into the dynamic flex HTML code. Here is the code for reference.
#part of the help to write the script I got from https://stackoverflow.com/questions/64303326/using-playwright-for-python-how-do-i-select-or-find-an-element
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(slow_mo=1000)
page = browser.new_page()
page.goto('https://vemcount.app/embed/widget/uOCRuLPangWo5fT')
central = page.query_selector("p.w-full span");
print({'central': central.inner_html()})
browser.close()
Afterwards I am trying to run the .py script remotely from Grasshopper through a batch file and read the output through a txt or CSV file from within Grasshopper.
If there is a better way I am more than happy to hear your suggestions.
Yours,
A Beginner in Python. :)