How to extract specific HTML lines (with a flex container) using IronPython?

I am using IronPython 2.7.9.0 on Grasshopper and Rhino to web scrape data from a specific widget on this link: https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en

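The code I am using is as follows:

import urllib

# The widget link from above; urllib.urlopen is the Python 2 / IronPython 2.7 call
url = "https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en"

web = urllib.urlopen(url)
html = web.read()
web.close()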

The html output contains all the HTML code from this link except for the parts I need. When I inspect the page in Chrome, the element I need has a "flex" button next to it, as in the following image.

[Image: summary of the issue I am facing]

Anything that is nested under the line with a "flex" button does not appear in the scraping result; it comes back as a blank line.

This is the output html I get:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Central Library - Duhig North &amp; Link</title>

    <meta charset="utf-8">
    <meta name="google" content="notranslate">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="csrf-token" content="">
    <link rel="stylesheet" href="/build/app.css?id=2fefc4f9faa59eebcb4b">
    <link rel="stylesheet" href="https://vemcount.app/fonts/hamburg_serial/stylesheet.css">

    <style>
        #embed, #main {
            height: 100vh;
        }

        .vue-grid-item {
            margin-bottom: 0px !important;
        }

        .powered_by {
            position: absolute;
            bottom: 0px;
            right: 0px;
            background-color: rgba(0, 0, 0, 0.18);
            color: #fff;
            padding: 2px 5px;
            font-size: 9px;
        }

        .powered_by:hover, .powered_by:link, .powered_by:visited {
            text-decoration: none;
            display: none;
        }

        .dashboard-widget .relative {
            overflow: hidden !important;
        }

        
    </style>

    <script>
        window.App = {"socketAppKey":"eJSkWUHWpwolvjVcT2ZxUJZXnDpxtRljdZl74fKr","socketCluster":null,"socketHost":"websocket.vemcount.com","socketPort":443,"socketSecurePort":443,"socketDisableStats":true,"socketEncrypted":true,"locale":"en","settings":[{"name":"type","value":"{\"count_in\":\"column\"}"},{"name":"period","value":"[\"yesterday\"]"},{"name":"period_step","value":"hour"},{"name":"hide_datalabel","value":"0"},{"name":"currency","value":"AUD"},{"name":"show_days","value":"[0,1,2,3,4,5,6]"},{"name":"show_months","value":"[1,2,3,4,5,6,7,8,9,10,11,12]"},{"name":"show_hours_from","value":"00:00"},{"name":"show_hours_to","value":"23:45"},{"name":"data_heatmap","value":"blue"},{"name":"weather_metrics","value":"0"},{"name":"first_day_of_week","value":"1"},{"name":"time_format24","value":"time_format24"},{"name":"date_time_format","value":"2"},{"name":"number_grouping","value":","},{"name":"number_decimal","value":"."},{"name":"opening_hours_overlap","value":"0"},{"name":"data_output","value":"count_in"}],"sound":null};
    </script>

    <script src="/build/lang/en.js?v=2022.04.4"></script>

</head>

<body >

<main id="main">
    <div id="embed" >
        
    <div  style="position: absolute;">

        
        
        
        
        
        
        
        
                    <live-inside :embedded="true" :widget="{&quot;id&quot;:81438,&quot;pane_id&quot;:4005,&quot;title&quot;:&quot;Central Library - Duhig North &amp; Link&quot;,&quot;description&quot;:&quot;Live occupancy \/ Seating capacity&quot;,&quot;x&quot;:0,&quot;y&quot;:0,&quot;w&quot;:2,&quot;h&quot;:1,&quot;bg_color&quot;:&quot;red&quot;,&quot;text_color&quot;:&quot;black&quot;,&quot;type&quot;:&quot;live-inside&quot;,&quot;secret&quot;:&quot;uOCRuLPangWo5fT&quot;,&quot;internal&quot;:&quot;VRg4JTIRrtJ7Pwg&quot;,&quot;embeddable&quot;:1,&quot;content&quot;:{&quot;target&quot;:1100,&quot;bidirectional&quot;:true,&quot;target_enable&quot;:true,&quot;prettify&quot;:false,&quot;target_type&quot;:&quot;donut&quot;,&quot;target_donut_hide_metric&quot;:false,&quot;target_donut_target_hide_label&quot;:false,&quot;target_visual_inside_text&quot;:null,&quot;target_visual_available_text&quot;:null,&quot;target_screen_ok_title&quot;:null,&quot;target_screen_ok_text&quot;:null,&quot;target_screen_ok_color&quot;:&quot;#38A169&quot;,&quot;target_screen_ok_image&quot;:-1,&quot;target_screen_warning_title&quot;:null,&quot;target_screen_warning_pe</live-inside>
            
            
                
        
        
        
                
    </div>

    </div>
</main>

<a title=" Vemco Group A/S "  target="_blank"
   href="http://vemcount.com">Powered by
    <b>vemcount.com</b>
</a>

<script src="/build/manifest.js?id=7f2e9aa3431c681a4683"></script>
<script src="/build/vendor.js?id=19867aae3b960cda7d79"></script>
<script src="/build/embed.js?id=2ff0173dd78c5c1f99c6"></script>

</body>
</html>

As you can see, it is missing some lines, namely the ones that have a flex button next to them. (By the way, I have shortened the code inside the live-inside element so that I don't reach the 30000 character limit.)

I am interested in the number 311, which changes every 2 seconds on the live link. In the HTML code it sits inside a span element:

<span>311</span>

Is there a way I can get this value, as well as any other value, using IronPython?

P.S. I am a noob in actual coding, that's why I might have issues with terminologies, but have a fair background in visual scripting. Your help is much appreciated. Thanks.

Answer:

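In case you have the same question or are struggling with dynamic web scraping: the missing elements are rendered by JavaScript after the page loads, so a plain HTTP request such as urllib.urlopen never sees them. You have to use CPython and install a browser automation tool such as Playwright or Selenium (with BeautifulSoup).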

I used Playwright, which I found far more straightforward. Its very much appreciated inner_html() method reads the rendered HTML of an element, including the dynamic content under the flex container. Here is the code for reference.

# Part of the help to write this script came from https://stackoverflow.com/questions/64303326/using-playwright-for-python-how-do-i-select-or-find-an-element

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # slow_mo slows every Playwright operation down by 1000 ms
    browser = p.chromium.launch(slow_mo=1000)

    page = browser.new_page()
    page.goto('https://vemcount.app/embed/widget/uOCRuLPangWo5fT')
    # First <span> inside a <p class="w-full">; this holds the live count
    central = page.query_selector("p.w-full span")
    print({'central': central.inner_html()})

    browser.close()
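
Since the number changes every couple of seconds, a single read is only a snapshot. Below is a minimal sketch of a polling helper. The name poll and its parameters are made up for illustration and are not part of Playwright; read_value can be any function that returns the current value.

```python
import time

def poll(read_value, count=5, interval=2.0):
    """Call read_value() `count` times, sleeping `interval` seconds between calls."""
    readings = []
    for i in range(count):
        readings.append(read_value())
        # No need to sleep after the final reading
        if i < count - 1:
            time.sleep(interval)
    return readings

# Hypothetical use inside the `with sync_playwright()` block, while the page is open:
# readings = poll(lambda: page.query_selector("p.w-full span").inner_html())
```

Passing the reader in as a function keeps the helper independent of the browser, so it can be tried out with a dummy function first.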

Next, I am trying to run the .py script from Grasshopper through a batch file and read its output from a TXT or CSV file within Grasshopper.
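
The file handoff could be sketched roughly like this. Both function names and the two-column "timestamp,value" layout are assumptions chosen for illustration. append_reading is meant for the CPython script, and read_latest uses only basic file operations so that it should also run inside Grasshopper's IronPython 2.7, though that is untested here. The reader assumes the value contains no commas or quotes.

```python
import csv
from datetime import datetime

def append_reading(path, value):
    """CPython side: append one timestamped reading as a CSV row."""
    with open(path, "a", newline="") as f:
        stamp = datetime.now().isoformat(timespec="seconds")
        csv.writer(f).writerow([stamp, value])

def read_latest(path):
    """Reader side: return (timestamp, value) from the last non-empty row, or None."""
    with open(path, "r") as f:
        lines = [line.strip() for line in f if line.strip()]
    if not lines:
        return None
    timestamp, value = lines[-1].split(",", 1)
    return timestamp, value

# Hypothetical use in the scraping script, instead of print:
# append_reading("occupancy.csv", central.inner_html().strip())
```

Appending rather than overwriting keeps a history of readings, and the reader only ever looks at the last row.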

If there is a better way I am more than happy to hear your suggestions.
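
One possible complement, not a replacement for the browser approach: settings that do not change, such as the target of 1100 in the :widget attribute, are already present in the static HTML and can be decoded without a browser. This does not give the live count. The sketch below is written for CPython 3, and SAMPLE is an abbreviated stand-in for the real attribute, not the full value. Under IronPython 2.7, html.unescape would have to be swapped for something like HTMLParser.HTMLParser().unescape, which is an assumption I have not tested there.

```python
import html
import json
import re

# Abbreviated stand-in for the <live-inside> element in the static page
SAMPLE = ('<live-inside :embedded="true" :widget="{&quot;id&quot;:81438,'
          '&quot;title&quot;:&quot;Central Library - Duhig North &amp; Link&quot;,'
          '&quot;content&quot;:{&quot;target&quot;:1100}}"></live-inside>')

def parse_widget_attribute(page_html):
    """Return the :widget attribute decoded from HTML-escaped JSON, or None."""
    match = re.search(r':widget="([^"]*)"', page_html)
    if match is None:
        return None
    # Turn &quot; and &amp; back into plain characters, then parse the JSON
    return json.loads(html.unescape(match.group(1)))

widget = parse_widget_attribute(SAMPLE)
print(widget["title"], widget["content"]["target"])
```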

Yours,

A Beginner in Python. :)