How to extract specific HTML lines (with a flex container) using IronPython?

Time:04-19

I am using IronPython 2.7.9.0 on Grasshopper and Rhino to web scrape data from a specific widget on this link: https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en

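The code I am using is as follows:

import urllib

# Widget URL from the link above
url = "https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en"

# Download the raw HTML that the server returns (no JavaScript is executed)
web = urllib.urlopen(url)
html = web.read()
web.close()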

The html output contains all the HTML from this link except for the parts I need. When I inspect the page in Chrome, the elements I need have a "flex" badge next to them, as in the following image.

image that summarizes the issue I am facing

Anything nested under a line with a "flex" badge does not appear in the scraping result; it comes back as a blank line.

This is the output html I get:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Central Library - Duhig North &amp; Link</title>

    <meta charset="utf-8">
    <meta name="google" content="notranslate">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="csrf-token" content="">
    <link rel="stylesheet" href="/build/app.css?id=2fefc4f9faa59eebcb4b">
    <link rel="stylesheet" href="https://vemcount.app/fonts/hamburg_serial/stylesheet.css">

    <style>
        #embed, #main {
            height: 100vh;
        }

        .vue-grid-item {
            margin-bottom: 0px !important;
        }

        .powered_by {
            position: absolute;
            bottom: 0px;
            right: 0px;
            background-color: rgba(0, 0, 0, 0.18);
            color: #fff;
            padding: 2px 5px;
            font-size: 9px;
        }

        .powered_by:hover, .powered_by:link, .powered_by:visited {
            text-decoration: none;
            display: none;
        }

        .dashboard-widget .relative {
            overflow: hidden !important;
        }

        
    </style>

    <script>
        window.App = {"socketAppKey":"eJSkWUHWpwolvjVcT2ZxUJZXnDpxtRljdZl74fKr","socketCluster":null,"socketHost":"websocket.vemcount.com","socketPort":443,"socketSecurePort":443,"socketDisableStats":true,"socketEncrypted":true,"locale":"en","settings":[{"name":"type","value":"{\"count_in\":\"column\"}"},{"name":"period","value":"[\"yesterday\"]"},{"name":"period_step","value":"hour"},{"name":"hide_datalabel","value":"0"},{"name":"currency","value":"AUD"},{"name":"show_days","value":"[0,1,2,3,4,5,6]"},{"name":"show_months","value":"[1,2,3,4,5,6,7,8,9,10,11,12]"},{"name":"show_hours_from","value":"00:00"},{"name":"show_hours_to","value":"23:45"},{"name":"data_heatmap","value":"blue"},{"name":"weather_metrics","value":"0"},{"name":"first_day_of_week","value":"1"},{"name":"time_format24","value":"time_format24"},{"name":"date_time_format","value":"2"},{"name":"number_grouping","value":","},{"name":"number_decimal","value":"."},{"name":"opening_hours_overlap","value":"0"},{"name":"data_output","value":"count_in"}],"sound":null};
    </script>

    <script src="/build/lang/en.js?v=2022.04.4"></script>

</head>

<body >

<main id="main">
    <div id="embed" >
        
    <div  style="position: absolute;">

        
        
        
        
        
        
        
        
                    <live-inside :embedded="true" :widget="{&quot;id&quot;:81438,&quot;pane_id&quot;:4005,&quot;title&quot;:&quot;Central Library - Duhig North &amp; Link&quot;,&quot;description&quot;:&quot;Live occupancy \/ Seating capacity&quot;,&quot;x&quot;:0,&quot;y&quot;:0,&quot;w&quot;:2,&quot;h&quot;:1,&quot;bg_color&quot;:&quot;red&quot;,&quot;text_color&quot;:&quot;black&quot;,&quot;type&quot;:&quot;live-inside&quot;,&quot;secret&quot;:&quot;uOCRuLPangWo5fT&quot;,&quot;internal&quot;:&quot;VRg4JTIRrtJ7Pwg&quot;,&quot;embeddable&quot;:1,&quot;content&quot;:{&quot;target&quot;:1100,&quot;bidirectional&quot;:true,&quot;target_enable&quot;:true,&quot;prettify&quot;:false,&quot;target_type&quot;:&quot;donut&quot;,&quot;target_donut_hide_metric&quot;:false,&quot;target_donut_target_hide_label&quot;:false,&quot;target_visual_inside_text&quot;:null,&quot;target_visual_available_text&quot;:null,&quot;target_screen_ok_title&quot;:null,&quot;target_screen_ok_text&quot;:null,&quot;target_screen_ok_color&quot;:&quot;#38A169&quot;,&quot;target_screen_ok_image&quot;:-1,&quot;target_screen_warning_title&quot;:null,&quot;target_screen_warning_pe</live-inside>
            
            
                
        
        
        
                
    </div>

    </div>
</main>

<a title=" Vemco Group A/S "  target="_blank"
   href="http://vemcount.com">Powered by
    <b>vemcount.com</b>
</a>

<script src="/build/manifest.js?id=7f2e9aa3431c681a4683"></script>
<script src="/build/vendor.js?id=19867aae3b960cda7d79"></script>
<script src="/build/embed.js?id=2ff0173dd78c5c1f99c6"></script>

</body>
</html>

As you can see, some lines are missing: the ones that have a "flex" badge next to them in Chrome. (By the way, I have shortened the output above so that I don't reach the 30000 character limit.)

I am interested in the number 311, which changes every 2 seconds on the live page and can be found in the HTML inside this element:

<span>311</span>

Is there a way I can get this value, as well as any other value, using IronPython?

P.S. I am a noob in actual coding, so I might get some terminology wrong, but I have a fair background in visual scripting. Your help is much appreciated. Thanks.

Answer:

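In case you have the same question or are struggling with scraping dynamic web pages: the missing content is rendered in the browser by JavaScript, so urllib only ever sees the initial HTML. You have to use CPython and install a browser automation library such as Playwright or Selenium.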
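A minimal sketch of why the plain download cannot contain the number. The STATIC_HTML string is a hand-trimmed stand-in for the output shown in the question (an assumption, not the full page), and the two helper names are made up for illustration. It only shows that the static HTML still holds the unrendered <live-inside> placeholder and no <span> at all.

```python
import re

# Hand-trimmed stand-in for what urllib returns (assumption: shortened
# from the output in the question, not the complete page).
STATIC_HTML = '''
<main id="main">
    <div id="embed">
        <live-inside :embedded="true" :widget="{&quot;id&quot;:81438}"></live-inside>
    </div>
</main>
<script src="/build/embed.js?id=2ff0173dd78c5c1f99c6"></script>
'''


def find_span_values(html):
    """Return the text of every plain <span>...</span> in an HTML string."""
    return re.findall(r"<span>(.*?)</span>", html)


def has_unrendered_component(html, tag="live-inside"):
    """Return True if the custom tag that JavaScript should replace is still present."""
    return ("<" + tag) in html


print(find_span_values(STATIC_HTML))          # the static HTML has no spans
print(has_unrendered_component(STATIC_HTML))  # the placeholder is still there
```

If the second check is true and the first list is empty, the value only exists after the page's scripts have run, which is what a real browser (or a tool that drives one) provides.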

I used Playwright, which I found far more straightforward. Its inner_html() function reads the HTML of an element after the page has rendered, including the dynamic content under the "flex" elements. Here is the code for reference.

# Part of the help to write this script came from https://stackoverflow.com/questions/64303326/using-playwright-for-python-how-do-i-select-or-find-an-element

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # slow_mo slows down every Playwright operation by 1000 ms
    browser = p.chromium.launch(slow_mo=1000)

    page = browser.new_page()
    page.goto('https://vemcount.app/embed/widget/uOCRuLPangWo5fT')
    # Wait until the JavaScript-rendered element exists before reading it
    central = page.wait_for_selector("p.w-full span")
    print({'central': central.inner_html()})

    browser.close()

Next, I am trying to run the .py script from Grasshopper through a batch file and read its output from a TXT or CSV file within Grasshopper.
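The file hand-off could look roughly like the sketch below. It assumes the CPython script appends each reading to a CSV file and the Grasshopper side reads the last row. The function names and the file name vemcount_demo.csv are hypothetical. The newline="" argument is Python 3 syntax, so under IronPython 2.7 the reader would probably need to open the file in "rb" mode instead.

```python
import csv
import os
import tempfile
from datetime import datetime


def append_reading(csv_path, value):
    """Append one row (ISO timestamp, value) to the CSV file."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now().isoformat(), value])


def read_latest(csv_path):
    """Return the value column of the last row, or None if there is no data yet."""
    if not os.path.exists(csv_path):
        return None
    with open(csv_path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]
    return rows[-1][1] if rows else None


if __name__ == "__main__":
    # Hypothetical location; any path both programs can reach would do.
    path = os.path.join(tempfile.gettempdir(), "vemcount_demo.csv")
    append_reading(path, "311")  # stand-in for central.inner_html()
    print(read_latest(path))
```

Appending with a timestamp keeps a history of readings, and the reader only needs the last row to get the current value.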

If there is a better way I am more than happy to hear your suggestions.
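One possible alternative to the batch file and the intermediate file is to start the CPython script as a child process and capture what it prints. This is only a sketch: whether the subprocess module behaves this way inside Rhino's IronPython is an assumption, and run_and_capture is a made-up helper name. The demo command below stands in for something like a CPython interpreter path followed by the scraper script.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command (list of arguments) and return its stdout as stripped text."""
    output = subprocess.check_output(command)
    return output.decode("utf-8").strip()


# Stand-in for the real call, e.g. [path_to_cpython, "scrape_widget.py"]
# (both names are hypothetical). Here the child process just prints a number.
demo_command = [sys.executable, "-c", "print(311)"]
print(run_and_capture(demo_command))
```

With this approach the scraper script would print only the value, and the calling side would parse the returned string.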

Yours,

A Beginner in Python. :)
