Extract array values from JavaScript with Beautiful Soup

Time:12-16

I’m trying to build a scraper in Python that gets a variable from JavaScript code within the HTML of a webpage. This variable changes over time. Here is the JavaScript code; I need the first number of the yValues variable:

jQuery(document).ready(function() {
  var draw = true;
  
  if ("Biblioteca di Ingegneria" == "") {
    draw = false;
  }
  
  if (draw) {
    var yValues = [
        "28",
        "100"
      ];
    var Titolo = "Biblioteca di Ingegneria";
    var sottoTitolo = "Posti Totali: 128";
    var barColors = [
        "#167d21",
        "#ed2135"
      ];
    var xValues = [
        "Liberi (28)",
        "Occupati (100)"
      ];
    
    new Chart("InOutChart", {
      type: "pie",
      data: {
        labels: xValues,
        datasets: [
          {
            backgroundColor: barColors,
            data: yValues
          }
        ]
      },
      options: {
        plugins: {
          title: {
            display: true,
            text: Titolo,
            font: {
              size: 25,
              style: 'normal',
              lineHeight: 1.2
            },
            // padding: {
            //   top: 10,
            //   bottom: 30
            // }
          },
          subtitle: {
            display: true,
            text: sottoTitolo,
            font: {
              size: 20,
              style: 'normal',
              lineHeight: 1.2
            },
            padding: {
              bottom: 30
            }
          },
          legend: {
            display: true,
            position: "bottom",
            labels: {
              font: {
                size: 20,
                style: 'normal',
                lineHeight: 1.2
              }
            }
          }
        },
        responsive: true,
        maintainAspectRatio: false,
        scales: {
          yAxes: [
            {
              display: true,
              ticks: {
                beginAtZero: true
              }
            }
          ]
        }
      }
    });
  }
});

This is the best I could do:

from bs4 import BeautifulSoup
import requests

# Make a GET request to the URL of the web page.
base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)

# Parse the HTML content of the page.
soup = BeautifulSoup(response.text, "html.parser")

# Find all the `<script>` elements on the page.
scripts = soup.find_all("script")

# Get the 8th `<script>` element.
script8 = scripts[7]

# Transform the 8th `<script>` into a string.
script8_txt = "".join(script8)

# Get the useful substring from the 8th `<script>` (character positions found by hand).
useful_txt = script8_txt[248:251]

# Keep only the digits and convert them to an int.
pl = int("".join(filter(str.isdigit, useful_txt)))

print(pl)

This works, but only because I manually checked the character positions of the value I need. I want to parse the JavaScript code automatically to find the variable and get its value, because I plan to reuse this code on other similar webpages, where the position of the variable is different each time. One last detail: I want to run this Python code in an Alexa skill, so I'm not sure the Selenium package would work well there.

CodePudding user response:

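Try this. It takes the text of the 8th `<script>` element and splits it around the `var yValues = ` assignment, so the character positions no longer matter: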

import requests
from bs4 import BeautifulSoup

base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)

script = BeautifulSoup(response.text, "html.parser").find_all("script")[7].string
print(script.strip().split("var yValues = ")[1].split(";")[0])

Output (the numbers differ from the question because the page changes over time):

["30","99"]
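The snippet above still returns a string and still depends on the script being the 8th one. A minimal sketch of one way around both, assuming the array is written as a flat, JSON-compatible literal (double-quoted strings, no trailing comma) as in the question: search the text with a regular expression and parse the match with `json.loads`. The function name `extract_js_array` and the inline `SAMPLE_HTML` are made up for illustration; the sample stands in for `response.text`.

```python
import json
import re

# Hypothetical stand-in for response.text, trimmed from the page in the question.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
jQuery(document).ready(function() {
  var yValues = [
      "28",
      "100"
    ];
  var xValues = ["Liberi (28)", "Occupati (100)"];
});
</script>
</body></html>
"""


def extract_js_array(text, name):
    """Return the array assigned to `var <name>` as a Python list, or None.

    Assumes a flat array literal that is also valid JSON.
    """
    # Capture everything from "[" up to the first "]" after `var <name> =`.
    pattern = re.compile(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(\[.*?\])",
        re.DOTALL,  # let "." span the line breaks inside the array
    )
    match = pattern.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


values = extract_js_array(SAMPLE_HTML, "yValues")
free_seats = int(values[0])  # first element of yValues
print(free_seats)
```

In the scraper, the same function could be given `response.text` directly, or each non-empty `script.string` from `soup.find_all("script")` if you prefer to search only inside script elements. Nested arrays or single-quoted strings would need a different approach, since the pattern stops at the first `]` and `json.loads` rejects single quotes.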