I’m trying to build a scraper in Python that gets a variable from JavaScript code within the HTML of a webpage.
This variable changes over time.
Here is the JavaScript code; I need the first number in the yValues variable:
jQuery(document).ready(function() {
    var draw = true;
    if ("Biblioteca di Ingegneria" == "") {
        draw = false;
    }
    if (draw) {
        var yValues = [
            "28",
            "100"
        ];
        var Titolo = "Biblioteca di Ingegneria";
        var sottoTitolo = "Posti Totali: 128";
        var barColors = [
            "#167d21",
            "#ed2135"
        ];
        var xValues = [
            "Liberi (28)",
            "Occupati (100)"
        ];
        new Chart("InOutChart", {
            type: "pie",
            data: {
                labels: xValues,
                datasets: [
                    {
                        backgroundColor: barColors,
                        data: yValues
                    }
                ]
            },
            options: {
                plugins: {
                    title: {
                        display: true,
                        text: Titolo,
                        font: {
                            size: 25,
                            style: 'normal',
                            lineHeight: 1.2
                        },
                        // padding: {
                        //     top: 10,
                        //     bottom: 30
                        // }
                    },
                    subtitle: {
                        display: true,
                        text: sottoTitolo,
                        font: {
                            size: 20,
                            style: 'normal',
                            lineHeight: 1.2
                        },
                        padding: {
                            bottom: 30
                        }
                    },
                    legend: {
                        display: true,
                        position: "bottom",
                        labels: {
                            font: {
                                size: 20,
                                style: 'normal',
                                lineHeight: 1.2
                            }
                        }
                    }
                },
                responsive: true,
                maintainAspectRatio: false,
                scales: {
                    yAxes: [
                        {
                            display: true,
                            ticks: {
                                beginAtZero: true
                            }
                        }
                    ]
                }
            }
        });
    }
});
This is the best I could do:
from bs4 import BeautifulSoup
import requests
# Make a GET request to the URL of the web page.
base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)
# Parse the HTML content of the page.
soup = BeautifulSoup(response.text, "html.parser")
# Find all the `<script>` elements on the page.
scripts = soup.find_all("script")
# Get the 8th `<script>` element.
script8 = scripts[7]
# Transform the 8th `<script>` into a string.
script8_txt = "".join(script8)
# Get the relevant characters from the 8th `<script>` (position found by hand).
useful_txt = script8_txt[248:251]
# Keep only the digits and convert them to an int.
pl = int("".join(filter(str.isdigit, useful_txt)))
print(pl)
This works, but only because I manually checked the position of the characters I needed. I would like to parse the JavaScript code automatically, find the variable, and read its value, because I plan to reuse this code on other similar webpages where the position of the variable is different each time. One last detail: I want to run this Python code in an Alexa skill, so I am not sure whether the Selenium package will work well there.
CodePudding user response:
Try this:
import requests
from bs4 import BeautifulSoup
base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)
script = BeautifulSoup(response.text, "html.parser").find_all("script")[7].string
print(script.strip().split("var yValues = ")[1].split(";")[0])
Output:
["30","99"]
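The snippet above still depends on the script being the 8th `<script>` element and prints the whole array as a string. A possible variation, sketched below, searches the full page text for the variable by name and parses the array, so neither the script index nor the character offset matters. The helper name `extract_js_array` and the `SAMPLE_HTML` string are hypothetical and only for illustration; the sketch assumes the array is flat and written as a JSON-compatible literal (double-quoted strings, no trailing comma), as in the page shown in the question.

```python
import json
import re


def extract_js_array(html, var_name):
    """Return the array assigned to `var <var_name>` as a Python list.

    Hypothetical helper: assumes a flat, JSON-compatible array literal.
    """
    # Match `var NAME = [ ... ]`, allowing line breaks inside the brackets.
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])",
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"variable {var_name!r} not found")
    # The captured text is valid JSON here, so json.loads can parse it.
    return json.loads(match.group(1))


# Stand-in for response.text, trimmed from the page in the question.
SAMPLE_HTML = """
<script>
jQuery(document).ready(function() {
var yValues = [
"28",
"100"
];
var xValues = [
"Liberi (28)",
"Occupati (100)"
];
});
</script>
"""

values = extract_js_array(SAMPLE_HTML, "yValues")
first_value = int(values[0])
print(first_value)  # prints 28
```

With the real page, passing `response.text` from `requests.get(base_url)` in place of `SAMPLE_HTML` should work the same way, as long as the page keeps declaring the variable in this form. Since this uses only `requests` and the standard library, it does not need Selenium.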