I'm a Python newbie building a web scraper to get data from a site so I can buy electricity when it's cheapest. The problem is that the data I need is inside a script. Can I use Beautiful Soup to get it? I have tried a lot of different ways now and could really use some help. The page I want to scrape is https://www.elbruk.se/timpriser-se3-stockholm and the information I need is in the data list below.
const labels = [
'00:00','01:00','02:00','03:00','04:00','05:00','06:00','07:00','08:00','09:00','10:00','11:00','12:00','13:00','14:00','15:00','16:00','17:00','18:00','19:00','20:00','21:00','22:00','23:00','24:00',];
const data = {
labels: labels,
datasets: [{
stepped:true,
label: 'Idag',
backgroundColor: '#357DA7',
borderColor: '#357DA7',
data: [94.24,91.59,93.52,97.70,103.23,155.15,233.20,269.03,279.92,255.87,231.30,226.70,209.64,174.65,164.84,154.16,134.04,199.48,205.03,204.88,192.49,154.16,74.40,19.47,19.47]
},
(Row 494 in the page source.) Is it possible to extract it with Beautiful Soup, or am I at a dead end here? Parse it as JSON, maybe? There is no site with an API for this information either (that was my first hope).
CodePudding user response:
An easy (but not perfect) solution would be to iterate over all the scripts and find the one that contains "const labels =". After that you just have to trim off the text you don't want and parse the list.
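A minimal sketch of that idea, under some assumptions: the HTML below is a shortened stand-in for the real page, `ScriptCollector` and `extract_prices` are made-up names, and the standard library's `html.parser` is used so the example runs without extra packages. With Beautiful Soup the same loop would run over `soup.find_all("script")`.

```python
import json
import re
from html.parser import HTMLParser

# Shortened stand-in for the downloaded page (illustrative, not the real page).
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
const labels = [
'00:00','01:00','02:00',];
const data = {
labels: labels,
datasets: [{
stepped:true,
label: 'Idag',
data: [94.24,91.59,93.52]
},
]};
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_prices(html):
    """Return (labels, prices) from the first script containing 'const labels ='."""
    collector = ScriptCollector()
    collector.feed(html)
    # With Beautiful Soup, the same loop would run over soup.find_all("script").
    for text in collector.scripts:
        if "const labels =" not in text:
            continue
        # Trim the script down to the two arrays of interest.
        labels_match = re.search(r"const labels = \[(.*?)\]", text, re.DOTALL)
        prices_match = re.search(r"data:\s*\[([\d.,\s-]+)\]", text)
        if labels_match and prices_match:
            labels = re.findall(r"'([^']*)'", labels_match.group(1))
            prices = json.loads("[" + prices_match.group(1) + "]")
            return labels, prices
    return None


labels, prices = extract_prices(SAMPLE_HTML)
print(labels)
print(prices)
```

For the real page, the downloaded HTML (for example `requests.get(url).text`) would take the place of `SAMPLE_HTML`. The regexes assume the script looks like the excerpt in the question.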
CodePudding user response:
BeautifulSoup is not required, because in the end you will need a lot of regex replacements anyway: the script content is not valid JSON.
import requests
import re
import json
url = "https://www.elbruk.se/timpriser-se3-stockholm" # page from the question
response = requests.get(url)
# capture the object literal assigned to `data`, up to the first semicolon
data = re.search(r'data\s=\s(\{[^;]+)', response.text)
data = data[1].replace("'", '"') # 'Idag' -> "Idag"
data = data.replace(",]", ']') # ,] -> ]
data = re.sub(r"(\w+):", r'"\1":', data) # labels: labels -> "labels": labels
data = re.sub(r":\s?(\w+)", r':"\1"', data) # "labels": labels -> "labels":"labels"
data = json.loads(data)
print(data['datasets'][0]['backgroundColor'])
# print(json.dumps(data, indent=2))
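Once the data is parsed, picking the cheapest hour is a short step. The sketch below assumes a dict shaped like the result above, with the prices copied from the question. `cheapest_hour` is a made-up helper name, and the hour labels are rebuilt because the cleanup turns `labels: labels` into a plain string.

```python
# Stand-in for the parsed result; prices are the values shown in the question.
parsed = {
    "labels": "labels",
    "datasets": [{
        "label": "Idag",
        "data": [94.24, 91.59, 93.52, 97.70, 103.23, 155.15, 233.20, 269.03,
                 279.92, 255.87, 231.30, 226.70, 209.64, 174.65, 164.84, 154.16,
                 134.04, 199.48, 205.03, 204.88, 192.49, 154.16, 74.40, 19.47,
                 19.47],
    }],
}


def cheapest_hour(parsed_data):
    """Return the (hour, price) pair with the lowest price in the first dataset."""
    prices = parsed_data["datasets"][0]["data"]
    # Rebuild '00:00', '01:00', ... to match the labels array on the page.
    hours = [f"{h:02d}:00" for h in range(len(prices))]
    # min() returns the first pair when several hours share the lowest price.
    return min(zip(hours, prices), key=lambda pair: pair[1])


print(cheapest_hour(parsed))  # ('23:00', 19.47)
```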
CodePudding user response:
Just do this: use Python to download the page source, match it against the regex below, and take the first match it finds.
/^const labels(.*)const config = {type: 'line',data: data,options: {}};/gmis
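In Python, the `m`, `i` and `s` flags map to `re.MULTILINE`, `re.IGNORECASE` and `re.DOTALL`, and `re.search` already returns only the first match. A rough sketch follows, assuming the page really contains the `const config` line exactly as written in the regex (not verified here). `extract_chart_block` is a made-up name, and the capture group is made non-greedy so it stops at the first occurrence of the tail.

```python
import re

# Literal tail copied from the regex above; whether the page contains exactly
# this text is an assumption.
TAIL = "const config = {type: 'line',data: data,options: {}};"

PATTERN = re.compile(
    r"^const labels(.*?)" + re.escape(TAIL),
    re.MULTILINE | re.IGNORECASE | re.DOTALL,
)


def extract_chart_block(source):
    """Return the text between 'const labels' and the config line, or None."""
    match = PATTERN.search(source)
    return match.group(1) if match else None


# Shortened stand-in for the page source.
sample = (
    "<script>\n"
    "const labels = ['00:00','01:00',];\n"
    "const data = {labels: labels};\n"
    "const config = {type: 'line',data: data,options: {}};\n"
    "</script>"
)
print(extract_chart_block(sample))
```

The captured block still needs the same kind of cleanup as in the previous answer before it can be parsed.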