Web scraping data from a script with Beautiful Soup

Time:04-26

I'm new to Python and am building a web scraper to get data from a site so I can buy electricity when it's cheapest. The problem is that the data I need is inside a script. Can I use Beautiful Soup to get it? I have tried a lot of different ways and could really use some help. The page I want to scrape is https://www.elbruk.se/timpriser-se3-stockholm and the information I need is in the data list below.

const labels = [
'00:00','01:00','02:00','03:00','04:00','05:00','06:00','07:00','08:00','09:00','10:00','11:00','12:00','13:00','14:00','15:00','16:00','17:00','18:00','19:00','20:00','21:00','22:00','23:00','24:00',];
const data = {
    labels: labels,
    datasets: [{
        stepped:true,
        label: 'Idag',
        backgroundColor: '#357DA7',
        borderColor: '#357DA7',
        data: [94.24,91.59,93.52,97.70,103.23,155.15,233.20,269.03,279.92,255.87,231.30,226.70,209.64,174.65,164.84,154.16,134.04,199.48,205.03,204.88,192.49,154.16,74.40,19.47,19.47]
    },

(Row 494 in the page source.) Is it possible to extract it with Beautiful Soup, or am I at a dead end here? Could I parse it as JSON, maybe? There is no site with an API for this information either (that was my first hope).

CodePudding user response:

An easy (but not perfect) solution is to iterate over all the script tags and find the one that contains "const labels =". After that, you just have to trim off the text you don't want and parse the list.
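A minimal sketch of that idea, run against a made-up HTML snippet rather than the live page. To keep it free of extra installs it collects the script tags with the standard library's html.parser instead of Beautiful Soup; with Beautiful Soup the loop would go over soup.find_all("script") instead. The names ScriptCollector, extract_prices and sample_html are hypothetical, and the code assumes the first "data: [...]" list in the matching script is the one you want.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def extract_prices(html):
    """Return the first `data: [...]` list from the script defining `const labels =`."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for script in collector.scripts:
        if "const labels =" not in script:
            continue
        # A plain list of numbers is already valid JSON, so no cleanup is needed.
        match = re.search(r"data:\s*(\[[^\]]*\])", script)
        if match:
            return json.loads(match.group(1))
    return None


# Made-up stand-in for the real page, shaped like the snippet in the question.
sample_html = """
<html><body>
<script>console.log('unrelated');</script>
<script>
const labels = ['00:00','01:00','02:00',];
const data = {
    labels: labels,
    datasets: [{
        label: 'Idag',
        data: [94.24,91.59,93.52]
    },
    ]
};
</script>
</body></html>
"""

prices = extract_prices(sample_html)
print(prices)
```

For the real page you would pass in the downloaded page source (for example response.text from requests) instead of sample_html.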

CodePudding user response:

BeautifulSoup is not required, because in the end you will need a lot of regex replacements anyway: the script content is a JavaScript object literal, not valid JSON.

import requests
import re
import json

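url = "https://www.elbruk.se/timpriser-se3-stockholm" # page from the question
response = requests.get(url)
# grab the object literal assigned to `data`, up to the first semicolon
data = re.search(r'data\s=\s(\{[^;]+)', response.text)
data = data[1].replace("'", '"') # 'Idag' -> "Idag"
data = data.replace(",]", ']') # ,] -> ]
data = re.sub(r"(\w+):", r'"\1":', data) # labels: labels -> "labels": labels
data = re.sub(r":\s?(\w+)", r':"\1"', data) # "labels": labels -> "labels":"labels"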
data = json.loads(data)

print(data['datasets'][0]['backgroundColor'])

# print(json.dumps(data, indent=2))
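To try the replacement steps without hitting the site, here is a rough offline sketch that runs them (with + quantifiers in the patterns) on a shortened copy of the snippet from the question and then picks the cheapest hour. The function name js_data_to_dict and the sample string are made up for illustration, and the hour label assumes the prices are listed hourly starting at 00:00, as the labels in the question suggest.

```python
import json
import re


def js_data_to_dict(page_source):
    """Turn the `const data = {...};` object literal into a Python dict."""
    match = re.search(r"data\s=\s(\{[^;]+)", page_source)
    if match is None:
        raise ValueError("no `data = {...}` assignment found")
    text = match[1].replace("'", '"')            # 'Idag' -> "Idag"
    text = text.replace(",]", "]")               # drop a trailing comma before ]
    text = re.sub(r"(\w+):", r'"\1":', text)     # quote the keys
    text = re.sub(r":\s?(\w+)", r':"\1"', text)  # quote bare word values
    return json.loads(text)


# Shortened, made-up stand-in for the script on the page.
sample_source = """
const data = {
    labels: labels,
    datasets: [{
        stepped:true,
        label: 'Idag',
        backgroundColor: '#357DA7',
        borderColor: '#357DA7',
        data: [94.24,91.59,19.47]
    }]
};
"""

parsed = js_data_to_dict(sample_source)
prices = parsed["datasets"][0]["data"]

# Assumption: index 0 is 00:00, index 1 is 01:00, and so on.
cheapest_hour = min(range(len(prices)), key=prices.__getitem__)
cheapest_label = f"{cheapest_hour:02d}:00"
print(cheapest_label, prices[cheapest_hour])
```

Note that bare words such as true end up as the string "true" after the last substitution, which is fine if you only need the numbers.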

CodePudding user response:

Just do this:

Use Python to download the page source, then search it with the regex below and take the first match.

/^const labels(.*)const config = {type: 'line',data: data,options: {}};/gmis

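In Python, the g flag has no direct equivalent (re.search already returns the first match), and m, i, s map to re.M, re.I and re.S. A small sketch on a made-up string, assuming the page really does contain that exact "const config" line; the names TAIL, PATTERN, capture_chart_source and sample are hypothetical.

```python
import re

# Literal end marker taken from the regex above; escaped so that the braces
# and other special characters are matched as plain text.
TAIL = "const config = {type: 'line',data: data,options: {}};"
PATTERN = re.compile(r"^const labels(.*)" + re.escape(TAIL), re.M | re.I | re.S)


def capture_chart_source(page_source):
    """Return the text between `const labels` and the config line, or None."""
    match = PATTERN.search(page_source)
    return match.group(1) if match else None


# Made-up stand-in for the downloaded page source.
sample = (
    "var x = 1;\n"
    "const labels = ['00:00','01:00',];\n"
    "const data = {labels: labels, datasets: [{data: [94.24,91.59]}]};\n"
    "const config = {type: 'line',data: data,options: {}};\n"
)

captured = capture_chart_source(sample)
print(captured)
```

The captured text still needs the same trimming and parsing as in the other answers before you get a usable list of prices.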
