Handle content from a <script> tag in python

Time:11-11

I am currently trying to read out the locations of a company. The location data is stored as JSON inside a script tag, so I read out the content of the corresponding script tag.

This is my code:

import requests
from bs4 import BeautifulSoup

sauce = requests.get('https://www.ep.de/store-finder', verify=False, headers={'User-Agent': 'Mozilla/5.0'})
soup1 = BeautifulSoup(sauce.text, features="html.parser")
all_scripts = soup1.find_all('script')[6]
all_scripts.contents

The output is:

['\n\t\twindow.storeFinderComponent = {"center":{"lat":51.165691,"long":10.451526},"bounds":[[55.655085,5.160441],[46.439648,15.666775]],"stores":[{"code":"1238240","lat":51.411572,"long":10.425264,"name":"EP:Schulze","url":"/schulze-breitenworbis","showAsClosed":false,"isBusinessCard":false,"logoUrl":"https://cdn.prod.team-ec.com/logo/retailer/retailerlogo_epde_1238240.png","address":{"street":"Weststraße 6","zip":"37339","town":"Breitenworbis","phone":" 49 (36074) 31193"},"email":"[email protected]","openingHours":[{"day":"Mo.","openingTime":"09:00","closingTime":"18:00","startPauseTime":"13:00","endPauseTime":"14:30"},{"day":"Di.","openingTime":"09:00","closingTime":"18:00","startPauseTime":"13:00","endPauseTime":"14:30"},{"day":"Mi.","openingTime":"09:00","closingTime":"18:00","startPauseTime":"13:00","endPauseTime":"14:30"},...]

I have problems converting the content to a dictionary and reading all lat and long data.

When I try:

data = json.loads(all_scripts.get_text())

all_scripts.get_text() returns an empty list.

So I tried:

data = json.loads(all_scripts.contents)

But then I get a TypeError: the JSON object must be str, bytes or bytearray, not list

I don't know how to convert the result of .contents to JSON. Wrapping it in str() fails as well:

data = json.loads(str(all_scripts.contents))

JSONDecodeError: Expecting value: line 1 column 2 (char 1)

Can anyone help me?

CodePudding user response:

You could use a regex to pull the JSON out of the page source and then parse it.

import requests
import re
import json

html = requests.get('https://www.ep.de/store-finder', verify=False, headers = {'User-Agent':'Mozilla/5.0'}).text

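# raw string so that \. is passed to the regex engine unchanged
pattern = re.compile(r'window\.storeFinderComponent = ({.*})')
# group(1) is the captured JSON object
result = pattern.search(html).group(1)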

jsonData = json.loads(result)
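
To get from the parsed dictionary to the lat and long values asked about in the question, something like the following sketch should work. It runs the same pattern against a shortened, made-up stand-in for the page source (sample_html is an assumption for illustration, including the trailing semicolon, and is not the real page), then loops over the "stores" list.

```python
import json
import re

# Hypothetical, shortened stand-in for the page source; the real page is much longer.
sample_html = (
    '<script>\n\t\twindow.storeFinderComponent = '
    '{"center":{"lat":51.165691,"long":10.451526},'
    '"stores":[{"code":"1238240","lat":51.411572,"long":10.425264,'
    '"name":"EP:Schulze"}]};\n\t</script>'
)

# Greedy match from the first "{" to the last "}" on the same line.
pattern = re.compile(r'window\.storeFinderComponent = ({.*})')
match = pattern.search(sample_html)
data = json.loads(match.group(1))

# Collect one (name, lat, long) tuple per store.
coordinates = [(s["name"], s["lat"], s["long"]) for s in data["stores"]]
print(coordinates)
```

With the real page, data["stores"] would simply contain more entries; the loop stays the same.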

CodePudding user response:

You can remove the first part of the data (the JavaScript assignment) and the trailing characters at the end, and then load the rest as JSON:

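import json

# all_scripts.contents is a list holding one string; take that string
data = all_scripts.contents[0]
# strip the JavaScript assignment in front of the JSON
removed_data = data.replace("\n\t\twindow.storeFinderComponent = ", "")
# drop the trailing characters after the closing brace (three here)
clean_data = removed_data[:-3]
json_data = json.loads(clean_data)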

Output:

{'center': {'lat': 51.165691, 'long': 10.451526},
 'bounds': [[55.655085, 5.160441], [46.439648, 15.666775]],
 'stores': [{'code': '1238240',
   'lat': 51.411572,
    ....
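
The slicing above depends on exactly how many characters follow the JSON. As a sketch of a variant that avoids counting them, json.JSONDecoder().raw_decode parses a single JSON value starting at a given index and reports where it ended, so whatever comes after the closing brace is ignored. The script_text below is a shortened, hypothetical copy of the script content, and its trailing ";\n\t" is an assumption.

```python
import json

# Hypothetical script content, shortened from the output in the question.
# The trailing ";\n\t" is assumed, not taken from the real page.
script_text = (
    '\n\t\twindow.storeFinderComponent = '
    '{"center":{"lat":51.165691,"long":10.451526},'
    '"stores":[{"code":"1238240","lat":51.411572,"long":10.425264}]};\n\t'
)

# Find where the JSON starts: right after the JavaScript assignment.
marker = "window.storeFinderComponent = "
start = script_text.index(marker) + len(marker)

# raw_decode returns the parsed value and the index just past its end.
data, end = json.JSONDecoder().raw_decode(script_text, start)

# One (lat, long) pair per store.
lat_long = [(store["lat"], store["long"]) for store in data["stores"]]
print(lat_long)
```

In the real script, all_scripts.contents[0] would take the place of script_text.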