Home > OS >  How to extract this dictionary using Beautiful Soup
How to extract this dictionary using Beautiful Soup

Time:12-14

I want to get to variable last element of dictionary (pasted below), it's in another dictionary "offers", and i have no clue how to extract it.

    html = s.get(url=url, headers=headers, verify=False, timeout=15)
    soup = BeautifulSoup(html.text, 'html.parser')
    products = soup.find_all('script', {'type': "application/ld json"})
{"@context":"http://schema.org","@type":"Product","aggregateRating":{"@type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"@type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"} (...)

CodePudding user response:

As mentioned decode the via BeautifulSoup extracted string with json.loads():

import json

products = '{"@context":"http://schema.org","@type":"Product","aggregateRating":{"@type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"@type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"}]}'

products = json.loads(products)

To get the last element (dict) in offers:

products['offers'][-1]

Output:

{'@type': 'Offer',
 'availability': 'http://schema.org/OutOfStock',
 'price': '489',
 'priceCurrency': 'PLN',
 'sku': 'NE215O06U-A110002000',
 'url': '/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html'}

Example

In your special case you also have to replace('"','"') first:

from bs4 import BeautifulSoup
import requests, json
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae", 
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
html = requests.get('https://www.zalando.de/new-balance-550-unisex-sneaker-low-whitered-ne215o06u-a11.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')

jsonData = json.loads(soup.select_one('script[type="application/ld json"]').text.replace('"','"'))

jsonData['offers'][-1]
  • Related