How can I get the JSON out of a webpage?

So I'm trying to parse data out of this webpage

But I don't need the whole dataset, I just need:

The operator name (Google, CloudFlare, etc.)
The description (Google 'Argon2022' log, Google 'Argon2023' log, etc.)
The log IDs (KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=)

I tried to write some code, but I'm just a beginner at web scraping, so I was wondering if anyone could help. Here is my attempt, using the lxml and requests libraries.

import requests
from lxml import html

page = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json')
tree = html.fromstring(page.content)

#This will create a list of operators:
operators = tree.xpath('//span[@]/text()')

print('Operators: ',operators)

My hope is to end up with something that looks like the JSON on the website, minus all the unneeded info, so just the operators:

[
  { "name": "Google",
    "logs": [
      { "description": "Google Argon2022 log",
        "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=" },
      { "description": "Google Argon2023 log",
        "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=" }
    ]
  },
  ....
  { "name": "CloudFlare",
    "logs": [ ... ]
  }
]

CodePudding user response：

First, you want to fetch the raw file rather than the HTML UI. As Kache mentioned, the ?format=TEXT variant of the URL returns the file contents base64-encoded, so you can get the JSON using:

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
obj = json.loads(base64.decodebytes(resp.text.encode()))
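
To make the decoding step concrete, here is a minimal offline sketch. The `sample` dict and `encoded_text` are made-up stand-ins for the real `resp.text`, on the assumption that the body arrives base64-encoded, as the code above implies:

```python
import base64
import json

# Hypothetical stand-in for resp.text: a tiny JSON document, base64-encoded
# the way the code above assumes the ?format=TEXT response to be.
sample = {"operators": [{"name": "Google", "logs": []}]}
encoded_text = base64.encodebytes(json.dumps(sample).encode()).decode()

# Same decoding chain as above: str -> bytes -> base64-decoded bytes -> dict.
obj = json.loads(base64.decodebytes(encoded_text.encode()))
print(obj["operators"][0]["name"])
```

Once decoded, `obj` is an ordinary Python dict, so no HTML parsing is involved at any point.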

Then, you can use the following script to extract only the data you want:

import requests
import json
import base64

def extract_log(log):
    keys = [ 'description', 'log_id' ]
    return { key: log[key] for key in keys }

def extract_logs(logs):
    return [ extract_log(log) for log in logs ]

def extract_operator(operator):
    return {
        'name': operator['name'],
        'logs': extract_logs(operator['logs'])
    }

def extract_certificates(obj):
    return [ extract_operator(operator) for operator in obj['operators'] ]

def scrape_certificates(url):
    resp = requests.get(url)
    obj = json.loads(base64.decodebytes(resp.text.encode()))
    return extract_certificates(obj)

def main():
    out = scrape_certificates('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
    print(json.dumps(out, indent=4))

if __name__ == '__main__':
    main()
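
If you want the download to fail loudly on a bad response, and the extraction to be testable without network access, one possible variation is sketched below. The function names `decode_body`, `trim_operators`, and `fetch_operators` are hypothetical, and the sketch swaps `requests` for the standard library's `urllib.request` so it has no third-party dependency. It also assumes every log carries `description` and `log_id` keys:

```python
import base64
import json
from urllib.request import urlopen


def decode_body(body):
    """Turn a base64-encoded response body (str or bytes) into a Python object."""
    return json.loads(base64.b64decode(body))


def trim_operators(obj):
    """Keep only each operator's name and each log's description and log_id."""
    return [
        {
            "name": operator["name"],
            "logs": [
                {"description": log["description"], "log_id": log["log_id"]}
                for log in operator.get("logs", [])
            ],
        }
        for operator in obj.get("operators", [])
    ]


def fetch_operators(url, timeout=10):
    """Download the base64-encoded file, then decode and trim it."""
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # so an error page is never passed on to the decoder.
    with urlopen(url, timeout=timeout) as resp:
        return trim_operators(decode_body(resp.read()))
```

Keeping `trim_operators` free of any network code means you can check it against a small hand-written dict before pointing `fetch_operators` at the real URL.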

CodePudding user response：

There is a link at the bottom right of the page that lets you download the file directly: https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=JSON

This lets you avoid HTML parsing altogether.

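Here's Python code to extract it as a dict, using the ?format=TEXT variant of that URL, whose response is base64-encoded (requests, json, and base64 need to be imported):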

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
js = json.loads(base64.decodebytes(resp.text.encode()))

What remains of your question involves JSON and dict traversal and basic coding, for which you should be able to find answers in other questions.
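
As a rough illustration of the kind of dict traversal meant here, the sketch below walks a small hand-written sample and builds a lookup from log ID to operator name and description. The `data` dict and its placeholder IDs are made up for the example and only mimic the shape described in the question:

```python
# Hypothetical sample in the shape described in the question; the IDs,
# descriptions, and extra "url" key are placeholders, not real log data.
data = {
    "operators": [
        {
            "name": "Google",
            "logs": [
                {"description": "first log", "log_id": "id-1", "url": "..."},
                {"description": "second log", "log_id": "id-2", "url": "..."},
            ],
        },
        {"name": "CloudFlare", "logs": []},
    ]
}

# Walk operators, then each operator's logs, keeping only the wanted fields.
by_log_id = {
    log["log_id"]: (operator["name"], log["description"])
    for operator in data["operators"]
    for log in operator["logs"]
}
print(by_log_id)
```

The same two nested loops work on the decoded `js` dict, provided its keys match the ones assumed here.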