I'm trying to parse data out of this webpage, but I don't need the whole dataset. I just need:
- The operator name (Google, CloudFlare, etc.)
- The description (Google 'Argon2022' log, Google 'Argon2023' log, etc.)
- The log IDs (KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=, etc.)
I tried to write some code, but I'm a beginner at web scraping, so I was wondering if anyone could help. Here is my attempt, using the requests and lxml libraries:
import requests
from lxml import html
page = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json')
tree = html.fromstring(page.content)
#This will create a list of operators:
operators = tree.xpath('//span[@]/text()')
print('Operators: ',operators)
My hope is to end up with something that looks like the JSON on the website, minus all the unneeded info, so operators:
[
    { "name": "Google",
      "logs": [
          { "description": "Google Argon2022 log",
            "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=" },
          { "description": "Google Argon2023 log",
            "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=" }
      ]
    },
    ....
    { "name": "CloudFlare",
      "logs": [ ... ]
    }
]
CodePudding user response:
First, you want to fetch the raw file, not the HTML page that displays it. As Kache mentioned, adding ?format=TEXT to the URL returns the file contents base64-encoded, so you decode the response before parsing it as JSON:
resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
obj = json.loads(base64.decodebytes(resp.text.encode()))
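To see what the decode step does without touching the network, here is a minimal sketch. The helper name decode_gitiles_text and the sample payload are made up for illustration; the payload just stands in for resp.text.

```python
import base64
import json


def decode_gitiles_text(payload: str) -> dict:
    """Decode a base64-encoded JSON document into a Python dict."""
    raw = base64.decodebytes(payload.encode())
    return json.loads(raw)


# Hypothetical stand-in for resp.text: a tiny JSON document, base64-encoded.
sample = {"operators": [{"name": "Example Operator", "logs": []}]}
payload = base64.b64encode(json.dumps(sample).encode()).decode()

decoded = decode_gitiles_text(payload)
print(decoded["operators"][0]["name"])
```

If the response were plain JSON rather than base64, json.loads(resp.text) alone would be enough.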
Then, you can use the following script to extract only the data you want:
import base64
import json

import requests

URL = 'https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT'


def extract_log(log):
    # Keep only the fields we care about from a single log entry.
    keys = ['description', 'log_id']
    return {key: log[key] for key in keys}


def extract_logs(logs):
    return [extract_log(log) for log in logs]


def extract_operator(operator):
    return {
        'name': operator['name'],
        'logs': extract_logs(operator['logs']),
    }


def extract_certificates(obj):
    return [extract_operator(operator) for operator in obj['operators']]


def scrape_certificates(url):
    resp = requests.get(url)
    resp.raise_for_status()  # fail early on HTTP errors
    # ?format=TEXT returns the file base64-encoded, so decode before parsing.
    obj = json.loads(base64.decodebytes(resp.text.encode()))
    return extract_certificates(obj)


def main():
    out = scrape_certificates(URL)
    print(json.dumps(out, indent=4))


if __name__ == '__main__':
    main()
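If you want to check the trimming logic before running it against the real file, the same traversal can be tried on a small hand-written dict. This is only a sketch: trim_log_list is a hypothetical name, and the sample data (including the extra "version", "email", and "url" fields) is invented to mimic the shape described in the question.

```python
import json


def trim_log_list(obj: dict) -> list:
    """Keep only the operator name plus each log's description and log_id."""
    return [
        {
            "name": operator["name"],
            "logs": [
                {"description": log["description"], "log_id": log["log_id"]}
                for log in operator.get("logs", [])
            ],
        }
        for operator in obj.get("operators", [])
    ]


# Made-up sample data; the real file has more operators and more fields.
sample = {
    "version": "0.0",
    "operators": [
        {
            "name": "Example Operator",
            "email": ["ops@example.com"],
            "logs": [
                {"description": "Example log A", "log_id": "AAAA", "url": "https://a.example/"},
                {"description": "Example log B", "log_id": "BBBB", "url": "https://b.example/"},
            ],
        }
    ],
}

trimmed = trim_log_list(sample)
print(json.dumps(trimmed, indent=4))
```

Using .get with a default means an operator without a "logs" key produces an empty list instead of raising KeyError.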
CodePudding user response:
There is a link at the bottom right of the page that lets you download the file directly, which avoids HTML parsing altogether: https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=JSON
Here's Python code to extract it as a dict (with requests, json, and base64 imported):
resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
js = json.loads(base64.decodebytes(resp.text.encode()))
What remains of your question is JSON/dict traversal and basic coding, which is already covered by answers to other questions.
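As a rough idea of what that traversal looks like, here is a small sketch that walks a list of operators and yields one row per log. The function name iter_logs and the sample data are invented for illustration; only the name, logs, description, and log_id keys come from the question.

```python
def iter_logs(operators):
    """Yield (operator name, log description, log_id) for every log."""
    for operator in operators:
        for log in operator["logs"]:
            yield operator["name"], log["description"], log["log_id"]


# Made-up sample in the shape the question asks for.
operators = [
    {
        "name": "Operator One",
        "logs": [
            {"description": "Log A", "log_id": "AAAA"},
            {"description": "Log B", "log_id": "BBBB"},
        ],
    },
    {"name": "Operator Two", "logs": [{"description": "Log C", "log_id": "CCCC"}]},
]

rows = list(iter_logs(operators))
for name, description, log_id in rows:
    print(name, description, log_id, sep=" | ")
```

Each level of nesting in the JSON becomes one loop: the outer loop walks operators, the inner loop walks that operator's logs.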