Pulling dictionaries from html file


I'm working with this URL:


So far I've done this to isolate the text I want:

soup = BeautifulSoup(data.text, 'html.parser')
all_scripts = soup.find_all('script')
ftext = all_scripts[5]

Here's an example of what I want to pull.

    {
        "carNumber": "10",
        "driver": {
            "name": "Pierre Gasly",
            "uuid": "pierre-gasly",
            "code": "GAS",
            "picture": "https:\u002F\u002Fcontent.motorsportstats.com\u002FdriverProfilePicture\u002FdriverProfilePicture-7c13a1b1-029d-4d59-80ad-861a17b16872.jpg",
            "type": "Driver"
        },
        "laps": [
            {"lap": 1, "time": 86902},
            {"lap": 2, "time": 78737},
            {"lap": 3, "time": 77821},
            ... same for all other laps...
            {"lap": 78, "time": 76120}
        ]
    }

Then I just have to repeat 19 times. I figured this is too big for regex and don't really know what else I can use to pull this out.

CodePudding user response:

You can use find_all() on specific attributes, like find_all(class_="car") or find_all(id="driver"). You will have to inspect the HTML to find the exact names. You can also use a regex to match attribute values, like this:

import re
from bs4 import BeautifulSoup

def ElementsWithRegexByClass(soup: BeautifulSoup, class_string: str):
    return soup.find_all(class_=re.compile(class_string))  # class matches the regex

and call it like this:

class_string = "driver"
class_elements = ElementsWithRegexByClass(soup, class_string)

You can then use class_elements to build a list or similar, and in the end combine all the attributes you want into a JSON file. I definitely recommend looking through the BeautifulSoup documentation.

I get the soup content like this:

import requests

def GetMeTheSoup(url):
    page = requests.get(url)
    return BeautifulSoup(page.content, "html.parser")

so you can assign soup = GetMeTheSoup(your_url), where your_url is the page you are scraping.
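If the lap data really does sit inside one of the script tags as JSON, as the question suggests, another option is to let the json module parse it rather than matching HTML attributes. This is only a minimal sketch: it assumes the script text contains a JSON array somewhere after some JavaScript, and both the script_text sample and the extract_json helper are made up for illustration.

```python
import json

def extract_json(text, start_token="["):
    """Decode the first JSON value in text that begins at start_token.

    Tries each occurrence of start_token in turn until one decodes.
    """
    decoder = json.JSONDecoder()
    pos = text.find(start_token)
    while pos != -1:
        try:
            value, _end = decoder.raw_decode(text, pos)
            return value
        except json.JSONDecodeError:
            pos = text.find(start_token, pos + 1)
    raise ValueError("no JSON value found")

# Hypothetical script content; the real page will differ.
script_text = (
    'window.__DATA__ = [{"carNumber": "10", '
    '"driver": {"name": "Pierre Gasly", "code": "GAS", '
    '"picture": "https:\\u002F\\u002Fcontent.motorsportstats.com"}, '
    '"laps": [{"lap": 1, "time": 86902}, {"lap": 2, "time": 78737}]}];'
)

cars = extract_json(script_text)
print(cars[0]["driver"]["name"])     # Pierre Gasly
print(cars[0]["driver"]["picture"])  # https://content.motorsportstats.com
```

The \u002F escapes from the question are ordinary JSON escapes, so they come out as plain slashes after decoding. With a real soup you would pass the tag's text (for example the .string of the script tag) instead of script_text.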

CodePudding user response:

It sounds like you're trying to send an HTTP GET request to that URL. I've taken the liberty of using Inspect Element (usually one of the function keys; F12 in Firefox, which is what I use) to verify that it returns JSON - which it does!

Specifically, the action you want to take, fact, is specified as a query parameter at the end of the URL: http://...?fact=LapTimes

Python has utilities for working with both HTTP requests and JSON - namely the requests package, which handles both for you. We'll start by importing it and setting up our parameters:

import requests
from requests.exceptions import HTTPError

query_field = "fact"
query_param = "LapTime"
your_url = "https://fiaresultsandstatistics.motorsportstats.com/results/2021-monaco-grand-prix/session-facts/0976b01f-e26a-420f-a6e9-3371897fc88b"

Now that we've got our parameters, we can create our request URL from them; it fits the format {your_url}?{query_field}={query_param}, which we can use verbatim in a Python f-string:

request_url = f"{your_url}?{query_field}={query_param}"
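The f-string works here because the value contains no special characters. As a hedged alternative, the standard library's urllib.parse.urlencode builds the same kind of query string and escapes anything that needs it. The base_url below is a shortened placeholder, not the real endpoint.

```python
from urllib.parse import urlencode

base_url = "https://example.com/session-facts/some-id"  # placeholder URL
query = {"fact": "LapTime"}

request_url = f"{base_url}?{urlencode(query)}"
print(request_url)  # https://example.com/session-facts/some-id?fact=LapTime

# Values with spaces or symbols are escaped automatically
print(urlencode({"fact": "Lap Time & Gap"}))  # fact=Lap+Time+%26+Gap
```

Passing the same dictionary as params= to requests.get has the same effect, since requests encodes it into the query string for you.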

We can leverage the requests package to do the HTTP request for us, check that the request was successful, and then parse the JSON result into Python objects:

lap_time_request = requests.get(url=request_url, params={}) # No extra GET parameters; the query is already in the URL

try:
    lap_time_request.raise_for_status() # Raises HTTPError for 4xx/5xx responses
    lap_times = lap_time_request.json()
except HTTPError as e: # If the request failed
    # handle errors here
    # maybe do something with invalid response?
    response = e.response
    print(f"Could not retrieve response {response.status_code=}")

# print(lap_times) => "[{"carNumber":"10","driver":{"name":"Pierre Gasly","uuid":"pierre-..."
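
Once lap_times holds the parsed JSON, it is an ordinary list of dictionaries, so there is no need to repeat anything 19 times by hand. Here is a small sketch of pulling out each driver's fastest lap. It assumes the structure shown in the question and assumes time is in milliseconds; the sample is trimmed from the question's data, and fastest_laps is an illustrative helper name.

```python
def fastest_laps(lap_times):
    """Map each driver name to (lap number, time in seconds) of their best lap."""
    result = {}
    for car in lap_times:
        best = min(car["laps"], key=lambda lap: lap["time"])
        # Assumption: "time" is in milliseconds
        result[car["driver"]["name"]] = (best["lap"], best["time"] / 1000)
    return result

# Trimmed sample in the shape shown in the question
sample = [
    {
        "carNumber": "10",
        "driver": {"name": "Pierre Gasly", "code": "GAS"},
        "laps": [
            {"lap": 1, "time": 86902},
            {"lap": 2, "time": 78737},
            {"lap": 3, "time": 77821},
        ],
    }
]

print(fastest_laps(sample))  # {'Pierre Gasly': (3, 77.821)}
```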


As an aside before I get to the solution, I was looking through the JavaScript and found this gem in there:

    ReactCurrentDispatcher: k,
    ReactCurrentBatchConfig: T,
    ReactCurrentOwner: C,
    IsSomeRendererActing: {
        current: !1
    },
    assign: n

So I guess somebody might be getting fired, unless it's a honeypot?

Actual Edit

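In this specific example, it looks like the page actually fetches the data from a different host (see the Host header below), and that request carries Origin and Referer headers. We can get around that by replaying the request headers the browser sends, which look something like this: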

GET /web/3.0.0/sessions/0976b01f-e26a-420f-a6e9-3371897fc88b/lapTimes HTTP/2
Host: fiaproxy.motorsportstats.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: application/json, text/plain, */*
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate, br
Origin: https://fiaresultsandstatistics.motorsportstats.com
DNT: 1
Connection: keep-alive
Referer: https://fiaresultsandstatistics.motorsportstats.com/
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-site
TE: trailers

If we save that to a file, request.txt, we can use the burpee library. It's one small file that you can just add to your project:

import burpee
from requests.exceptions import HTTPError

# gets data from file if you want to inspect it
# headers, data = burpee.parse_request("request.txt")
# burpee handles doing the get and returns a `requests.Response`
lap_time_response = burpee.request("request.txt", https=True)

try:
    lap_time_response.raise_for_status() # Raises HTTPError for 4xx/5xx responses
    lap_times = lap_time_response.json()
except HTTPError as e: # If the request failed
    # handle errors here
    # maybe do something with invalid response?
    response = e.response
    print(f"Could not retrieve response {response.status_code=}")

# print(lap_times) => "[{"carNumber":"10","driver":{"name":"Pierre Gasly","uuid":"pierre-..."
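
For reference, the idea behind replaying a saved request can be sketched with the standard library alone. This is not burpee's implementation, just an illustration under the assumption that the saved text is a request line followed by Name: value headers with no body. parse_raw_request is a made-up helper, and the header block is trimmed from the one above.

```python
def parse_raw_request(raw, https=True):
    """Split a raw, body-less HTTP request into (method, url, headers)."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    method, path, _version = lines[0].split()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in values stay intact
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    scheme = "https" if https else "http"
    url = f"{scheme}://{headers['Host']}{path}"
    return method, url, headers

raw_request = """\
GET /web/3.0.0/sessions/0976b01f-e26a-420f-a6e9-3371897fc88b/lapTimes HTTP/2
Host: fiaproxy.motorsportstats.com
Accept: application/json, text/plain, */*
Origin: https://fiaresultsandstatistics.motorsportstats.com
Referer: https://fiaresultsandstatistics.motorsportstats.com/
"""

method, url, headers = parse_raw_request(raw_request)
print(method, url)
# GET https://fiaproxy.motorsportstats.com/web/3.0.0/sessions/0976b01f-e26a-420f-a6e9-3371897fc88b/lapTimes
```

The resulting url and headers could then be handed to requests.get(url, headers=headers), followed by the same raise_for_status() and json() calls as before.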