I'm working with this URL:
https://fiaresultsandstatistics.motorsportstats.com/results/2021-monaco-grand-prix/session-facts/0976b01f-e26a-420f-a6e9-3371897fc88b?fact=LapTime
So far I've done this to isolate the text I want:
soup = BeautifulSoup(data.text, 'html.parser')
all_scripts = soup.find_all('script')
ftext = all_scripts[5]
Here's an example of what I want to pull.
[
{
"carNumber": "10", "driver":
{
"name": "Pierre Gasly",
"uuid": "pierre-gasly",
"code": "GAS",
"picture": "https:\u002F\u002Fcontent.motorsportstats.com\u002FdriverProfilePicture\u002FdriverProfilePicture-7c13a1b1-029d-4d59-80ad-861a17b16872.jpg",
"type": "Driver"
},
"laps": [
{"lap": 1, "time": 86902},
{"lap": 2, "time": 78737},
{"lap": 3, "time": 77821},
... same for all other laps...
{"lap": 78, "time": 76120}
]
}
]
Then I just have to repeat 19 times. I figured this is too big for regex and don't really know what else I can use to pull this out.
CodePudding user response:
You can use find_all() on specific attributes, like find_all(class_="car") or find_all(id="driver"). You will have to inspect the HTML to be specific. You can also use a regex to match class names, like this:
import re

def ElementsWithRegexByClass(soup: BeautifulSoup, class_string: str):
    return soup.find_all(class_=re.compile(class_string))
and call it like this:
class_string = "driver"
class_elements = ElementsWithRegexByClass(soup, class_string)
You can then use class_elements to build a list (or similar) and, in the end, combine all the attributes you want into a JSON file. I definitely recommend looking through the BeautifulSoup documentation.
I get the soup content like this:
def GetMeTheSoup(url):
    page = requests.get(url)
    return BeautifulSoup(page.content, "html.parser")
so you can assign soup = GetMeTheSoup(your_url), where your_url is the page URL.
CodePudding user response:
It sounds like you're trying to send an HTTP GET request to that URL. I've taken the liberty of using Inspect Element (usually one of the function keys; F12 on Firefox, which is what I use) to verify that it returns JSON, which it does!
Specifically, the action you want to take, fact, is specified as a query parameter at the end of the URL: http://...?fact=LapTime
Python has utilities for both making HTTP requests and decoding JSON, namely the requests package, which does both for you. We'll start by importing that and setting up our parameters:
import requests
from requests.exceptions import HTTPError
query_field = "fact"
query_param = "LapTime"
your_url = "https://fiaresultsandstatistics.motorsportstats.com/results/2021-monaco-grand-prix/session-facts/0976b01f-e26a-420f-a6e9-3371897fc88b"
Now that we've got our parameters, we can create our request URL from them; it fits the format {your_url}?{query_field}={query_param}, which we can use verbatim with a Python f-string:
request_url = f"{your_url}?{query_field}={query_param}"
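As a hedged alternative to formatting the query string by hand, the standard library can encode it, which matters once a value contains spaces or other reserved characters. This is only a sketch: build_url is a hypothetical helper name, not something from the original answer.

```python
from urllib.parse import urlencode

base_url = (
    "https://fiaresultsandstatistics.motorsportstats.com/results/"
    "2021-monaco-grand-prix/session-facts/0976b01f-e26a-420f-a6e9-3371897fc88b"
)


def build_url(base: str, query: dict) -> str:
    """Append a percent-encoded query string to base (hypothetical helper)."""
    if not query:
        return base
    return f"{base}?{urlencode(query)}"


request_url = build_url(base_url, {"fact": "LapTime"})
print(request_url)
```

Passing params={"fact": "LapTime"} to requests.get has the same effect, since requests encodes the dictionary into the query string for you.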
We can leverage the requests package to do the HTTP request for us, check that the request was successful, and decode the JSON response into Python objects:
lap_time_request = requests.get(url=request_url)  # The query string is already part of request_url
try:
    lap_time_request.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
    lap_times = lap_time_request.json()
except HTTPError as e:  # Raised for 4xx/5xx status codes
    # handle errors here
    # maybe do something with invalid response?
    response = e.response
    print(f"Could not retrieve response {response.status_code=}")
# print(lap_times) => "[{"carNumber":"10","driver":{"name":"Pierre Gasly","uuid":"pierre-..."
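Once lap_times holds the decoded list, it is ordinary Python data and needs no regex. The sketch below uses a trimmed sample in the shape shown in the question (values copied from it) and assumes that "time" is in milliseconds, which the site does not confirm; format_lap_time and fastest_laps are hypothetical helper names.

```python
# Trimmed sample in the shape shown in the question.
lap_times = [
    {
        "carNumber": "10",
        "driver": {"name": "Pierre Gasly", "code": "GAS"},
        "laps": [
            {"lap": 1, "time": 86902},
            {"lap": 2, "time": 78737},
            {"lap": 3, "time": 77821},
            {"lap": 78, "time": 76120},
        ],
    }
]


def format_lap_time(ms: int) -> str:
    """Format a duration as m:ss.mmm, assuming ms is milliseconds."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"


def fastest_laps(entries: list) -> dict:
    """Map each driver name to (lap number, formatted time) of their best lap."""
    result = {}
    for entry in entries:
        best = min(entry["laps"], key=lambda lap: lap["time"])
        result[entry["driver"]["name"]] = (best["lap"], format_lap_time(best["time"]))
    return result


print(fastest_laps(lap_times))
```

Because the response is one list with an entry per car, a single loop covers every driver; nothing has to be repeated by hand.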
EDIT
As an aside before I get to the solution, I was looking through the JavaScript and found this gem in there:
__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED: {
    ReactCurrentDispatcher: k,
    ReactCurrentBatchConfig: T,
    ReactCurrentOwner: C,
    IsSomeRendererActing: {
        current: !1
    },
    assign: n
}
So I guess somebody might be getting fired, unless it's a honeypot?
Actual Edit
In this specific example, it looks like the page does not serve the data itself; the browser requests it from a separate host. We can get around that by looking at the headers of that request, which look something like this:
GET /web/3.0.0/sessions/0976b01f-e26a-420f-a6e9-3371897fc88b/lapTimes HTTP/2
Host: fiaproxy.motorsportstats.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: application/json, text/plain, */*
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate, br
Origin: https://fiaresultsandstatistics.motorsportstats.com
DNT: 1
Connection: keep-alive
Referer: https://fiaresultsandstatistics.motorsportstats.com/
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-site
TE: trailers
If we save that to a file, say request.txt, we can use the burpee library; it's one small file that you can just add to your project:
import burpee
from requests.exceptions import HTTPError
# gets data from file if you want to inspect it
# headers, data = burpee.parse_request("request.txt")
# burpee handles doing the get and returns a `requests.Response`
lap_time_response = burpee.request("request.txt", https=True)
try:
    lap_time_response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
    lap_times = lap_time_response.json()
except HTTPError as e:  # Raised for 4xx/5xx status codes
    # handle errors here
    # maybe do something with invalid response?
    response = e.response
    print(f"Could not retrieve response {response.status_code=}")
# print(lap_times) => "[{"carNumber":"10","driver":{"name":"Pierre Gasly","uuid":"pierre-..."
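If adding a third-party file is not an option, the saved request can also be split into a URL and a headers dictionary by hand. This is a minimal sketch and not how burpee itself is implemented; parse_raw_request is a hypothetical helper, and the sample keeps only a few of the headers shown above.

```python
def parse_raw_request(raw: str, scheme: str = "https"):
    """Split a raw HTTP request into (method, url, headers)."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    # Request line, e.g. "GET /path HTTP/2"
    method, path, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    url = f"{scheme}://{headers['Host']}{path}"
    return method, url, headers


raw_request = """\
GET /web/3.0.0/sessions/0976b01f-e26a-420f-a6e9-3371897fc88b/lapTimes HTTP/2
Host: fiaproxy.motorsportstats.com
Accept: application/json, text/plain, */*
Origin: https://fiaresultsandstatistics.motorsportstats.com
Referer: https://fiaresultsandstatistics.motorsportstats.com/
"""

method, url, headers = parse_raw_request(raw_request)
print(method, url)
```

The result can then be sent with requests.request(method, url, headers=headers) and checked with raise_for_status() as before. Whether the server needs every header from the browser request is not something this sketch establishes.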