Scraping soraredata JSON with Python requests


Hello, I am trying to retrieve JSON from soraredata via this link, but it only returns page source code without the JSON. When I put the same link into a tool called Insomnia, I do get the JSON, so I think it should be possible with requests. Sorry for my English, I am using a translator.

edit: the link also seems to work without the "my_username" part: url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

I get a status code 403; what am I missing to get a 200?

Thank you

import requests
import json

headers = {
    "Host": "www.soraredata.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Referer": "https://www.soraredata.com/rankings",
}

# url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/{my_username}/0/sr_football"
url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

res = requests.get(url, headers=headers)
print(res.status_code)  # 403 instead of the expected 200
html = res.text
# json.loads(html) fails because the body is an HTML page, not JSON

print(html)

CodePudding user response:

Here is a solution I got to work.

import http.client
import json
import socket
import ssl

hostname = "www.soraredata.com"
path = "/api/stats/newFullRankings/all/false/all/7/0/sr_football"

# Build the raw HTTP/1.1 request by hand, matching the message urllib3 sends.
http_msg = (
    "GET {path} HTTP/1.1\r\n"
    "Host: {host}\r\n"
    "Accept-Encoding: identity\r\n"
    "User-Agent: python-urllib3/1.26.7\r\n"
    "\r\n"
).format(host=hostname, path=path).encode("utf-8")

# Open a TCP connection, wrap it in TLS, and send the request over the socket.
sock = socket.create_connection((hostname, 443), timeout=3.1)
context = ssl.create_default_context()

with sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        ssock.sendall(http_msg)
        # Let http.client parse the response directly from the socket.
        response = http.client.HTTPResponse(ssock, method="GET")
        response.begin()
        print(response.status, response.reason)
        data = response.read()

resp_data = json.loads(data.decode("utf-8"))

What was perplexing is that the HTTP message I sent was exactly the same one urllib3 uses, as I confirmed by debugging the following code. (See this answer for how to set up logging to debug requests, which also works for urllib3; a rough sketch of that logging setup follows the code below.)

Yet, this code gave a 403 HTTP status code.

import urllib3

http = urllib3.PoolManager()

r = http.request(
    "GET",
    "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football",
)
assert r.status == 403
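
For completeness, turning on that debug logging looks roughly like this. It is a minimal sketch using the standard http.client/logging hooks; nothing in it is specific to soraredata.

import logging
import http.client as http_client

# Print the request line and headers for every http.client connection
# (requests and urllib3 both sit on top of http.client).
http_client.HTTPConnection.debuglevel = 1

# Also surface urllib3's own debug messages.
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
urllib3_log = logging.getLogger("urllib3")
urllib3_log.setLevel(logging.DEBUG)
urllib3_log.propagate = True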

Moreover, http.client also gave a 403 status code, even though it seems to do pretty much what I did above: wrap a socket in an SSL context and send the request.

# hostname and path are the same as in the snippet above
conn = http.client.HTTPSConnection(hostname)
conn.request("GET", path)
res = conn.getresponse()
assert res.status == 403

CodePudding user response:

Thank you ogdenkev!

I also found this, but it doesn't always work:

import cloudscraper
import json

# Same rankings endpoint as above.
url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

scraper = cloudscraper.create_scraper()
r = scraper.get(url).text
y = json.loads(r)
print(y)
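
Since it only works intermittently, one option is to retry a few times and parse the body only once the request succeeds. This is just a sketch on top of the snippet above, not something from the original post:

import time
import cloudscraper

url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"
scraper = cloudscraper.create_scraper()

data = None
for attempt in range(5):
    resp = scraper.get(url)
    if resp.status_code == 200:
        data = resp.json()  # parse the JSON body directly
        break
    time.sleep(2)  # back off briefly before retrying

print(data)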