The following page lists a JSON link as a data source: https://www.sec.gov/os/webmaster-faq#user-agent
But sending the "Host" header leads to a "404 page not found" error.
These headers, however, work fine:
headers = {
"User-Agent": "jo boulement [email protected]",
"Accept-Encoding": "gzip, deflate"
}
Crazy, because the documentation says something else :(
CodePudding user response:
A web server checks the headers that you send in your request and might decide to return an error page if you don't include certain headers. In this case, it looks like they return an error if you don't include a valid user agent.
This works for me:
import requests

# A browser-like User-Agent; the server appears to reject requests without a valid one.
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
url = "https://data.sec.gov/submissions/CIK0001067983.json"

# A plain GET needs no request body, so no payload is passed.
response = requests.get(url, headers=headers)
print(response.text)
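As a complement to the snippet above, here is a minimal sketch that uses only the standard library (urllib instead of requests, which is a swap on my part). It assumes the 404 in the question comes from a hand-set Host header that does not match data.sec.gov; I have not verified that. The contact string and the helper names build_request and fetch_json are made up for illustration.

```python
import gzip
import json
import urllib.request

# Hypothetical contact string (a name plus an email address); replace with your own.
USER_AGENT = "Sample Company admin@example.com"


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request with a declared User-Agent and no manual Host header."""
    headers = {
        "User-Agent": user_agent,
        # Only gzip is advertised because that is the only encoding decoded below.
        "Accept-Encoding": "gzip",
    }
    # Host is deliberately not set: urllib derives it from the URL when sending.
    return urllib.request.Request(url, headers=headers)


def fetch_json(url):
    """Fetch the URL and decode the JSON body; urlopen raises HTTPError on 4xx/5xx."""
    req = build_request(url)
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)
```

Calling fetch_json("https://data.sec.gov/submissions/CIK0001067983.json") should then either return the parsed data or raise an error you can inspect, instead of silently printing an error page.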
CodePudding user response:
The HTML that gets returned includes this <script> tag:
<script src="/files/js/js_DkdESgtfPfV7guog-Lhz7nda0K-ISZe0-gHU4CF6Wo0.js"></script>
My guess is that the script referenced by the tag is what causes the JSON data to be returned. A browser will run that script as part of rendering the HTML. The Requests package doesn't do this. It just returns the raw HTML. You might need to use something like Puppeteer or Selenium to get the JSON via that URL.