I want to scrape the date this profile was created from this website. How can I achieve this using python? I tried creating a request for the hasura api to return the date string but I wasn't able to achieve the result I wanted. Any help is appreciated!
CodePudding user response:
Here is the working solution so far.
Code:
import requests
link = 'https://foundation.app/_next/data/0JezQQoyG11uO1zi-CW4m/blog.json'
with requests.Session() as s:
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
resp = s.get(link)
for item in resp.json()['pageProps']['articles']:
print(item['datePosted'])
Output:
2021-10-28
2021-10-26
2021-10-20
2021-10-20
2021-09-27
2021-09-16
2021-09-12
2021-09-07
2021-09-03
2021-09-01
2021-08-30
2021-08-09
2021-08-08
2021-08-03
2021-08-01
2021-07-30
2021-07-28
2021-07-27
2021-07-21
2021-07-14
2021-07-13
2021-07-08
2021-07-07
2021-07-01
2021-06-24
2021-06-17
2021-06-15
2021-06-15
2021-06-10
2021-06-07
2021-06-01
2021-05-25
2021-05-25
2021-05-19
2021-05-17
2021-05-12
2021-05-10
2021-05-05
2021-05-04
2021-05-04
2021-05-03
2021-04-28
2021-04-27
2021-04-26
2021-04-26
2021-04-22
2021-04-21
2021-04-20
2021-04-15
2021-04-14
2021-04-13
2021-04-07
2021-04-01
2021-03-31
2021-03-25
2021-03-24
2021-03-17
2021-03-14
2021-03-12
2021-03-10
2021-03-07
2021-03-02
2021-02-18
2021-02-12
2021-02-01
2021-02-01
2021-01-11
2020-11-18
2020-11-06
2020-11-04
2020-10-30
2020-10-27
2020-10-21
2020-10-20
2020-10-13
2020-08-25
2020-08-03
2020-07-29
2020-07-20
2020-06-25
2020-06-09
2020-05-27
CodePudding user response:
When viewing the page in a browser, logging one's network traffic reveals several XHR HTTP POST requests being made to a GraphQL API, the response of which is JSON and contains the user information you're trying to scrape. All you need to do is imitate such a request.
Since it's GraphQL, we can formulate our own query - let's say we're only interested in retrieving the user's username, and the date on which the account was created:
def get_user():
import requests
url = "https://hasura2.foundation.app/v1/graphql"
headers = {
"user-agent": "Mozilla/5.0"
}
query = """
query hasuraUserProfileByPublicKey($publicKey: String!) {
user: user_by_pk(publicKey: $publicKey) {
...HasuraUserFragment
}
}
fragment HasuraUserFragment on user {
username
createdAt
}
"""
payload = {
"operationName": "hasuraUserProfileByPublicKey",
"query": query,
"variables": {
"publicKey": "0x85293A8CF1E42EEe09B3ad5100C9120C5A13bC0d"
}
}
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
return response.json()["data"]["user"]
def main():
user = get_user()
print("{} joined on {}".format(user["username"], user["createdAt"]))
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
Output:
knndtrz joined on 2021-05-10T10:20:10.36361