How can I scrape the date a profile was created using Python?

Time:10-31

I want to scrape the date this profile was created from this website. How can I do this using Python? I tried sending a request to the Hasura API to return the date string, but I could not get the result I wanted. Any help is appreciated!

CodePudding user response:

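Here is the working solution so far. It requests the site's Next.js data endpoint for the blog page and prints the datePosted value of each article. Note that these are blog post dates rather than a profile's creation date, and that the build ID segment of the URL (0JezQQoyG11uO1zi-CW4m) typically changes whenever the site is redeployed.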

Code:

import requests

link = 'https://foundation.app/_next/data/0JezQQoyG11uO1zi-CW4m/blog.json'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    resp = s.get(link)
    for item in resp.json()['pageProps']['articles']:
        print(item['datePosted'])

Output:

2021-10-28
2021-10-26
2021-10-20
2021-10-20
2021-09-27
2021-09-16
2021-09-12
2021-09-07
2021-09-03
2021-09-01
2021-08-30
2021-08-09
2021-08-08
2021-08-03
2021-08-01
2021-07-30
2021-07-28
2021-07-27
2021-07-21
2021-07-14
2021-07-13
2021-07-08
2021-07-07
2021-07-01
2021-06-24
2021-06-17
2021-06-15
2021-06-15
2021-06-10
2021-06-07
2021-06-01
2021-05-25
2021-05-25
2021-05-19
2021-05-17
2021-05-12
2021-05-10
2021-05-05
2021-05-04
2021-05-04
2021-05-03
2021-04-28
2021-04-27
2021-04-26
2021-04-26
2021-04-22
2021-04-21
2021-04-20
2021-04-15
2021-04-14
2021-04-13
2021-04-07
2021-04-01
2021-03-31
2021-03-25
2021-03-24
2021-03-17
2021-03-14
2021-03-12
2021-03-10
2021-03-07
2021-03-02
2021-02-18
2021-02-12
2021-02-01
2021-02-01
2021-01-11
2020-11-18
2020-11-06
2020-11-04
2020-10-30
2020-10-27
2020-10-21
2020-10-20
2020-10-13
2020-08-25
2020-08-03
2020-07-29
2020-07-20
2020-06-25
2020-06-09
2020-05-27
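
The URL in the code above hardcodes a Next.js build ID, so it is likely to stop working after the site is rebuilt. A minimal sketch of one way around that, assuming the site's HTML still embeds a `__NEXT_DATA__` script tag with a `buildId` field (the usual Next.js layout, not verified against this site). The helper names `extract_build_id` and `blog_data_url` are hypothetical:

```python
import json
import re

# Next.js pages usually embed their bootstrap JSON in this script tag.
# Assumption: the id attribute comes first, as in standard Next.js output.
_NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_build_id(html):
    """Return the Next.js buildId found in a page's HTML, or None."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1)).get("buildId")


def blog_data_url(build_id):
    """Rebuild the blog.json URL used above for a given build ID."""
    return "https://foundation.app/_next/data/{}/blog.json".format(build_id)
```

To use it, fetch any regular page of the site with the same session, pass `resp.text` to `extract_build_id`, and request the URL returned by `blog_data_url`.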

CodePudding user response:

Logging the network traffic while viewing the page in a browser reveals several XHR (HTTP POST) requests to a GraphQL API. The responses are JSON and contain the user information you are trying to scrape, so all you need to do is imitate such a request.

Since it's GraphQL, we can formulate our own query. Let's say we're only interested in the user's username and the date on which the account was created:

import sys

import requests


def get_user():
    # GraphQL endpoint seen in the browser's network traffic.
    url = "https://hasura2.foundation.app/v1/graphql"

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    # Request only the two fields we need: username and createdAt.
    query = """
        query hasuraUserProfileByPublicKey($publicKey: String!) {
            user: user_by_pk(publicKey: $publicKey) {
                ...HasuraUserFragment
            }
        }

        fragment HasuraUserFragment on user {
            username
            createdAt
        }
    """

    payload = {
        "operationName": "hasuraUserProfileByPublicKey",
        "query": query,
        "variables": {
            "publicKey": "0x85293A8CF1E42EEe09B3ad5100C9120C5A13bC0d"
        }
    }

    response = requests.post(url, headers=headers, json=payload)
    # Fail loudly on HTTP errors instead of parsing an error page.
    response.raise_for_status()

    return response.json()["data"]["user"]


def main():
    user = get_user()
    print("{} joined on {}".format(user["username"], user["createdAt"]))
    return 0


if __name__ == "__main__":
    sys.exit(main())

Output:

knndtrz joined on 2021-05-10T10:20:10.36361
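
If only the date is wanted rather than the full timestamp, the createdAt string can be parsed with the standard library. A minimal sketch, assuming the value carries no timezone suffix, as in the output above. `parse_created_at` is a hypothetical helper name. It uses `strptime` because `%f` accepts one to six fractional digits, while `datetime.fromisoformat` on Python versions before 3.11 rejects a five-digit fraction like this one:

```python
from datetime import datetime


def parse_created_at(value):
    """Parse a timestamp such as '2021-05-10T10:20:10.36361' into a datetime."""
    # Pick the format depending on whether fractional seconds are present.
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in value else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(value, fmt)


created = parse_created_at("2021-05-10T10:20:10.36361")
print(created.date().isoformat())  # prints 2021-05-10
```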