I am using GitHub's GraphQL API to fetch some issue details.
I used Python Requests to fetch the data locally. This is what output.json looks like:
{
  "data": {
    "viewer": {
      "login": "some_user"
    },
    "repository": {
      "issues": {
        "edges": [
          {
            "node": {
              "id": "I_kwDOHQ63-s5auKbD",
              "title": "test issue 1",
              "number": 146,
              "createdAt": "2023-01-06T06:39:54Z",
              "closedAt": null,
              "state": "OPEN",
              "updatedAt": "2023-01-06T06:42:00Z",
              "comments": {
                "edges": [
                  {
                    "node": {
                      "id": "IC_kwDOHQ63-s5R2XCV",
                      "body": "comment 01"
                    }
                  },
                  {
                    "node": {
                      "id": "IC_kwDOHQ63-s5R2XC9",
                      "body": "comment 02"
                    }
                  }
                ]
              },
              "labels": {
                "edges": []
              }
            },
            "cursor": "Y3Vyc29yOnYyOpHOWrimww=="
          },
          {
            "node": {
              "id": "I_kwDOHQ63-s5auKm8",
              "title": "test issue 2",
              "number": 147,
              "createdAt": "2023-01-06T06:40:34Z",
              "closedAt": null,
              "state": "OPEN",
              "updatedAt": "2023-01-06T06:40:34Z",
              "comments": {
                "edges": []
              },
              "labels": {
                "edges": [
                  {
                    "node": {
                      "name": "food"
                    }
                  },
                  {
                    "node": {
                      "name": "healthy"
                    }
                  }
                ]
              }
            },
            "cursor": "Y3Vyc29yOnYyOpHOWripvA=="
          }
        ]
      }
    }
  }
}
The JSON was put inside a list using:
result = response.json()["data"]["repository"]["issues"]["edges"]
And then this list was put inside a DataFrame
import pandas as pd
df = pd.DataFrame(result, columns=['node', 'cursor'])
df
These are the contents of the DataFrame:
id | title | number | createdAt | closedAt | state | updatedAt | comments | labels |
---|---|---|---|---|---|---|---|---|
I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | None | OPEN | 2023-01-06T06:42:00Z | {'edges': [{'node': {'id': 'IC_kwDOHQ63-s5R2XCV', 'body': 'comment 01'}}, {'node': {'id': 'IC_kwDOHQ63-s5R2XC9', 'body': 'comment 02'}}]} | {'edges': []} |
I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | None | OPEN | 2023-01-06T06:40:34Z | {'edges': []} | {'edges': [{'node': {'name': 'food'}}, {'node': {'name': 'healthy'}}]} |
I would like to split/explode the comments and labels columns. The values in these columns are nested dictionaries, and I would like there to be as many rows for a single issue as there are comments and labels. I would like to flatten out the DataFrame, so this involves split/explode and concat.
There are several Stack Overflow answers that delve into this topic, and I have tried the code from several of them. I cannot paste the links to those questions, because Stack Overflow marks my question as spam due to too many links. But these are the steps I have tried:
df3 = df2['comments'].apply(pd.Series)
Drill down further
df4 = df3['edges'].apply(pd.Series)
df4
Drill down further
df5 = df4['node'].apply(pd.Series)
df5
The last statement above gives me KeyError: 'node'.
I understand this is because node is not a column in that DataFrame: after applying pd.Series to the edges lists, the columns are the integer positions 0 and 1, not node.
But how else can I split these dictionaries and concatenate all the columns back onto my issues row?
This is how I would like the output to look like
id | title | number | createdAt | closedAt | state | updatedAt | comments | labels |
---|---|---|---|---|---|---|---|---|
I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | None | OPEN | 2023-01-06T06:42:00Z | comment 01 | Null |
I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | None | OPEN | 2023-01-06T06:42:00Z | comment 02 | Null |
I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | None | OPEN | 2023-01-06T06:40:34Z | Null | food |
I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | None | OPEN | 2023-01-06T06:40:34Z | Null | healthy |
CodePudding user response:
If dct is your dictionary from the question, you can try:
# One row per issue, built from the "node" objects
df = pd.DataFrame(d['node'] for d in dct['data']['repository']['issues']['edges'])
# Pull the list out of the comments dict, give each comment its own row, keep only the body
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']
# Same idea for labels: one row per label, keep only the name
df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']
print(df.to_markdown(index=False))
Prints:
id | title | number | createdAt | closedAt | state | updatedAt | comments | labels |
---|---|---|---|---|---|---|---|---|
I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | | OPEN | 2023-01-06T06:42:00Z | comment 01 | nan |
I_kwDOHQ63-s5auKbD | test issue 1 | 146 | 2023-01-06T06:39:54Z | | OPEN | 2023-01-06T06:42:00Z | comment 02 | nan |
I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | | OPEN | 2023-01-06T06:40:34Z | nan | food |
I_kwDOHQ63-s5auKm8 | test issue 2 | 147 | 2023-01-06T06:40:34Z | | OPEN | 2023-01-06T06:40:34Z | nan | healthy |
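A quick note on how this works (my gloss, not part of the original answer): the .str accessor here is doing element-wise item lookup, so on a column of dicts .str['edges'] behaves roughly like .apply(lambda d: d['edges']), and explode then gives each list element its own row while repeating the other columns; an empty list becomes a single NaN row, which is why the missing comments/labels show up as nan above. A tiny illustration with made-up data:
import pandas as pd
# Toy Series of dicts, each holding an 'edges' list (mimicking the GraphQL payload shape)
s = pd.Series([{'edges': [10, 20]}, {'edges': []}])
print(s.str['edges'])            # element-wise lookup: [10, 20] and []
print(s.str['edges'].explode())  # one row per list item; the empty list becomes NaN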
CodePudding user response:
@andrej-kesely has answered my question, and I have selected his response as the answer.
I am now posting a consolidated script that includes my poor code and Andrej's great code.
In this script I fetch issue details from GitHub's GraphQL API server and put them inside pandas.
The primary source for this script is this gist, and a major chunk of the remaining code is the answer by @andrej-kesely above.
Now onto the consolidated script.
First, import the necessary packages and set the request headers:
import requests
import json
import pandas as pd
headers = {"Authorization": "token <your_github_personal_access_token>"}
Now define the query that will fetch data from GitHub. In my particular case I am fetching issue details from a particular repo; it can be something else for you.
query = """
{
viewer {
login
}
repository(name: "your_github_repo", owner: "your_github_user_name") {
issues(states: OPEN, last: 2) {
edges {
node {
id
title
number
createdAt
closedAt
state
updatedAt
comments(first: 10) {
edges {
node {
id
body
}
}
}
labels(orderBy: {field: NAME, direction: ASC}, first: 10) {
edges {
node {
name
}
}
}
comments(first: 10) {
edges {
node {
id
body
}
}
}
}
cursor
}
}
}
}
"""
Execute the query and save the response
def run_query(query):
    request = requests.post('https://api.github.com/graphql', json={'query': query}, headers=headers)
    if request.status_code == 200:
        return request.json()
    else:
        raise Exception("Query failed to run by returning code of {}. {}".format(request.status_code, query))

result = run_query(query)
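One extra check worth doing (my addition, not part of the original answers): GitHub's GraphQL endpoint can answer with HTTP 200 and still report problems in an errors key, for example when the repo or owner name is wrong, so it is useful to look for that before drilling into data:
if "errors" in result:
    # A typo in the repo/owner name still comes back as HTTP 200, but with an "errors" key
    raise Exception("GraphQL query returned errors: {}".format(result["errors"]))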
And now for the trickiest part. My query response contains several nested dictionaries, and I would like to split them - more details in my question above. This magic code from @andrej-kesely does that for you.
df = pd.DataFrame(d['node'] for d in result['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']
df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']
print(df)
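If you want to keep the result around, a straightforward follow-up (again my addition, not part of the answer; the file name is just an example) is to tidy the exploded index and write the frame out:
df = df.reset_index(drop=True)
df.to_csv("issues.csv", index=False)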