Home > database >  How to flatten dict in a DataFrame & concatenate all resultant rows
How to flatten dict in a DataFrame & concatenate all resultant rows

Time:01-06

I am using Github's GraphQL API to fetch some issue details.

I used Python Requests to fetch the data locally. This is how the output.json looks like

{
    "data": {
        "viewer": {
            "login": "some_user"
        },
        "repository": {
            "issues": {
                "edges": [
                    {
                        "node": {
                            "id": "I_kwDOHQ63-s5auKbD",
                            "title": "test issue 1",
                            "number": 146,
                            "createdAt": "2023-01-06T06:39:54Z",
                            "closedAt": null,
                            "state": "OPEN",
                            "updatedAt": "2023-01-06T06:42:00Z",
                            "comments": {
                                "edges": [
                                    {
                                        "node": {
                                            "id": "IC_kwDOHQ63-s5R2XCV",
                                            "body": "comment 01"
                                        }
                                    },
                                    {
                                        "node": {
                                            "id": "IC_kwDOHQ63-s5R2XC9",
                                            "body": "comment 02"
                                        }
                                    }
                                ]
                            },
                            "labels": {
                                "edges": []
                            }
                        },
                        "cursor": "Y3Vyc29yOnYyOpHOWrimww=="
                    },
                    {
                        "node": {
                            "id": "I_kwDOHQ63-s5auKm8",
                            "title": "test issue 2",
                            "number": 147,
                            "createdAt": "2023-01-06T06:40:34Z",
                            "closedAt": null,
                            "state": "OPEN",
                            "updatedAt": "2023-01-06T06:40:34Z",
                            "comments": {
                                "edges": []
                            },
                            "labels": {
                                "edges": [
                                    {
                                        "node": {
                                            "name": "food"
                                        }
                                    },
                                    {
                                        "node": {
                                            "name": "healthy"
                                        }
                                    }
                                ]
                            }
                        },
                        "cursor": "Y3Vyc29yOnYyOpHOWripvA=="
                    }
                ]
            }
        }
    }
}

The json was put inside a list using

result = response.json()["data"]["repository"]["issues"]["edges"]

And then this list was put inside a DataFrame

import pandas as pd
df = pd.DataFrame (result, columns = ['node', 'cursor'])
df

These are the contents of the data frame

id title number createdAt closedAt state updatedAt comments labels
I_kwDOHQ63-s5auKbD test issue 1 146 2023-01-06T06:39:54Z None OPEN 2023-01-06T06:42:00Z {'edges': [{'node': {'id': 'IC_kwDOHQ63-s5R2XCV","body": "comment 01"}},{'node': {'id': 'IC_kwDOHQ63-s5R2XC9","body": "comment 02"}}]} {'edges': []}
I_kwDOHQ63-s5auKm8 test issue 2 147 2023-01-06T06:40:34Z None OPEN 2023-01-06T06:40:34Z {'edges': []} {'edges': [{'node': {'name': 'food"}},{'node': {'name': 'healthy"}}]}

I would like to split/explode the comments and labels columns. The values in these columns are nested dictionaries

I would like there to be as many rows for a single issue, as there are comments & labels. I would like to flatten out the data frame. So this involves split/explode and concat.

There are several stackoverflow answers that delve on this topic. And I have tried the code from several of them. I can not paste the links to those questions, because stackoverflow marks my question as spam due to many links. But these are the steps I have tried

df3 = df2['comments'].apply(pd.Series)

Drill down further

df4 = df3['edges'].apply(pd.Series)
df4

Drill down further

df5 = df4['node'].apply(pd.Series)
df5

The last statement above gives me the KeyError: 'node' I understand, this is because node is not a key in the DataFrame.

But how else can i split this dictionary and concatenate all columns back to my issues row? This is how I would like the output to look like

id title number createdAt closedAt state updatedAt comments labels
I_kwDOHQ63-s5auKbD test issue 1 146 2023-01-06T06:39:54Z None OPEN 2023-01-06T06:42:00Z comment 01 Null
I_kwDOHQ63-s5auKbD test issue 1 146 2023-01-06T06:39:54Z None OPEN 2023-01-06T06:42:00Z comment 02 Null
I_kwDOHQ63-s5auKm8 test issue 2 147 2023-01-06T06:40:34Z None OPEN 2023-01-06T06:40:34Z Null food
I_kwDOHQ63-s5auKm8 test issue 2 147 2023-01-06T06:40:34Z None OPEN 2023-01-06T06:40:34Z Null healthy

CodePudding user response:

If dct is your dictionary from the question you can try:

df = pd.DataFrame(d['node'] for d in dct['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']

df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']

print(df.to_markdown(index=False))

Prints:

id title number createdAt closedAt state updatedAt comments labels
I_kwDOHQ63-s5auKbD test issue 1 146 2023-01-06T06:39:54Z OPEN 2023-01-06T06:42:00Z comment 01 nan
I_kwDOHQ63-s5auKbD test issue 1 146 2023-01-06T06:39:54Z OPEN 2023-01-06T06:42:00Z comment 02 nan
I_kwDOHQ63-s5auKm8 test issue 2 147 2023-01-06T06:40:34Z OPEN 2023-01-06T06:40:34Z nan food
I_kwDOHQ63-s5auKm8 test issue 2 147 2023-01-06T06:40:34Z OPEN 2023-01-06T06:40:34Z nan healthy

CodePudding user response:

@andrej-kesely has answered my question.
I have selected his response as the answer for this question.
I am now posting a consolidated script that includes my poor code and andrej's great code.

In this script i want to fetch details from Github's GraphQL API Server.
And put it inside pandas.
Primary source for this script is this gist.
And a major chunk of remaining code is an answer by @andrej-kesely. Now onto the consolidated script.

First import the necessary packages and set headers

import requests
import json
import pandas as pd

headers = {"Authorization": "token <your_github_personal_access_token>"}

Now define the query that will fetch data from github.
In my particular case, I am fetching issue details form a particular repo it can be something else for you.

query = """
{
  viewer {
    login
  }
repository(name: "your_github_repo", owner: "your_github_user_name") {
  issues(states: OPEN, last: 2) {
    edges {
      node {
        id
        title
        number
        createdAt
        closedAt
        state
        updatedAt
        comments(first: 10) {
          edges {
            node {
              id
              body
            }
          }
        }
        labels(orderBy: {field: NAME, direction: ASC}, first: 10) {
          edges {
            node {
              name
            }
          }
        }
        comments(first: 10) {
          edges {
            node {
              id
              body
            }
          }
        }
      }
      cursor
    }
  }
}
}
"""

Execute the query and save the response

def run_query(query):
    request = requests.post('https://api.github.com/graphql', json={'query': query}, headers=headers)
    if request.status_code == 200:
        return request.json()
    else:
        raise Exception("Query failed to run by returning code of {}. {}".format(request.status_code, query))

result = run_query(query)

And now is the trickiest part.
In my query response, there are several nested dictionaries.
I would like to split them - more details in my question above.
This magic code from @andrej-kesely does that for you.

df = pd.DataFrame(d['node'] for d in result['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']

df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']

print(df)
  • Related