Home > Net >  Using json_normalize function to create a relational data model?
Using json_normalize function to create a relational data model?

Time:11-27

I have a nested Python dictionary that I want to convert into a relational model. I am struggling to parse the dictionary into two related tables: a "workspace" table and a "datasets" table - joined by the key workspace_id

enter image description here

simplified_dict ={
   "workspaces":[
      {
         "workspace_id":"d507422c",
         "workspace_name":"Workspace 1",
         "datasets":[
            {
               "dataset_id":"e7e8a355",
               "dataset_name":"Dataset 1 in workspace 1"
            },
            {
               "dataset_id":"bbe8a355",
               "dataset_name":"Dataset 2 in workspace 1"
            }
         ]
      },
      {
         "workspace_id":"etyyy422c-8d6d",
         "workspace_name":"Workspace 2",
         "datasets":[
            {
               "dataset_id":"89jke8a355",
               "dataset_name":"Dataset 3 in Workspace 2"
            },
            {
               "dataset_id":"tyii8a355",
               "dataset_name":"Dataset 4 in workspace 2"
            }
         ]
      }
   ],
   "datasourceInstances":[
      
   ]
}

I can create a table containing workspace information using the pandas json_normalize function.

import pandas as pd
df_workspaces = pd.json_normalize(simplified_dict, record_path=['workspaces'])
df_workspaces

However, when I try and create the second table "datasets" using the same function, I get a dataframe, but it doesn't have a workspace key, that allows me to join the two tables.

df_datasets_in_workspaces = pd.json_normalize(simplified_dict, record_path=['workspaces','datasets'])
df_datasets_in_workspaces

Is there a way to add the workspace key to the this datasets table, to enable the join, while still using the json_normalize function?

If possible I would prefer a solution using json_normalize, rather than using a loop or comprehension, as the json_normalize allows me to easily convert any layer of my real data (with 5 levels of nesting) into a dataframe. With my real datset, I will be looking to generate circa 15 tables, so a low code, very intuitive approach is prefered.

Copy of a google colab notebook with the code is accessible here

Any help would be appreciated.

CodePudding user response:

workspace_df = pd.json_normalize(data=simplified_dict, record_path=["workspaces"]).drop(columns="datasets")
print(workspace_df)

datasets_df = pd.json_normalize(data=simplified_dict["workspaces"], meta=["workspace_id"], record_path=["datasets"])
print(datasets_df)

Outputs:

     workspace_id workspace_name
0        d507422c    Workspace 1
1  etyyy422c-8d6d    Workspace 2

   dataset_id              dataset_name    workspace_id
0    e7e8a355  Dataset 1 in workspace 1        d507422c
1    bbe8a355  Dataset 2 in workspace 1        d507422c
2  89jke8a355  Dataset 3 in Workspace 2  etyyy422c-8d6d
3   tyii8a355  Dataset 4 in workspace 2  etyyy422c-8d6d
  • Related