Flattening nested json to have one row per item-CodePudding

I have a nested json that I am trying to flatten:

[
  {
    "name": "table1",
    "count": 123,
    "columns": {
      "col1": "string",
      "col2": "string"
    },
    "partitions": 2
  },
  {
    "name": "table2",
    "count": 234,
    "columns": {
      "col3": "int",
      "col4": "string",
      "col5": "int"
    },
    "partitions": 4
  }
]

I am trying to flatten this into something like below:

name     count   col_name     col_type   partitions
table1    123      col1        string      2
table1    123      col2        string      2
table2    234      col3        int         4
table2    234      col4        string      4
table2    234      col5        int         4

Reading json into pandas dataframe.

with open("file.json") as datafile:
    data = json.load(datafile)
dataframe = pd.DataFrame(data)

pd.json_normalize doesn't work since I wouldn't like to create too many columns. Rather I am trying to create more rows. Could someone guide me as to how can I best achieve this in python or using pandas?

Appreciate any help. Thanks.

CodePudding user response：

This is the straight forward solution, where you are forming bunch of dictionaries from the corresponding keys and then creating a dataframe out of it.

import json
import pandas as pd
with open("abc.json") as datafile:
    data = json.load(datafile)
print(data)
d = [{'name': x['name'],'count':x['count'],'colname':k,'coltype':x['columns'][k], 'partitions':x['partitions']} for x in data for k in x['columns'].keys()]
df = pd.DataFrame.from_dict(d)
print(df)

Output

[{'name': 'table1', 'count': 123, 'columns': {'col1': 'string', 'col2': 'string'}, 'partitions': 2}, {'name': 'table2', 'count': 234, 'columns': {'col3': 'int', 'col4': 'string', 'col5': 'int'}, 'partitions': 4}]
     name  count colname coltype  partitions
0  table1    123    col1  string           2
1  table1    123    col2  string           2
2  table2    234    col3     int           4
3  table2    234    col4  string           4
4  table2    234    col5     int           4

CodePudding user response：

You can use wide_to_long:

df = pd.json_normalize(data)
cols = [c for c in df.columns if not c.startswith('columns')]

out = (pd.wide_to_long(df, stubnames='columns', i=cols, j='col_name', 
                       sep='.', suffix='col\d ')
         .rename(columns={'columns': 'col_type'})
         .query('col_type.notna()').reset_index())
print(out)

# Output
     name  count  partitions col_name col_type
0  table1    123           2     col1   string
1  table1    123           2     col2   string
2  table2    234           4     col3      int
3  table2    234           4     col4   string
4  table2    234           4     col5      int