Home > Mobile >  Flattening nested json to have one row per item
Flattening nested json to have one row per item

Time:05-06

I have a nested json that I am trying to flatten:

[
  {
    "name": "table1",
    "count": 123,
    "columns": {
      "col1": "string",
      "col2": "string"
    },
    "partitions": 2
  },
  {
    "name": "table2",
    "count": 234,
    "columns": {
      "col3": "int",
      "col4": "string",
      "col5": "int"
    },
    "partitions": 4
  }
]

I am trying to flatten this into something like below:

name     count   col_name     col_type   partitions
table1    123      col1        string      2
table1    123      col2        string      2
table2    234      col3        int         4
table2    234      col4        string      4
table2    234      col5        int         4

Reading json into pandas dataframe.

with open("file.json") as datafile:
    data = json.load(datafile)
dataframe = pd.DataFrame(data)

pd.json_normalize doesn't work since I wouldn't like to create too many columns. Rather I am trying to create more rows. Could someone guide me as to how can I best achieve this in python or using pandas?

Appreciate any help. Thanks.

CodePudding user response:

This is the straight forward solution, where you are forming bunch of dictionaries from the corresponding keys and then creating a dataframe out of it.

import json
import pandas as pd
with open("abc.json") as datafile:
    data = json.load(datafile)
print(data)
d = [{'name': x['name'],'count':x['count'],'colname':k,'coltype':x['columns'][k], 'partitions':x['partitions']} for x in data for k in x['columns'].keys()]
df = pd.DataFrame.from_dict(d)
print(df)

Output

[{'name': 'table1', 'count': 123, 'columns': {'col1': 'string', 'col2': 'string'}, 'partitions': 2}, {'name': 'table2', 'count': 234, 'columns': {'col3': 'int', 'col4': 'string', 'col5': 'int'}, 'partitions': 4}]
     name  count colname coltype  partitions
0  table1    123    col1  string           2
1  table1    123    col2  string           2
2  table2    234    col3     int           4
3  table2    234    col4  string           4
4  table2    234    col5     int           4

CodePudding user response:

You can use wide_to_long:

df = pd.json_normalize(data)
cols = [c for c in df.columns if not c.startswith('columns')]

out = (pd.wide_to_long(df, stubnames='columns', i=cols, j='col_name', 
                       sep='.', suffix='col\d ')
         .rename(columns={'columns': 'col_type'})
         .query('col_type.notna()').reset_index())
print(out)

# Output
     name  count  partitions col_name col_type
0  table1    123           2     col1   string
1  table1    123           2     col2   string
2  table2    234           4     col3      int
3  table2    234           4     col4   string
4  table2    234           4     col5      int
  • Related