I have a nested json that I am trying to flatten:
[
{
"name": "table1",
"count": 123,
"columns": {
"col1": "string",
"col2": "string"
},
"partitions": 2
},
{
"name": "table2",
"count": 234,
"columns": {
"col3": "int",
"col4": "string",
"col5": "int"
},
"partitions": 4
}
]
I am trying to flatten this into something like below:
name count col_name col_type partitions
table1 123 col1 string 2
table1 123 col2 string 2
table2 234 col3 int 4
table2 234 col4 string 4
table2 234 col5 int 4
Reading json into pandas dataframe.
with open("file.json") as datafile:
data = json.load(datafile)
dataframe = pd.DataFrame(data)
pd.json_normalize
doesn't work since I wouldn't like to create too many columns.
Rather I am trying to create more rows.
Could someone guide me as to how can I best achieve this in python or using pandas?
Appreciate any help. Thanks.
CodePudding user response:
This is the straight forward solution, where you are forming bunch of dictionaries from the corresponding keys and then creating a dataframe out of it.
import json
import pandas as pd
with open("abc.json") as datafile:
data = json.load(datafile)
print(data)
d = [{'name': x['name'],'count':x['count'],'colname':k,'coltype':x['columns'][k], 'partitions':x['partitions']} for x in data for k in x['columns'].keys()]
df = pd.DataFrame.from_dict(d)
print(df)
Output
[{'name': 'table1', 'count': 123, 'columns': {'col1': 'string', 'col2': 'string'}, 'partitions': 2}, {'name': 'table2', 'count': 234, 'columns': {'col3': 'int', 'col4': 'string', 'col5': 'int'}, 'partitions': 4}]
name count colname coltype partitions
0 table1 123 col1 string 2
1 table1 123 col2 string 2
2 table2 234 col3 int 4
3 table2 234 col4 string 4
4 table2 234 col5 int 4
CodePudding user response:
You can use wide_to_long
:
df = pd.json_normalize(data)
cols = [c for c in df.columns if not c.startswith('columns')]
out = (pd.wide_to_long(df, stubnames='columns', i=cols, j='col_name',
sep='.', suffix='col\d ')
.rename(columns={'columns': 'col_type'})
.query('col_type.notna()').reset_index())
print(out)
# Output
name count partitions col_name col_type
0 table1 123 2 col1 string
1 table1 123 2 col2 string
2 table2 234 4 col3 int
3 table2 234 4 col4 string
4 table2 234 4 col5 int