Problem
I was given a pretty big json file that looks like this minimal example:
json_file = """
{
"products":
[
{
"id":"0",
"name": "First",
"emptylist":[],
"properties" :
{
"id" : "",
"name" : ""
}
},
{
"id":"1",
"name": "Second",
"emptylist":[],
"properties":
{
"id" : "23",
"name" : "a useful product",
"features" :
[
{
"name":"Features",
"id":"18",
"features":
[
{
"id":"1001",
"name":"Colour",
"value":"Black"
},
{
"id":"2093",
"name":"Material",
"value":"Plastic"
}
]
},
{
"name":"Sizes",
"id":"34",
"features":
[
{
"id":"4736",
"name":"Length",
"value":"56"
},
{
"id":"8745",
"name":"Width",
"value":"76"
}
]
}
]
}
},
{
"id":"2",
"name": "Third",
"properties" :
{
"id" : "876",
"name" : "another one",
"features" :
[
{
"name":"Box",
"id":"937",
"features":
[
{
"id":"3758",
"name":"Amount",
"value":"1"
},
{
"id":"2222",
"name":"Packaging",
"value":"Blister"
}
]
},
{
"name":"Features",
"id":"8473",
"features":
[
{
"id":"9372",
"name":"Colour",
"value":"White"
},
{
"id":"9375",
"name":"Position",
"value":"A"
},
{
"id":"2654",
"name":"Amount",
"value":"6"
}
]
}
]
}
}
]
}
"""
And I want to make a flat table out of it. It should look like this one:
id name emptylist properties.id properties.name properties.features.name properties.features.id properties.features.features.id properties.features.features.name properties.features.features.value
0 First [] "" "" NaN NaN NaN NaN NaN
1 Second [] "23" "a useful product" Features 18 1001 Colour Black
1 Second [] "23" "a useful product" Features 18 2093 Material Plastic
1 Second [] "23" "a useful product" Sizes 34 4736 Length 56
1 Second [] "23" "a useful product" Sizes 34 8745 Width 76
2 Third "876" "another one" Box 937 3758 Amount 1
2 Third "876" "another one" Box 937 2222 Packaging Blister
2 Third "876" "another one" Features 8473 9372 Colour White
2 Third "876" "another one" Features 8473 9375 Position A
2 Third "876" "another one" Features 8473 2654 Amount 6
What I tried
I tried this:
import pandas as pd
import json
j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df
id name emptylist properties.id properties.name properties.features
0 0 First [] NaN
1 1 Second [] 23 a useful product [{'name': 'Features', 'id': '18', 'features': ...
2 2 Third NaN 876 another one [{'name': 'Box', 'id': '937', 'features': [{'i...
And I tried to play a bit with the additional arguments, but I got nowhere. It seems like this is not the right way.
Can anyone help me?
Additional infos
I got a working solution with R, but I need to be able to do it with Python. If it helps, this would be the R code that I am trying to translate in Python.
library(tidyr)
jsonlite::fromJSON(json_file)$products %>%
jsonlite::flatten() %>%
unnest(properties.features , names_sep = ".", keep_empty = TRUE) %>%
unnest(properties.features.features, names_sep = ".", keep_empty = TRUE)
EDIT
With the help of @piterbarg and some research I got to this solution:
j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df1 = df.explode('properties.features')
df2 = pd.concat([df1.reset_index(drop=True).drop('properties.features', axis = 1),
df1['properties.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.").drop("properties.features.0", axis = 1)], axis = 1)
df2 = df2.explode('properties.features.features')
df3 = pd.concat([df2.reset_index(drop=True).drop('properties.features.features', axis = 1),
df2['properties.features.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.features.").drop("properties.features.features.0", axis = 1)], axis = 1)
df3
With this I get exactly the solution I'm looking for but the code looks pretty messy and I'm not sure how efficient this solution may be. Any help?
CodePudding user response:
It is similar to what you have in Edit, but perhaps slightly shorter syntax and more performant.
If you have NaN in the DataFrame, older version of Pandas could fail on json_normalize
.
This solution should work with Pandas 1.3 .
df = pd.json_normalize(products)
df = df.explode('properties.features')
df = pd.concat([df.drop('properties.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features']).add_prefix('properties.features.')], axis=1)
df = df.explode('properties.features.features')
df = pd.concat([df.drop('properties.features.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features.features']).add_prefix('properties.features.features.')], axis=1)
Perf. with 1000 products.
Code in Edit : 4.85 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This solution: 58.3 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
This can be done with a repeated, if somewhat tedious, application of explode
to expand lists and apply(pd.Series)
to expand dicts:
df1 = df.explode('properties.features')
df2 = df1.join(df1['properties.features'].apply(pd.Series), lsuffix = '', rsuffix = '.properties.features').explode('features').drop(columns = 'properties.features')
df3 = df2.join(df2['features'].apply(pd.Series), lsuffix = '', rsuffix='.features').drop(columns = ['features','emptylist']).drop_duplicates()
df3
looks like this:
id name properties.id properties.name 0 id.properties.features name.properties.features 0.features id.features name.features value
-- ---- ------ --------------- ----------------- --- ------------------------ -------------------------- ------------ ------------- --------------- -------
0 0 First nan nan nan nan nan nan nan
1 1 Second 23 a useful product nan 18 Features nan 1001 Colour Black
1 1 Second 23 a useful product nan 18 Features nan 2093 Material Plastic
1 1 Second 23 a useful product nan 18 Features nan 4736 Length 56
1 1 Second 23 a useful product nan 18 Features nan 8745 Width 76
1 1 Second 23 a useful product nan 34 Sizes nan 1001 Colour Black
1 1 Second 23 a useful product nan 34 Sizes nan 2093 Material Plastic
1 1 Second 23 a useful product nan 34 Sizes nan 4736 Length 56
1 1 Second 23 a useful product nan 34 Sizes nan 8745 Width 76
2 2 Third 876 another one nan 937 Box nan 3758 Amount 1
2 2 Third 876 another one nan 937 Box nan 2222 Packaging Blister
2 2 Third 876 another one nan 937 Box nan 9372 Colour White
2 2 Third 876 another one nan 937 Box nan 9375 Position A
2 2 Third 876 another one nan 937 Box nan 2654 Amount 6
2 2 Third 876 another one nan 8473 Features nan 3758 Amount 1
2 2 Third 876 another one nan 8473 Features nan 2222 Packaging Blister
2 2 Third 876 another one nan 8473 Features nan 9372 Colour White
2 2 Third 876 another one nan 8473 Features nan 9375 Position A
2 2 Third 876 another one nan 8473 Features nan 2654 Amount 6
The names are not quite like you want them, this could be fixed with .rename(columns = {...})
if you want