I have a json file from an API response that broadly follows the following schema
import pandas as pd
j =[
{
"orders": [
{
"orderId": 0,
"items": [
{ "item_1": "x", "item_price": 5.99 },
{ "item_1": "y", "item_price": 15.99 }
]
}
]
}
]
What I'm trying to do is to normalize the json but retain the orderId from the top level so I can re-join the datasets further down a data pipeline.
I thought this could be done with one of the keyword arguments in pd.json_normalize
df_orders = pd.json_normalize(j,record_path=['orders'])
print(df_orders)
orderId items
0 0 [{'item_1': 'x', 'item_price': 5.99}, {'item_1...
df_items = pd.json_normalize(j,record_path=['orders','items'])
print(df_items)
item_1 item_price
0 x 5.99
1 y 15.99
What I would like is
item_1 item_price orders.orderId
0 x 5.99 0
1 y 15.99 0
--
I've gone through the documentation here
and attempted to use
pd.json_normalize(j,record_path=['orders','items'], meta=['orders'])
which just appends the entire json object to each record.
item_1 item_price orders
0 x 5.99 {'orderId': 0, 'items': [{'item_1': 'x', 'item...
1 y 15.99 {'orderId': 0, 'items': [{'item_1': 'x', 'item...
any tips using pd.json_normalize
?
I have no issue in parsing the json out manually and doing this in a 2 step process but want to avoid further depencies in my pipeline to keep things dynamic if i can help it.
CodePudding user response:
You can try putting an extra pair of square brackets on the meta
parameter and specify upto orderId
.
pd.json_normalize(j,record_path=['orders','items'], meta=[['orders', 'orderId']])
Result:
item_1 item_price orders.orderId
0 x 5.99 0
1 y 15.99 0
The syntax of meta
in pd.json_normalize
is that whenever you want to access a second level field down from top level, you have to specify it under a second level list. The list in the second level in this case, i.e. ['orders', 'orderId']
acts like a path. We still can't skip specifying the top level list even when we have 2 items in this second level list. For reference, we can look at the examples in the official doc and look at e.g. the case we want to access the field governor
in the example.
The field orderId
is regarded as a second level field instead of top level, probably owing to the fact that the field orders
is regarded as the top level, which in turn is because it is the data j
that we pass to pd.json_normalize
as input data and pd.json_normalize
counts levels based on this input data j
. Even though we specify record_path
, these relative levels are still the same.
CodePudding user response:
meta
expects a nested list of lists (strings denoting the path to the field) for each additional field you would like included in your flattened DataFrame.
This works:
pd.json_normalize(j,record_path=['orders','items'],meta[['orders','orderId']])
item_1 item_price orders.orderId
0 x 5.99 0
1 y 15.99 0