normalize json but keep id from top level --CodePudding

I have a json file from an API response that broadly follows the following schema

import pandas as pd 


j =[
  {
    "orders": [
      {
        "orderId": 0,
        "items": [
          { "item_1": "x", "item_price": 5.99 },
          { "item_1": "y", "item_price": 15.99 }
        ]
      }
    ]
  }
]

What I'm trying to do is to normalize the json but retain the orderId from the top level so I can re-join the datasets further down a data pipeline.

I thought this could be done with one of the keyword arguments in pd.json_normalize

df_orders = pd.json_normalize(j,record_path=['orders'])

print(df_orders)

   orderId                                              items
0        0  [{'item_1': 'x', 'item_price': 5.99}, {'item_1...

df_items = pd.json_normalize(j,record_path=['orders','items'])
print(df_items)

  item_1  item_price
0      x        5.99
1      y       15.99

What I would like is

  item_1  item_price  orders.orderId
0      x        5.99  0
1      y       15.99  0

I've gone through the documentation here

and attempted to use

pd.json_normalize(j,record_path=['orders','items'], meta=['orders'])

which just appends the entire json object to each record.

  item_1  item_price                                             orders
0      x        5.99  {'orderId': 0, 'items': [{'item_1': 'x', 'item...
1      y       15.99  {'orderId': 0, 'items': [{'item_1': 'x', 'item...

any tips using pd.json_normalize?

I have no issue in parsing the json out manually and doing this in a 2 step process but want to avoid further depencies in my pipeline to keep things dynamic if i can help it.

CodePudding user response：

You can try putting an extra pair of square brackets on the meta parameter and specify upto orderId.

pd.json_normalize(j,record_path=['orders','items'], meta=[['orders', 'orderId']])

Result:

  item_1  item_price orders.orderId
0      x        5.99              0
1      y       15.99              0

The syntax of meta in pd.json_normalize is that whenever you want to access a second level field down from top level, you have to specify it under a second level list. The list in the second level in this case, i.e. ['orders', 'orderId'] acts like a path. We still can't skip specifying the top level list even when we have 2 items in this second level list. For reference, we can look at the examples in the official doc and look at e.g. the case we want to access the field governor in the example.

The field orderId is regarded as a second level field instead of top level, probably owing to the fact that the field orders is regarded as the top level, which in turn is because it is the data j that we pass to pd.json_normalize as input data and pd.json_normalize counts levels based on this input data j. Even though we specify record_path, these relative levels are still the same.

CodePudding user response：

meta expects a nested list of lists (strings denoting the path to the field) for each additional field you would like included in your flattened DataFrame.

This works:

pd.json_normalize(j,record_path=['orders','items'],meta[['orders','orderId']])

  item_1  item_price orders.orderId
0      x        5.99              0
1      y       15.99              0