Home > Back-end >  parsing nested json using Pandas
parsing nested json using Pandas

Time:09-06

I wanted to try to parse this nested json using Pandas, and I'm confused when i wanted to extract the data from column "amount" and "items", and the data has so many rows like hundreds, this is one of the example

{
    "_id": "62eaa99b014c9bb30203e48a",
    "amount": {
      "product": 291000,
      "shipping": 75000,
      "admin_fee": 4500,
      "order_voucher_deduction": 0,
      "transaction_voucher_deduction": 0,
      "total": 366000,
      "paid": 366000
    },
    "status": 32,
    "items": [
      {
        "_id": "62eaa99b014c9bb30203e48d",
        "earning": 80400,
        "variants": [
          {
            "name": "Color",
            "value": "Black"
          },
          {
            "name": "Size",
            "value": "38"
          }
        ],
        "marketplace_price": 65100,
        "product_price": 62000,
        "reseller_price": 145500,
        "product_id": 227991,
        "name": "Heels",
        "sku_id": 890512,
        "internal_markup": 3100,
        "weight": 500,
        "image": "https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg",
        "quantity": 1,
        "supplier_price": 60140
      }

I've tried using this and it'd only shows the index

dfjson=pd.json_normalize(datasetjson)
dfjson.head(3)

enter image description here

##UPDATE enter image description here

I tried added the pd.Dataframe , yes it works to become dataframe, but i still haven't got to know how to extract the _id, earning, variants

CodePudding user response:

Given:

data = {
 '_id': '62eaa99b014c9bb30203e48a',
 'amount': {'admin_fee': 4500,
            'order_voucher_deduction': 0,
            'paid': 366000,
            'product': 291000,
            'shipping': 75000,
            'total': 366000,
            'transaction_voucher_deduction': 0},
 'items': [{'_id': '62eaa99b014c9bb30203e48d',
            'earning': 80400,
            'image': 'https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg',
            'internal_markup': 3100,
            'marketplace_price': 65100,
            'name': 'Heels',
            'product_id': 227991,
            'product_price': 62000,
            'quantity': 1,
            'reseller_price': 145500,
            'sku_id': 890512,
            'supplier_price': 60140,
            'variants': [{'name': 'Color', 'value': 'Black'},
                         {'name': 'Size', 'value': '38'}],
            'weight': 500}],
 'status': 32
}

Doing:

df = pd.json_normalize(data, ['items'], ['amount'])
df = df.join(df.amount.apply(pd.Series))
df = df.join(df.variants.apply(pd.DataFrame)[0].set_index('name').T.reset_index(drop=True))
df = df.drop(['amount', 'variants'], axis=1)
print(df)

Output:

                        _id  earning  marketplace_price  product_price  reseller_price  product_id   name  sku_id  internal_markup  weight                                              image  quantity  supplier_price  product  shipping  admin_fee  order_voucher_deduction  transaction_voucher_deduction   total    paid  Color Size
0  62eaa99b014c9bb30203e48d    80400              65100          62000          145500      227991  Heels  890512             3100     500  https://product-asset.s3.ap-southeast-1.amazon...         1           60140   291000     75000       4500                        0                              0  366000  366000  Black   38

There's probably a better way to do some of this, but the sample provided wasn't even a valid json object, so I can't be sure what the real data actually looks like.

CodePudding user response:

Try pd.json_normalize(datasetjson, max_level=0)

CodePudding user response:

I guess you are confuse working with dictionaries or json format.

This line is the same sample you have but it's missed ]} at the end. I formatted removing blank spaces but it's the same:

dfjson = {"_id":"62eaa99b014c9bb30203e48a","amount":{"product":291000,"shipping":75000,"admin_fee":4500,"order_voucher_deduction":0,"transaction_voucher_deduction":0,"total":366000,"paid":366000},"status":32,"items":[{"_id":"62eaa99b014c9bb30203e48d","earning":80400,"variants":[{"name":"Color","value":"Black"},{"name":"Size","value":"38"}],"marketplace_price":65100,"product_price":62000,"reseller_price":145500,"product_id":227991,"name":"Heels","sku_id":890512,"internal_markup":3100,"weight":500,"image":"https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg","quantity":1,"supplier_price":60140}]}

Now, if you want to call amount:

dfjson['amount']
# Output
{'product': 291000,
 'shipping': 75000,
 'admin_fee': 4500,
 'order_voucher_deduction': 0,
 'transaction_voucher_deduction': 0,
 'total': 366000,
 'paid': 366000}

If you want to call items:

dfjson['items']
# Output
[{'_id': '62eaa99b014c9bb30203e48d',
  'earning': 80400,
  'variants': [{'name': 'Color', 'value': 'Black'},
   {'name': 'Size', 'value': '38'}],
  'marketplace_price': 65100,
  'product_price': 62000,
  'reseller_price': 145500,
  'product_id': 227991,
  'name': 'Heels',
  'sku_id': 890512,
  'internal_markup': 3100,
  'weight': 500,
  'image': 'https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg',
  'quantity': 1,
  'supplier_price': 60140}]

For getting the items, you can create a list:

list_items = []
for i in dfjson['items']:
    list_items.append(i)

CodePudding user response:

If you want to use a Dataframe, the data must be in 2d format. Your data is list,dict.. organized by data, you need to cut it first and then convert it to Dataframe.

CodePudding user response:

For how to import the entire json data into a pandas DataFrame check out the answer given by BeRT2me

but if you are only after extracting _id, earning, variants to a Pandas DataFrame giving this pandas data export to a dictionary:

{'_id'     : {0: '62eaa99b014c9bb30203e48a'},
 '_id_id'  : {0: '62eaa99b014c9bb30203e48d'},
 'earning' : {0: 80400},
 'variants': {0: [{'name': 'Color', 'value': 'Black'},
                  {'name': 'Size', 'value': '38'}]}}

as you state in your question:

but I still haven't got to know how to extract the _id, earning, variants

notice that the problem with extracting _id, earning, variants is that this values are 'hidden" within a list of one item. Resolving this issue with indexing it by [0] gives the required values:

json_text = """\
{'_id': '62eaa99b014c9bb30203e48a',
 'amount': {'admin_fee': 4500,
            'order_voucher_deduction': 0,
            'paid': 366000,
            'product': 291000,
            'shipping': 75000,
            'total': 366000,
            'transaction_voucher_deduction': 0},
 'items': [{'_id': '62eaa99b014c9bb30203e48d',
            'earning': 80400,
            'image': 'https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg',
            'internal_markup': 3100,
            'marketplace_price': 65100,
            'name': 'Heels',
            'product_id': 227991,
            'product_price': 62000,
            'quantity': 1,
            'reseller_price': 145500,
            'sku_id': 890512,
            'supplier_price': 60140,
            'variants': [{'name': 'Color', 'value': 'Black'},
                         {'name': 'Size', 'value': '38'}],
            'weight': 500}],
 'status': 32}

json_dict = eval(json_text)

print(f'{(_id      := json_dict["items"][0]["_id"])=}')
print(f'{(earning  := json_dict["items"][0]["earning"])=}')
print(f'{(variants := json_dict["items"][0]["variants"])=}')
print('---')
print(f'{_id=}')
print(f'{earning=}')
print(f'{variants=}')

gives:

(_id      := json_dict["items"][0]["_id"])='62eaa99b014c9bb30203e48d'
(earning  := json_dict["items"][0]["earning"])=80400
(variants := json_dict["items"][0]["variants"])=[{'name': 'Color', 'value': 'Black'}, {'name': 'Size', 'value': '38'}]
---
_id='62eaa99b014c9bb30203e48d'
earning=80400
variants=[{'name': 'Color', 'value': 'Black'}, {'name': 'Size', 'value': '38'}]```

If you want in addition a Pandas DataFrame with rows holding all this extracted values you can loop over all your json data files adding a row to a DataFrame using:

# Create an empty DataFrame: 
df = pd.DataFrame(columns=['_id', '_id_id', 'earning', 'variants'])
# Add rows to df in a loop processing the json data files: 
df_to_append = pd.DataFrame(
    [[json_dict['_id'], _id, earning, variants]], 
    columns=['_id', '_id_id', 'earning', 'variants']
    )
df = df.append(df_to_append)
from pprint import pprint
pprint(df.to_dict())
  • Related