Home > OS >  json_normalize a nested database
json_normalize a nested database

Time:11-11

I'm trying to flatten a json database into pandas dataframe and as it's the first time I'm dealing with json format, I can't do what I want. The database is located here https://mtgjson.com/downloads/all-files/#allprices and according to the model, the structure is like this :

{
  "0120a941-9cfb-50b5-b5e4-4e0c7bd32410": {
    "mtgo": {
      "cardhoarder": {
        "currency": "USD",
        "retail": {
          "foil": {
            ..., // more rows
            "2020-04-21": 0.02
          },
          "normal": {
            ..., // more rows
            "2020-04-21": 0.02
          }
        }
      },
    },
    "paper": {
      "cardkingdom" : {
        "buylist": {
          "foil": {
            ..., // more rows
            "2020-04-21": 0.6
          },
          "normal": {
            ..., // more rows
            "2020-04-21": 0.01
          }
        },
        "currency": "USD",
        "retail": {
          "foil": {
            ..., // more rows
            "2020-04-21": 0.12
          },
          "normal": {
            ..., // more rows
            "2020-04-21": 0.02
          }
        }
      },
      "cardmarket": {
        "currency": "EUR",
        "retail": {
          "foil": {
            ..., // more rows
            "2020-04-21": 0.12
          },
          "normal": {
            ..., // more rows
            "2020-04-21": 0.02
          }
        }
      },
      "tcgplayer": {
        "currency": "USD",
        "retail": {
          "foil": {
            ..., // more rows
            "2020-04-21": 0.12
          },
          "normal": {
            ..., // more rows
            "2020-04-21": 0.02
          }
        }
      }
    }
  }
}

When I look into the json file I have this :

{"meta": {"date": "2021-11-07", "version": "5.1.0 20211107"}, "data": {"00010d56-fe38-5e35-8aed-518019aa36a5": {"paper": {"cardkingdom": {"buylist": {"foil.....

When I do the basic pd.read_json('AllPrices.json') I got this

meta data
date 2021-11-07 NaN
version 5.1.0 20211107 NaN
00010d56-fe38-5e35-8aed-518019aa36a5 NaN {'paper': {'cardkingdom': {'buylist': {'foil':...
0001e0d0-2dcd-5640-aadc-a84765cf5fc9 NaN {'paper': {'cardkingdom': {'buylist': {'normal...

So I did some research and found the json_normalize and wrote this piece of code :

with open('AllPrices.json','r') as f:
    data = json.loads(f.read())
pd.json_normalize(data, errors='ignore')

This did the job by flattening the json database but I ended with one row and 31 millions columns. What I want is only one information in this database that is the uuid and the cardmarket price of a normal paper card on the date I want like this :

uuid paper.cardmarket.retail.normal.2021-11-07
00010d56-fe38-5e35-8aed-518019aa36a5 0.5
0001e0d0-2dcd-5640-aadc-a84765cf5fc9 0.25

I played with the record_path = parameter and the meta = parameter but the best I did was not my expected table. I tried record_path = ['data'] that give me only the uuid in one column.

Thanks for your help

CodePudding user response:

In your case, use json_normalize for each uuid record then extract the desired information 'paper.cardmarket.retail.normal':

with open('AllPrices.json') as fp:
    prices = json.load(fp)

    data = []
    for uuid in prices['data']:
        df = pd.json_normalize(prices['data'][uuid]) \
               .filter(like='paper.cardmarket.retail.normal')
        if df.empty:
            continue
        df.columns = df.columns.str.rsplit('.', 1).str[-1]
        df.index = [uuid]
        data.append(df)
    df = pd.concat(data)

Output: (tested on AllPrices.json file)

                                      2021-08-09  2021-08-11  2021-08-12  2021-08-13  ...  2021-10-27  2021-10-28  2021-11-02  2021-11-09
00010d56-fe38-5e35-8aed-518019aa36a5        4.35        4.35        4.35        4.35  ...        4.35        4.35        4.35        4.35
0001e0d0-2dcd-5640-aadc-a84765cf5fc9        4.95        4.95        3.45        3.45  ...        7.99        6.81        5.42        6.77
0003caab-9ff5-5d1a-bc06-976dd0457f19        0.24        0.27        0.07        0.38  ...        0.25        0.04        0.13        0.36
0003d249-25d9-5223-af1e-1130f09622a7        0.30        0.30        0.75        0.75  ...        0.15        0.20        0.04        0.25
0004a4fb-92c6-59b2-bdbe-ceb584a9e401        0.27        0.14        0.19        0.10  ...        0.10        0.05        0.19        0.13
...                                          ...         ...         ...         ...  ...         ...         ...         ...         ...
fffa4ccf-733e-513a-98f9-181b9549de62        0.23        0.15        0.21        0.21  ...        0.10        0.21        0.15        0.20
fffb659e-b3fa-5cd8-9423-fe5ac74248b5        0.49        0.49        0.49        0.49  ...        0.35        0.35        0.35        0.20
fffbc95a-c4d1-56aa-8653-8a7c71fe19ce        6.95        6.95        6.95        6.95  ...       10.26       10.26       10.26        6.43
fffc1305-a118-559b-9504-3d7b56ca0bde        0.18        0.18        0.18        0.18  ...        0.04        0.04        0.04        0.04
fffdd333-3789-5104-a8be-37be199a2cb1        0.87        0.73        0.99        0.99  ...        0.49        0.45        0.45        0.15

[50568 rows x 75 columns]

CodePudding user response:

You can define a function that will get the value in the dictionary and apply this function to df['data']:

df = pd.read_json('AllPrices.json')

def extract_from_json(cell):
    if isinstance(cell, dict) and "paper" in cell.keys():
        return list(cell["paper"]["cardmarket"]["retail"]["normal"].values())[0]
    return cell

df["data"]=df["data"].apply(extract_from_json)
print(df)`

This will return the price for the first date in the list (index 0). If you only want the price at a specific date your function should return the following:

return cell["paper"]["cardmarket"]["retail"]["normal"].get("2021-11-07", None)
  • Related