How to create a pandas dataframe from a nested json file?-CodePudding

I have a json file which looks like this:

{
    "data": {
        "success": true,
        "timeseries": true,
        "start_date": "2022-10-01",
        "end_date": "2022-10-04",
        "base": "EUR",
        "rates": {
            "2022-10-01": {
                "NG": 0.1448939471560284
            },
            "2022-10-02": {
                "NG": 0.14487923291390148
            },
            "2022-10-03": {
                "NG": 0.1454857922753868
            },
            "2022-10-04": {
                "NG": 0.1507352356663182
            }
        },
        "unit": "per MMBtu"
    }
}

I want to create a dataframe which looks like this:

Date        NG        base 
2022-10-01  0.144894  EUR
2022-10-02  0.144879  EUR
2022-10-03  0.145486  EUR
2022-10-04  0.150735  EUR

This is what I tried:

with open(r'C:\Users\EH\Desktop\tools\json_files\blue_file.json','r') as f:
    data = json.loads(f.read())

df1 = pd.DataFrame(data['data']['rates'])
df1 = df1.T
df2 = pd.DataFrame(data['data'])
df2 = df2.base

merge = [df1, df2]
df3 = pd.concat(merge)

print(df3)

My current output:

                  NG    0
2022-10-01  0.144894  NaN
2022-10-02  0.144879  NaN
2022-10-03  0.145486  NaN
2022-10-04  0.150735  NaN
2022-10-01       NaN  EUR
2022-10-02       NaN  EUR
2022-10-03       NaN  EUR
2022-10-04       NaN  EUR

As you can see something is not going okay, I dont understand where the NaNs come from.

CodePudding user response：

JSON is a very flexible format and no function in pandas can parse all types of JSON. You need to preprocess the JSON before constructing the dataframe:

with open("data.json") as fp:
    data = json.load(fp)

df = pd.DataFrame(
    [(date, rate["NG"]) for date, rate in data["data"]["rates"].items()],
    columns=["Date", "NG"],
).assign(base=data["data"]["base"])

CodePudding user response：

Using what you had done, you just needed to specify the axis on which you want to concatenate the 2 dataframes by default pandas concatenated one below the other.

pd.concat(merge, axis=1) # default 0

CodePudding user response：

simply:

import json
with open(r'path.json','r') as f:
    data = json.loads(f.read())

df=pd.DataFrame(data['data']).reset_index()
df=df.join(pd.json_normalize(df.pop('rates')))
print(df)
'''
    index      success  timeseries  start_date  end_date    base    unit    NG
0   2022-10-01  true    true    2022-10-01  2022-10-04  EUR per MMBtu   0.1448939471560284
1   2022-10-02  true    true    2022-10-01  2022-10-04  EUR per MMBtu   0.14487923291390148
2   2022-10-03  true    true    2022-10-01  2022-10-04  EUR per MMBtu   0.1454857922753868
3   2022-10-04  true    true    2022-10-01  2022-10-04  EUR per MMBtu   0.1507352356663182

'''

CodePudding user response：

You were almost there. The reason you are getting NaNs is that you need to specify different axis when concatenating the dataframes: pd.concat(merge, axis=1).

Here is my version:

import json
import pandas as pd

with open("blue_file.json", "r") as f:
    data = json.loads(f.read())

df = pd.DataFrame.from_dict(data["data"]["rates"], orient="index")
df["base"] = data["data"]["base"]
df

Output:

                  NG base
2022-10-01  0.144894  EUR
2022-10-02  0.144879  EUR
2022-10-03  0.145486  EUR
2022-10-04  0.150735  EUR

CodePudding user response：

Assuming that the file is stored in a variable called data

data = { "data": { "success": True, "timeseries": True, "start_date": "2022-10-01", "end_date": "2022-10-04", "base": "EUR", "rates": { "2022-10-01": { "NG": 0.1448939471560284 }, "2022-10-02": { "NG": 0.14487923291390148 }, "2022-10-03": { "NG": 0.1454857922753868 }, "2022-10-04": { "NG": 0.1507352356663182 } }, "unit": "per MMBtu" } }

First, one will read the data to a dataframe with pandas.DataFrame.from_dict as follows

df = pd.DataFrame.from_dict(data['data']['rates'], orient='index')

[Out]:
                  NG
2022-10-01  0.144894
2022-10-02  0.144879
2022-10-03  0.145486
2022-10-04  0.150735

But, as one wants doesn't want the columns with date to be the index, and one wants that column to have the name Date, one will start by resetting the index using pandas.DataFrame.reset_index

df = df.reset_index()

[Out]:
        index        NG
0  2022-10-01  0.144894
1  2022-10-02  0.144879
2  2022-10-03  0.145486
3  2022-10-04  0.150735

And then one will be able to rename the columns to the desired output format pandas.DataFrame.rename

df = df.rename(columns={'index':'Date'})

[Out]:

         Date        NG
0  2022-10-01  0.144894
1  2022-10-02  0.144879
2  2022-10-03  0.145486
3  2022-10-04  0.150735

As it is still missing the base column, one can pick it up from the data as follows

df['base'] = data['data']['base']

[Out]:

         Date        NG base
0  2022-10-01  0.144894  EUR
1  2022-10-02  0.144879  EUR
2  2022-10-03  0.145486  EUR
3  2022-10-04  0.150735  EUR

Even though one already has the desired output, one can wrap everything into a one-liner as follows

df = pd.DataFrame.from_dict(data['data']['rates'], orient='index').reset_index().rename(columns={'index':'Date'}).assign(base=data['data']['base'])

[Out]:
         Date        NG base
0  2022-10-01  0.144894  EUR
1  2022-10-02  0.144879  EUR
2  2022-10-03  0.145486  EUR
3  2022-10-04  0.150735  EUR