I have a json file which looks like this:
{
"data": {
"success": true,
"timeseries": true,
"start_date": "2022-10-01",
"end_date": "2022-10-04",
"base": "EUR",
"rates": {
"2022-10-01": {
"NG": 0.1448939471560284
},
"2022-10-02": {
"NG": 0.14487923291390148
},
"2022-10-03": {
"NG": 0.1454857922753868
},
"2022-10-04": {
"NG": 0.1507352356663182
}
},
"unit": "per MMBtu"
}
}
I want to create a dataframe which looks like this:
Date NG base
2022-10-01 0.144894 EUR
2022-10-02 0.144879 EUR
2022-10-03 0.145486 EUR
2022-10-04 0.150735 EUR
This is what I tried:
with open(r'C:\Users\EH\Desktop\tools\json_files\blue_file.json','r') as f:
data = json.loads(f.read())
df1 = pd.DataFrame(data['data']['rates'])
df1 = df1.T
df2 = pd.DataFrame(data['data'])
df2 = df2.base
merge = [df1, df2]
df3 = pd.concat(merge)
print(df3)
My current output:
NG 0
2022-10-01 0.144894 NaN
2022-10-02 0.144879 NaN
2022-10-03 0.145486 NaN
2022-10-04 0.150735 NaN
2022-10-01 NaN EUR
2022-10-02 NaN EUR
2022-10-03 NaN EUR
2022-10-04 NaN EUR
As you can see something is not going okay, I dont understand where the NaNs come from.
CodePudding user response:
JSON is a very flexible format and no function in pandas can parse all types of JSON. You need to preprocess the JSON before constructing the dataframe:
with open("data.json") as fp:
data = json.load(fp)
df = pd.DataFrame(
[(date, rate["NG"]) for date, rate in data["data"]["rates"].items()],
columns=["Date", "NG"],
).assign(base=data["data"]["base"])
CodePudding user response:
Using what you had done, you just needed to specify the axis on which you want to concatenate the 2 dataframes by default pandas concatenated one below the other.
pd.concat(merge, axis=1) # default 0
CodePudding user response:
simply:
import json
with open(r'path.json','r') as f:
data = json.loads(f.read())
df=pd.DataFrame(data['data']).reset_index()
df=df.join(pd.json_normalize(df.pop('rates')))
print(df)
'''
index success timeseries start_date end_date base unit NG
0 2022-10-01 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.1448939471560284
1 2022-10-02 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.14487923291390148
2 2022-10-03 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.1454857922753868
3 2022-10-04 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.1507352356663182
'''
CodePudding user response:
You were almost there. The reason you are getting NaN
s is that you need to specify different axis when concatenating the dataframes: pd.concat(merge, axis=1)
.
Here is my version:
import json
import pandas as pd
with open("blue_file.json", "r") as f:
data = json.loads(f.read())
df = pd.DataFrame.from_dict(data["data"]["rates"], orient="index")
df["base"] = data["data"]["base"]
df
Output:
NG base
2022-10-01 0.144894 EUR
2022-10-02 0.144879 EUR
2022-10-03 0.145486 EUR
2022-10-04 0.150735 EUR
CodePudding user response:
Assuming that the file is stored in a variable called data
data = { "data": { "success": True, "timeseries": True, "start_date": "2022-10-01", "end_date": "2022-10-04", "base": "EUR", "rates": { "2022-10-01": { "NG": 0.1448939471560284 }, "2022-10-02": { "NG": 0.14487923291390148 }, "2022-10-03": { "NG": 0.1454857922753868 }, "2022-10-04": { "NG": 0.1507352356663182 } }, "unit": "per MMBtu" } }
First, one will read the data to a dataframe with pandas.DataFrame.from_dict
as follows
df = pd.DataFrame.from_dict(data['data']['rates'], orient='index')
[Out]:
NG
2022-10-01 0.144894
2022-10-02 0.144879
2022-10-03 0.145486
2022-10-04 0.150735
But, as one wants doesn't want the columns with date to be the index, and one wants that column to have the name Date
, one will start by resetting the index using pandas.DataFrame.reset_index
df = df.reset_index()
[Out]:
index NG
0 2022-10-01 0.144894
1 2022-10-02 0.144879
2 2022-10-03 0.145486
3 2022-10-04 0.150735
And then one will be able to rename the columns to the desired output format pandas.DataFrame.rename
df = df.rename(columns={'index':'Date'})
[Out]:
Date NG
0 2022-10-01 0.144894
1 2022-10-02 0.144879
2 2022-10-03 0.145486
3 2022-10-04 0.150735
As it is still missing the base column, one can pick it up from the data
as follows
df['base'] = data['data']['base']
[Out]:
Date NG base
0 2022-10-01 0.144894 EUR
1 2022-10-02 0.144879 EUR
2 2022-10-03 0.145486 EUR
3 2022-10-04 0.150735 EUR
Even though one already has the desired output, one can wrap everything into a one-liner as follows
df = pd.DataFrame.from_dict(data['data']['rates'], orient='index').reset_index().rename(columns={'index':'Date'}).assign(base=data['data']['base'])
[Out]:
Date NG base
0 2022-10-01 0.144894 EUR
1 2022-10-02 0.144879 EUR
2 2022-10-03 0.145486 EUR
3 2022-10-04 0.150735 EUR