I am working with Python and I have the following JSON which I need to convert to a Dataframe:
JSON:
{"Results":
{"forecast": [2.1632421537363355, 16.35421956127545],
"prediction_interval": ["[-114.9747272420262, 119.30121154949884]",
"[-127.10990770140964, 159.8183468239605]"],
"index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0},
{"SaleDate": 1644364800000, "OfferingGroupId": 1}]
}
}
Expected Dataframe output:
Forecast SaleDate OfferingGroupId
2.1632421537363355 2022-02-08 0
16.35421956127545 2022-02-09 1
I have tried a few things but not getting anywhere close, my last attempt was:
string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
json_obj = json.loads(string)
df = pd.DataFrame(json_obj)
print(df)
df = pd.concat([df['Results']], axis=0)
df = pd.concat([df['forecast'], df['index'].apply(pd.Series)], axis=1)
which resulted in an error:
AttributeError: 'list' object has no attribute 'apply'
CodePudding user response:
Not very pretty but I guess you can just throw out all the nesting that makes it complicated by forcing it into an aligned tuple list and then use that:
import json
import pandas as pd
string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
results_dict = json.loads(string)["Results"]
results_tuples = zip(results_dict["forecast"],
[d["SaleDate"] for d in results_dict["index"]],
[d["OfferingGroupId"] for d in results_dict["index"]])
df = pd.DataFrame(results_tuples, columns=["Forecast", "SaleDate", "OfferingGroupId"])
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
print(df)
> Forecast SaleDate OfferingGroupId
0 2.163242 2022-02-08 0
1 16.354220 2022-02-09 1
Or the same idea but forcing it into an aligned dict format:
string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
results_dict = json.loads(string)["Results"]
results_dict = {"Forecast": results_dict["forecast"],
"SaleDate": [d["SaleDate"] for d in results_dict["index"]],
"OfferingGroupId": [d["OfferingGroupId"] for d in results_dict["index"]]}
df = pd.DataFrame.from_dict(results_dict)
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
print(df)
> Forecast SaleDate OfferingGroupId
0 2.163242 2022-02-08 0
1 16.354220 2022-02-09 1
Generally from my experience letting pandas read a non-intended input format and then using the pandas methods to fix it causes much more of a headache than creating a dict or tuple list format as a middle step and just read that. But that might just be personal preference.
CodePudding user response:
One possible approach is to create a DataFrame from the value under "Results" (this will create a column named "index") and build another DataFrame with the "index" column and join
it back to the original DataFrame:
df = pd.DataFrame(data['Results'])
df = df.join(pd.DataFrame(df['index'].tolist())).drop(columns=['prediction_interval', 'index'])
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
Output:
forecast SaleDate OfferingGroupId
0 2.163242 2022-02-08 0
1 16.354220 2022-02-09 1
CodePudding user response:
Just load the index
as a column, then use tolist()
to export it as two columns and create a new DataFrame. Combine the new dataframe with the original via pd.concat()
.
In this example, I also included columns for prediction_interval
because I figured you might want that, too.
d = {"Results":
{"forecast": [2.1632421537363355, 16.35421956127545],
"prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"],
"index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]
}
}
res = pd.DataFrame(d['Results'])
sd = pd.DataFrame(res['index'].tolist())
sd['SaleDate'] = pd.to_datetime(sd['SaleDate'], unit='ms')
pi = pd.DataFrame(res['prediction_interval'].map(json.loads).tolist(), columns=['pi_start', 'pi_end'])
df = pd.concat((res, pi, sd), axis=1).drop(columns=['index', 'prediction_interval'])