Convert JSON into dataframe-CodePudding

I am working with Python and I have the following JSON which I need to convert to a Dataframe:

JSON:

{"Results": 
        {"forecast": [2.1632421537363355, 16.35421956127545], 
         "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", 
                                 "[-127.10990770140964, 159.8183468239605]"], 
         "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, 
                   {"SaleDate": 1644364800000, "OfferingGroupId": 1}]
        }
}

Expected Dataframe output:

Forecast                 SaleDate     OfferingGroupId
2.1632421537363355       2022-02-08    0
16.35421956127545        2022-02-09    1

I have tried a few things but not getting anywhere close, my last attempt was:

string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
json_obj = json.loads(string)
df = pd.DataFrame(json_obj)
print(df)

df = pd.concat([df['Results']], axis=0)
df = pd.concat([df['forecast'], df['index'].apply(pd.Series)], axis=1)

which resulted in an error:

AttributeError: 'list' object has no attribute 'apply'

CodePudding user response：

Not very pretty but I guess you can just throw out all the nesting that makes it complicated by forcing it into an aligned tuple list and then use that:

import json
import pandas as pd

string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
results_dict = json.loads(string)["Results"]
results_tuples = zip(results_dict["forecast"],
                     [d["SaleDate"] for d in results_dict["index"]],
                     [d["OfferingGroupId"] for d in results_dict["index"]])
df = pd.DataFrame(results_tuples, columns=["Forecast", "SaleDate", "OfferingGroupId"])
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
print(df)
>     Forecast   SaleDate  OfferingGroupId
  0   2.163242 2022-02-08                0
  1  16.354220 2022-02-09                1

Or the same idea but forcing it into an aligned dict format:

string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
results_dict = json.loads(string)["Results"]
results_dict = {"Forecast": results_dict["forecast"],
                "SaleDate": [d["SaleDate"] for d in results_dict["index"]],
                "OfferingGroupId": [d["OfferingGroupId"] for d in results_dict["index"]]}
df = pd.DataFrame.from_dict(results_dict)
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
print(df)
>     Forecast   SaleDate  OfferingGroupId
  0   2.163242 2022-02-08                0
  1  16.354220 2022-02-09                1

Generally from my experience letting pandas read a non-intended input format and then using the pandas methods to fix it causes much more of a headache than creating a dict or tuple list format as a middle step and just read that. But that might just be personal preference.

CodePudding user response：

One possible approach is to create a DataFrame from the value under "Results" (this will create a column named "index") and build another DataFrame with the "index" column and join it back to the original DataFrame:

df = pd.DataFrame(data['Results'])
df = df.join(pd.DataFrame(df['index'].tolist())).drop(columns=['prediction_interval', 'index'])
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')

Output:

    forecast   SaleDate  OfferingGroupId
0   2.163242 2022-02-08                0
1  16.354220 2022-02-09                1

CodePudding user response：

Just load the index as a column, then use tolist() to export it as two columns and create a new DataFrame. Combine the new dataframe with the original via pd.concat().

In this example, I also included columns for prediction_interval because I figured you might want that, too.

d = {"Results":
        {"forecast": [2.1632421537363355, 16.35421956127545],
         "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"],
         "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]
        }
}

res = pd.DataFrame(d['Results'])

sd = pd.DataFrame(res['index'].tolist())
sd['SaleDate'] = pd.to_datetime(sd['SaleDate'], unit='ms')

pi = pd.DataFrame(res['prediction_interval'].map(json.loads).tolist(), columns=['pi_start', 'pi_end'])

df = pd.concat((res, pi, sd), axis=1).drop(columns=['index', 'prediction_interval'])