Best way to enforce a scheme when loading json into dataframes-CodePudding

I would like to load some JSON data into a pandas dataframe. Normally, i would use pandas.json_normalize, but I would also like to enforce a scheme (columns and ideally also dtypes) regardless of whether all fields are found in the json documents.

one = [
        {
            "fruit": "apple",
            "price": 2,
            "color": "red"
        },
        {
            "fruit": "banana",
            "color": "yellow"
        },
        {
            "fruit": "orange",
            "color": "orange"
        }
]

two = [
        {
            "fruit": "banana",
            "color": "yellow"
        },
        {
            "fruit": "orange",
            "color": "orange"
        }
]

pd.json_normalize(one)
pd.json_normalize(two)

returns

    fruit  price   color
0   apple    2.0     red
1  banana    NaN  yellow
2  orange    NaN  orange

    fruit   color
0  banana  yellow
1  orange  orange

When I would like to have the price column in any case, either with NAs or some predefined value. Currently I am doing something like:

def load_fruits(json):
    default_fruit = {
        "fruit": "",
        "color": "",
        "price": 0
    }

    fruits = []
    for item in json:
        fruit = {
            key: item[key]
            if key in item.keys()
            else default_fruit[key]
            for key in default_fruit.keys()
        }
        fruits.append(fruit)
    return pd.DataFrame(fruits)

But I was wondering if there was a more efficient way to handle this for larger datasets. Thanks!

CodePudding user response：

You can use Series.fillna for each series to fill the NaNs. It would be something like this from your example:

import pandas as pd

one = [
        {
            "fruit": "apple",
            "price": 2,
            "color": "red"
        },
        {
            "fruit": "banana",
            "color": "yellow"
        },
        {
            "fruit": "orange",
            "color": "orange"
        }
]

df1 = pd.json_normalize(one)
df1.fruit = df1.fruit.fillna('')
df1.color = df1.color.fillna('')
df1.price = df1.price.fillna(0)
print(df1)

And the output is:

    fruit  price   color
0   apple    2.0     red
1  banana    0.0  yellow
2  orange    0.0  orange

There's also DataFrame.fillna that fills the entire DataFrame NaNs with the value provided.

CodePudding user response：

What's about using Table Schema? You can create a schema for your json, then merge the schema with actual data, and then load the table based on the schema:

one_df=pd.json_normalize(one)
schema=pd.io.json.build_table_schema(one_df, index=False)

import json
two_df = pd.read_json(json.dumps({
    'schema': schema,
    'data': two,
}), orient="table")