Home > Net >  Best way to enforce a scheme when loading json into dataframes
Best way to enforce a scheme when loading json into dataframes

Time:07-02

I would like to load some JSON data into a pandas dataframe. Normally, i would use pandas.json_normalize, but I would also like to enforce a scheme (columns and ideally also dtypes) regardless of whether all fields are found in the json documents.

So

one = [
        {
            "fruit": "apple",
            "price": 2,
            "color": "red"
        },
        {
            "fruit": "banana",
            "color": "yellow"
        },
        {
            "fruit": "orange",
            "color": "orange"
        }
]

two = [
        {
            "fruit": "banana",
            "color": "yellow"
        },
        {
            "fruit": "orange",
            "color": "orange"
        }
]

pd.json_normalize(one)
pd.json_normalize(two)

returns

    fruit  price   color
0   apple    2.0     red
1  banana    NaN  yellow
2  orange    NaN  orange

    fruit   color
0  banana  yellow
1  orange  orange

When I would like to have the price column in any case, either with NAs or some predefined value. Currently I am doing something like:

def load_fruits(json):
    default_fruit = {
        "fruit": "",
        "color": "",
        "price": 0
    }

    fruits = []
    for item in json:
        fruit = {
            key: item[key]
            if key in item.keys()
            else default_fruit[key]
            for key in default_fruit.keys()
        }
        fruits.append(fruit)
    return pd.DataFrame(fruits)

But I was wondering if there was a more efficient way to handle this for larger datasets. Thanks!

CodePudding user response:

You can use Series.fillna for each series to fill the NaNs. It would be something like this from your example:

import pandas as pd

one = [
        {
            "fruit": "apple",
            "price": 2,
            "color": "red"
        },
        {
            "fruit": "banana",
            "color": "yellow"
        },
        {
            "fruit": "orange",
            "color": "orange"
        }
]

df1 = pd.json_normalize(one)
df1.fruit = df1.fruit.fillna('')
df1.color = df1.color.fillna('')
df1.price = df1.price.fillna(0)
print(df1)

And the output is:

    fruit  price   color
0   apple    2.0     red
1  banana    0.0  yellow
2  orange    0.0  orange

There's also DataFrame.fillna that fills the entire DataFrame NaNs with the value provided.

CodePudding user response:

What's about using Table Schema? You can create a schema for your json, then merge the schema with actual data, and then load the table based on the schema:

one_df=pd.json_normalize(one)
schema=pd.io.json.build_table_schema(one_df, index=False)

import json
two_df = pd.read_json(json.dumps({
    'schema': schema,
    'data': two,
}), orient="table")

  • Related