I would like to load some JSON data into a pandas dataframe. Normally, i would use pandas.json_normalize
, but I would also like to enforce a scheme (columns and ideally also dtypes) regardless of whether all fields are found in the json documents.
So
one = [
{
"fruit": "apple",
"price": 2,
"color": "red"
},
{
"fruit": "banana",
"color": "yellow"
},
{
"fruit": "orange",
"color": "orange"
}
]
two = [
{
"fruit": "banana",
"color": "yellow"
},
{
"fruit": "orange",
"color": "orange"
}
]
pd.json_normalize(one)
pd.json_normalize(two)
returns
fruit price color
0 apple 2.0 red
1 banana NaN yellow
2 orange NaN orange
fruit color
0 banana yellow
1 orange orange
When I would like to have the price
column in any case, either with NAs or some predefined value. Currently I am doing something like:
def load_fruits(json):
default_fruit = {
"fruit": "",
"color": "",
"price": 0
}
fruits = []
for item in json:
fruit = {
key: item[key]
if key in item.keys()
else default_fruit[key]
for key in default_fruit.keys()
}
fruits.append(fruit)
return pd.DataFrame(fruits)
But I was wondering if there was a more efficient way to handle this for larger datasets. Thanks!
CodePudding user response:
You can use Series.fillna
for each series to fill the NaNs
. It would be something like this from your example:
import pandas as pd
one = [
{
"fruit": "apple",
"price": 2,
"color": "red"
},
{
"fruit": "banana",
"color": "yellow"
},
{
"fruit": "orange",
"color": "orange"
}
]
df1 = pd.json_normalize(one)
df1.fruit = df1.fruit.fillna('')
df1.color = df1.color.fillna('')
df1.price = df1.price.fillna(0)
print(df1)
And the output is:
fruit price color
0 apple 2.0 red
1 banana 0.0 yellow
2 orange 0.0 orange
There's also DataFrame.fillna
that fills the entire DataFrame NaNs
with the value provided.
CodePudding user response:
What's about using Table Schema? You can create a schema for your json, then merge the schema with actual data, and then load the table based on the schema:
one_df=pd.json_normalize(one)
schema=pd.io.json.build_table_schema(one_df, index=False)
import json
two_df = pd.read_json(json.dumps({
'schema': schema,
'data': two,
}), orient="table")