I am trying to read nested JSON into a Dask DataFrame, preferably with code that'll do the heavy lifting.
Here's the JSON file I am reading:
{
    "data": [
        {
            "name": "george",
            "age": 16,
            "exams": [
                { "subject": "geometry", "score": 56 },
                { "subject": "poetry", "score": 88 }
            ]
        },
        {
            "name": "nora",
            "age": 7,
            "exams": [
                { "subject": "geometry", "score": 87 },
                { "subject": "poetry", "score": 94 }
            ]
        }
    ]
}
Here is the resulting DataFrame I would like.
| name   | age | exam_subject | exam_score |
|--------|-----|--------------|------------|
| george | 16  | geometry     | 56         |
| george | 16  | poetry       | 88         |
| nora   | 7   | geometry     | 87         |
| nora   | 7   | poetry       | 94         |
Here's how I'd accomplish this with pandas:
import pandas as pd

df = pd.read_json("students3.json", orient="split")
exploded = df.explode("exams")
pd.concat(
    [exploded[["name", "age"]].reset_index(drop=True), pd.json_normalize(exploded["exams"])],
    axis=1,
)
Dask doesn't have json_normalize, so what's the best way to accomplish this task?
CodePudding user response:
If the file contains json-lines, then the most scalable approach is to use dask.bag and then map the pandas snippet across each bag partition.
If the file is one large JSON document, then the opening/closing brackets will cause problems, so an additional function will be needed to strip them before the text can be parsed as JSON.
Rough pseudo-code:
import json

import dask.bag as db

bag = db.read_text("students3.json")

# if the file contains json-lines (one object per line)
option1 = bag.map(json.loads).map(pandas_fn)

# if the file is a single json document
option2 = bag.map(convert_to_jsonlines).map(json.loads).map(pandas_fn)
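For the json-lines case, here is a minimal runnable sketch of option1. It assumes the data has been rewritten as one student object per line (the file name students3.jsonl and the flattening helper pandas_fn are illustrative, not fixed by the original answer); Bag.to_dataframe then assembles the Dask DataFrame:

import json

import dask.bag as db

def pandas_fn(record):
    # flatten one student record into one row dict per exam
    return [
        {
            "name": record["name"],
            "age": record["age"],
            "exam_subject": exam["subject"],
            "exam_score": exam["score"],
        }
        for exam in record["exams"]
    ]

# assumes students3.jsonl holds one JSON object per line
bag = db.read_text("students3.jsonl").map(json.loads)
ddf = bag.map(pandas_fn).flatten().to_dataframe()

Calling ddf.compute() then yields the same four-row frame shown in the question.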
CodePudding user response:
Use pd.json_normalize
import json
import pandas as pd
with open('students3.json', 'r', encoding='utf-8') as f:
data = json.loads(f.read())
df = pd.json_normalize(data['data'], record_path='exams', meta=['name', 'age'])
subject score name age
0 geometry 56 george 16
1 poetry 88 george 16
2 geometry 87 nora 7
3 poetry 94 nora 7
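Since the question asks for a Dask DataFrame, the normalized pandas frame can be handed off to Dask afterwards; a minimal sketch, with npartitions=1 chosen arbitrarily:

import dask.dataframe as dd

# rename to match the column names requested in the question
df = df.rename(columns={"subject": "exam_subject", "score": "exam_score"})
ddf = dd.from_pandas(df, npartitions=1)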