Read nested JSON into Dask DataFrame

Time:04-13

I am trying to read nested JSON into a Dask DataFrame, preferably with code that'll do the heavy lifting.

Here's the JSON file I am reading:

{
    "data": [{
        "name": "george",
        "age": 16,
        "exams": [{
                "subject": "geometry",
                "score": 56
            },
            {
                "subject": "poetry",
                "score": 88
            }
        ]

    }, {
        "name": "nora",
        "age": 7,
        "exams": [{
                "subject": "geometry",
                "score": 87
            },
            {
                "subject": "poetry",
                "score": 94
            }
        ]
    }]
}

Here is the resulting DataFrame I would like.

name    age  exam_subject  exam_score
george  16   geometry      56
george  16   poetry        88
nora    7    geometry      87
nora    7    poetry        94

Here's how I'd accomplish this with pandas:

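import pandas as pd

df = pd.read_json("students3.json", orient="split")  # one row per student
exploded = df.explode("exams").reset_index(drop=True)  # one row per exam
exams = pd.json_normalize(exploded["exams"].tolist()).add_prefix("exam_")
pd.concat([exploded[["name", "age"]], exams], axis=1)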

Dask doesn't have json_normalize, so what's the best way to accomplish this task?

CodePudding user response:

If the file contains JSON Lines (one JSON object per line), the most scalable approach is to read it with dask.bag and then map the pandas snippet across the bag.

If the file is a single large JSON document, the opening/closing brackets will cause problems because individual lines are not valid JSON on their own, so an additional function is needed to strip the wrapper and turn the content into JSON Lines before parsing.

Rough pseudo-code:

import json

import dask.bag as db

# pandas_fn and convert_to_jsonlines are placeholders for functions you write yourself
bag = db.read_text("students3.json")

# if the file is JSON Lines
option1 = bag.map(json.loads).map(pandas_fn)

# if the file is a single JSON document
option2 = bag.map(convert_to_jsonlines).map(json.loads).map(pandas_fn)
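
The two placeholder functions could look roughly like the sketch below. This is only an illustration using the standard library: the names to_json_lines and flatten_record are hypothetical, and to_json_lines assumes the whole document fits in memory and has the {"data": [...]} shape from the question.

```python
import json


def to_json_lines(text):
    """Turn one JSON document shaped like {"data": [...]} into JSON Lines text."""
    records = json.loads(text)["data"]
    # One compact JSON object per line, so each line can be parsed on its own.
    return "\n".join(json.dumps(record) for record in records)


def flatten_record(record):
    """Return one flat dict per exam, repeating the student's name and age."""
    return [
        {
            "name": record["name"],
            "age": record["age"],
            "exam_subject": exam["subject"],
            "exam_score": exam["score"],
        }
        for exam in record["exams"]
    ]


# Inline copy of the question's data so the sketch runs without any file.
sample = json.dumps({
    "data": [
        {"name": "george", "age": 16, "exams": [
            {"subject": "geometry", "score": 56},
            {"subject": "poetry", "score": 88}]},
        {"name": "nora", "age": 7, "exams": [
            {"subject": "geometry", "score": 87},
            {"subject": "poetry", "score": 94}]},
    ]
})

lines = to_json_lines(sample).splitlines()
rows = [row for line in lines for row in flatten_record(json.loads(line))]
for row in rows:
    print(row)
```

With Dask, once the data is stored as JSON Lines (say in a hypothetical students.jsonl), a chain such as db.read_text("students.jsonl").map(json.loads).map(flatten_record).flatten().to_dataframe() should give one row per exam. Because each record is flattened into plain dicts, no pandas call is needed inside the bag.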

CodePudding user response:

Use pd.json_normalize

import json
import pandas as pd

with open('students3.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

df = pd.json_normalize(data['data'], record_path='exams', meta=['name', 'age'])
print(df)
    subject  score    name age
0  geometry     56  george  16
1    poetry     88  george  16
2  geometry     87    nora   7
3    poetry     94    nora   7
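
The output above has the right rows, but the column names and order differ from what the question asks for. A minimal sketch of one way to close that gap, assuming the record_prefix parameter of pd.json_normalize and using an inline copy of the question's data instead of reading students3.json:

```python
import pandas as pd

# Inline copy of the question's data so the sketch runs without students3.json.
data = {
    "data": [
        {"name": "george", "age": 16, "exams": [
            {"subject": "geometry", "score": 56},
            {"subject": "poetry", "score": 88}]},
        {"name": "nora", "age": 7, "exams": [
            {"subject": "geometry", "score": 87},
            {"subject": "poetry", "score": 94}]},
    ]
}

df = pd.json_normalize(
    data["data"],
    record_path="exams",
    meta=["name", "age"],
    record_prefix="exam_",  # prefixes only the columns that come from "exams"
)

# Reorder the columns to the layout requested in the question.
df = df[["name", "age", "exam_subject", "exam_score"]]
print(df)
```

This still loads everything with pandas, so it only suits files that fit in memory. In that case the result can be handed to Dask afterwards with dd.from_pandas(df, npartitions=...).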