I am trying to take the following deeply nested JSON and turn it into (eventually) a CSV. Below is just a sample; the full JSON is huge (12GB).
{'reporting_entity_name':'Blue Cross and Blue Shield of Alabama',
'reporting_entity_type':'health insurance issuer',
'last_updated_on':'2022-06-10',
'version':'1.1.0',
'in_network':[
{'negotiation_arrangement': 'ffs',
'name': 'xploration of Kidney',
'billing_code_type': 'CPT',
'billing_code_type_version': '2022',
'billing_code': '50010',
'description': 'Renal Exploration, Not Necessitating Other Specific Procedures',
'negotiated_rates': [{
'negotiated_prices': [
{'negotiated_type': 'negotiated',
'negotiated_rate': 993.0,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1180.68,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1283.95,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1042.65,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1290.9,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1241.25,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'}
]}]}]}
The end goal is to have a data frame, or a dictionary, that I can then write to a CSV. I want each row to have these columns:
{'reporting_entity_name':'','reporting_entity_type':'','last_updated_on':'','version':'','negotiation_arrangement':'','name':'','billing_code_type':'','billing_code_type_version':'','billing_code':'','description':'','provider_groups':'','negotiated_type':'','negotiated_rate':'','expiration_date':'','service_code':'','billing_class':''}
So far I have tried pandas' json_normalize, flatten, and a few custom modules I found on GitHub, but none of them seem to normalize/flatten the data into new rows, only into new columns. Because the dataset is so huge, I'm worried that doing this recursively in a bunch of nested loops will quickly eat up all my memory. Thanks in advance for any advice you can offer!
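For reference, here is roughly what the pandas attempt looked like (sample_data is just a hypothetical name for the dict above):

import pandas as pd

# json_normalize flattens nested dicts into dotted columns, but the
# nested lists ("in_network" and everything below it) stay as single
# cells instead of becoming new rows
df = pd.json_normalize(sample_data)
print(list(df.columns))
# ['reporting_entity_name', 'reporting_entity_type',
#  'last_updated_on', 'version', 'in_network']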
CodePudding user response:
First of all, I must confess, I'm the author of the library I'm going to use for this -- convtools (github).
Presuming your data is a list of samples like the one you provided:
input_data = [
    {
        "reporting_entity_name": "blue cross and blue shield of alabama",
        "reporting_entity_type": "health insurance issuer",
        "last_updated_on": "2022-06-10",
        "version": "1.1.0",
        "in_network": [
            {
                "negotiation_arrangement": "ffs",
                "name": "xploration of kidney",
                "billing_code_type": "cpt",
                "billing_code_type_version": "2022",
                "billing_code": "50010",
                "description": "renal exploration, not necessitating other specific procedures",
                "negotiated_rates": [
                    {
                        "negotiated_prices": [
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 993.0,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 1180.68,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                        ]
                    }
                ],
            }
        ],
    }
]
I'm defining a config of sorts that you can tweak to add/remove the fields to be retrieved at each nesting level (root, in_network item, negotiated_rates item, negotiated_prices item). The build_conversion function below walks that config recursively to build a converter, which convtools then compiles into an ad hoc Python function that runs without recursion.
from convtools.contrib.tables import Table
from convtools import conversion as c

# one dict per nesting level: which scalar fields to keep and
# which child list to descend into
flattening_levels = [
    {
        "fields": ["reporting_entity_name", "reporting_entity_type"],
        "list_field": "in_network",
    },
    {
        "fields": ["negotiation_arrangement", "name"],
        "list_field": "negotiated_rates",
    },
    {"list_field": "negotiated_prices"},
    {
        "fields": ["negotiated_type", "negotiated_rate"],
    },
]
def build_conversion(levels, index=0):
    level_config = levels[index]
    # basic step is to iterate an input
    return c.iter(
        c.zip(
            # zip nested list
            (
                c.item(level_config["list_field"]).pipe(
                    build_conversion(levels, index + 1)
                )
                if "list_field" in level_config
                else c.naive([{}])
            ),
            # with repeated fields of the current level
            c.repeat(
                {
                    field: c.item(field)
                    for field in level_config.get("fields", ())
                }
            ),
        )
        # update inner objects with fields from the current level
        .iter_mut(c.Mut.custom(c.item(0).call_method("update", c.item(1))))
        # take inner objects, forgetting about the top level
        .iter(c.item(0))
    ).flatten()
flatten_objects = build_conversion(flattening_levels).gen_converter()
Table.from_rows(flatten_objects(input_data)).into_csv("out.csv")
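If you want to verify that the compiled function really runs without recursion, gen_converter should accept a debug flag that prints the generated source:

flatten_objects = build_conversion(flattening_levels).gen_converter(debug=True)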
out.csv looks like:
negotiated_type,negotiated_rate,negotiation_arrangement,name,reporting_entity_name,reporting_entity_type
negotiated,993.0,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
negotiated,1180.68,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
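The CSV above only contains the fields I listed in the config. To get every column from your desired layout, extend the per-level "fields" lists, e.g. (a sketch covering the fields present in your sample):

flattening_levels = [
    {
        "fields": [
            "reporting_entity_name",
            "reporting_entity_type",
            "last_updated_on",
            "version",
        ],
        "list_field": "in_network",
    },
    {
        "fields": [
            "negotiation_arrangement",
            "name",
            "billing_code_type",
            "billing_code_type_version",
            "billing_code",
            "description",
        ],
        "list_field": "negotiated_rates",
    },
    {"list_field": "negotiated_prices"},
    {
        "fields": [
            "negotiated_type",
            "negotiated_rate",
            "expiration_date",
            "service_code",
            "billing_class",
        ]
    },
]

I left provider_groups out because your sample doesn't contain it (c.item compiles to a plain dict lookup, so a missing key raises a KeyError); once it's present in the data, add it to the fields of the corresponding level. Also note that service_code is a list, so you may want to join it into a single string before writing the CSV.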
Let's profile:
import tqdm
many_objects = input_data * 1000000
"""
In [151]: %time Table.from_rows(tqdm.tqdm(flatten_objects(many_objects))).into_csv("out.csv")
2000000it [00:07, 274094.52it/s]
CPU times: user 7.13 s, sys: 141 ms, total: 7.27 s
Wall time: 7.3 s
"""
# resulting file is ~210MB
This solution can be optimized further so that dicts aren't updated multiple times; please let me know if that's needed.
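As for the memory concern: at 12GB you won't be able to json.load the whole file, but you can stream it. Below is a minimal sketch, assuming the real file has the same shape as your sample, the top-level scalar fields appear before in_network (as they do in your sample), "data.json" is a placeholder path, and the ijson streaming parser (pip install ijson) is acceptable:

import ijson

# reuse flattening_levels and build_conversion from above, but start
# at level 1, since in_network items will be streamed in directly
flatten_in_network = build_conversion(flattening_levels, 1).gen_converter()

def stream_rows(path):
    # cheap first pass: collect the small top-level scalar fields
    # (this also picks up last_updated_on and version), stopping as
    # soon as the huge "in_network" array starts
    top_level = {}
    with open(path, "rb") as f:
        for prefix, event, value in ijson.parse(f):
            if prefix == "in_network":
                break
            if event in ("string", "number", "boolean"):
                top_level[prefix] = value

    # second pass: stream in_network items one at a time, so the
    # huge array is never held in memory at once
    with open(path, "rb") as f:
        for item in ijson.items(f, "in_network.item"):
            for row in flatten_in_network([item]):
                yield {**top_level, **row}

Table.from_rows(stream_rows("data.json")).into_csv("out.csv")

Two passes may sound wasteful, but the first one stops as soon as it reaches the in_network key, so it only reads a handful of bytes; the second pass does the real work.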