I am trying to take the following deeply nested JSON and turn it into (eventually) a CSV. Below is just a sample; the full JSON is huge (12GB).
{'reporting_entity_name':'Blue Cross and Blue Shield of Alabama',
'reporting_entity_type':'health insurance issuer',
'last_updated_on':'2022-06-10',
'version':'1.1.0',
'in_network':[
{'negotiation_arrangement': 'ffs',
'name': 'xploration of Kidney',
'billing_code_type': 'CPT',
'billing_code_type_version': '2022',
'billing_code': '50010',
'description': 'Renal Exploration, Not Necessitating Other Specific Procedures',
'negotiated_rates': [{
'negotiated_prices': [
{'negotiated_type': 'negotiated',
'negotiated_rate': 993.0,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1180.68,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1283.95,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1042.65,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1290.9,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'},
{'negotiated_type': 'negotiated',
'negotiated_rate': 1241.25,
'expiration_date': '2022-06-30',
'service_code': ['21', '22', '24'],
'billing_class': 'professional'}
]}]}]}
The end goal is to have a data frame, or a dictionary, that I can then write to a CSV. I want each row to have these columns:
{'reporting_entity_name':'','reporting_entity_type':'','last_updated_on':'','version':'','negotiation_arrangement':'','name':'','billing_code_type':'','billing_code_type_version':'','billing_code':'','description':'','provider_groups':'','negotiated_type':'','negotiated_rate':'','expiration_date':'','service_code':'','billing_class':''}
So far I have tried pandas' json_normalize, flatten, and a few custom modules I found on GitHub, but none of them seem to normalize/flatten the data into new rows, only into new columns. Because the dataset is so huge, I'm worried that doing this recursively in a bunch of nested loops will quickly eat up all my memory. Thanks in advance for any advice you can offer!
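For reference, here is roughly what the pandas attempt looked like (sample_data is just a hypothetical name for the dict above):

import pandas as pd

# json_normalize flattens nested dicts into dotted columns, but the
# nested lists ("in_network" and everything below it) stay as single
# cells instead of becoming new rows
df = pd.json_normalize(sample_data)
print(list(df.columns))
# ['reporting_entity_name', 'reporting_entity_type',
#  'last_updated_on', 'version', 'in_network']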
CodePudding user response:
First of all, I must confess, I'm the author of the library I'm going to use for this -- convtools (github).
Presuming your data is a list of samples like the one you provided:
input_data = [
    {
        "reporting_entity_name": "blue cross and blue shield of alabama",
        "reporting_entity_type": "health insurance issuer",
        "last_updated_on": "2022-06-10",
        "version": "1.1.0",
        "in_network": [
            {
                "negotiation_arrangement": "ffs",
                "name": "xploration of kidney",
                "billing_code_type": "cpt",
                "billing_code_type_version": "2022",
                "billing_code": "50010",
                "description": "renal exploration, not necessitating other specific procedures",
                "negotiated_rates": [
                    {
                        "negotiated_prices": [
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 993.0,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 1180.68,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                        ]
                    }
                ],
            }
        ],
    }
]
I'm defining a config of sorts that you can tweak to add/remove the fields to be retrieved at each nesting level (root, in_network item, negotiated_rates item, negotiated_prices item). The build_conversion function below walks that config recursively to build a converter, which convtools then compiles into an ad hoc Python function that runs without recursion.
from convtools.contrib.tables import Table
from convtools import conversion as c

# one dict per nesting level: which scalar fields to keep and
# which child list to descend into
flattening_levels = [
    {
        "fields": ["reporting_entity_name", "reporting_entity_type"],
        "list_field": "in_network",
    },
    {
        "fields": ["negotiation_arrangement", "name"],
        "list_field": "negotiated_rates",
    },
    {"list_field": "negotiated_prices"},
    {
        "fields": ["negotiated_type", "negotiated_rate"],
    },
]
def build_conversion(levels, index=0):
    level_config = levels[index]
    # basic step is to iterate an input
    return c.iter(
        c.zip(
            # zip nested list
            (
                c.item(level_config["list_field"]).pipe(
                    build_conversion(levels, index + 1)
                )
                if "list_field" in level_config
                else c.naive([{}])
            ),
            # with repeated fields of the current level
            c.repeat(
                {
                    field: c.item(field)
                    for field in level_config.get("fields", ())
                }
            ),
        )
        # update inner objects with fields from the current level
        .iter_mut(c.Mut.custom(c.item(0).call_method("update", c.item(1))))
        # take inner objects, forgetting about the top level
        .iter(c.item(0))
    ).flatten()
flatten_objects = build_conversion(flattening_levels).gen_converter()
Table.from_rows(flatten_objects(input_data)).into_csv("out.csv")
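If you want to verify that the compiled function really runs without recursion, gen_converter should accept a debug flag that prints the generated source:

flatten_objects = build_conversion(flattening_levels).gen_converter(debug=True)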
out.csv looks like:
negotiated_type,negotiated_rate,negotiation_arrangement,name,reporting_entity_name,reporting_entity_type
negotiated,993.0,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
negotiated,1180.68,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
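The CSV above only contains the fields I listed in the config. To get every column from your desired layout, extend the per-level "fields" lists, e.g. (a sketch covering the fields present in your sample):

flattening_levels = [
    {
        "fields": [
            "reporting_entity_name",
            "reporting_entity_type",
            "last_updated_on",
            "version",
        ],
        "list_field": "in_network",
    },
    {
        "fields": [
            "negotiation_arrangement",
            "name",
            "billing_code_type",
            "billing_code_type_version",
            "billing_code",
            "description",
        ],
        "list_field": "negotiated_rates",
    },
    {"list_field": "negotiated_prices"},
    {
        "fields": [
            "negotiated_type",
            "negotiated_rate",
            "expiration_date",
            "service_code",
            "billing_class",
        ]
    },
]

I left provider_groups out because your sample doesn't contain it (c.item compiles to a plain dict lookup, so a missing key raises a KeyError); once it's present in the data, add it to the fields of the corresponding level. Also note that service_code is a list, so you may want to join it into a single string before writing the CSV.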
Let's profile:
import tqdm
many_objects = input_data * 1000000
"""
In [151]: %time Table.from_rows(tqdm.tqdm(flatten_objects(many_objects))).into_csv("out.csv")
2000000it [00:07, 274094.52it/s]
CPU times: user 7.13 s, sys: 141 ms, total: 7.27 s
Wall time: 7.3 s
"""
# resulting file is ~210MB
This solution can be optimized further so that dicts aren't updated multiple times; please let me know if that's needed.
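As for the memory concern: at 12GB you won't be able to json.load the whole file, but you can stream it. Below is a minimal sketch, assuming the real file has the same shape as your sample, the top-level scalar fields appear before in_network (as they do in your sample), "data.json" is a placeholder path, and the ijson streaming parser (pip install ijson) is acceptable:

import ijson

# reuse flattening_levels and build_conversion from above, but start
# at level 1, since in_network items will be streamed in directly
flatten_in_network = build_conversion(flattening_levels, 1).gen_converter()

def stream_rows(path):
    # cheap first pass: collect the small top-level scalar fields
    # (this also picks up last_updated_on and version), stopping as
    # soon as the huge "in_network" array starts
    top_level = {}
    with open(path, "rb") as f:
        for prefix, event, value in ijson.parse(f):
            if prefix == "in_network":
                break
            if event in ("string", "number", "boolean"):
                top_level[prefix] = value

    # second pass: stream in_network items one at a time, so the
    # huge array is never held in memory at once
    with open(path, "rb") as f:
        for item in ijson.items(f, "in_network.item"):
            for row in flatten_in_network([item]):
                yield {**top_level, **row}

Table.from_rows(stream_rows("data.json")).into_csv("out.csv")

Two passes may sound wasteful, but the first one stops as soon as it reaches the in_network key, so it only reads a handful of bytes; the second pass does the real work.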