I have a very large JSON document (~100 GB) and am trying to use jq to pull out specific objects that meet given criteria. Because the file is so large, I can't read it into memory and will need to use the --stream option.
I understand how to run a select to extract what I need when I'm not streaming, but could use some help building the equivalent streaming command.
Here's a sample of my document, named example.json:
{
  "reporting_entity_name" : "INSURANCE COMPANY",
  "reporting_entity_type" : "INSURER",
  "last_updated_on" : "2022-12-01",
  "version" : "1.0.0",
  "in_network" : [ {
    "negotiation_arrangement" : "ffs",
    "name" : "ER VISIT",
    "billing_code_type" : "CPT",
    "billing_code_type_version" : "2022",
    "billing_code" : "99285",
    "description" : "HIGHEST LEVEL ER VISIT",
    "negotiated_rates" : [ {
      "provider_groups" : [ {
        "npi" : [ 111111111, 222222222],
        "tin" : {
          "type" : "ein",
          "value" : "99-9999999"
        }
      } ],
      "negotiated_prices" : [ {
        "negotiated_type" : "negotiated",
        "negotiated_rate" : 550.50,
        "expiration_date" : "9999-12-31",
        "service_code" : [ "23" ],
        "billing_class" : "institutional"
      } ]
    } ]
  }
  ]
}
I am trying to grab the in_network object whose billing_code is equal to "99285".
If I were able to do this without streaming, here's how I would approach it:
jq '.in_network[] | select(.billing_code == "99285")' example.json
Expected output:
{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}
Any help on how I could configure this with the --stream option would be greatly appreciated!
CodePudding user response:
If the individual objects in the .in_network array do fit into memory, truncate the stream at the array items (two levels deep):
jq --stream -n '
fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
| select(.billing_code == "99285")
' example.json
{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}
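To see why the filter tests .[0][0] and truncates at depth 2, it can help to look at the events that --stream emits. Below is a minimal sketch on a tiny stand-in document. The sample variable and the find_by_code helper are made up for illustration and are not part of jq, and the sketch assumes jq 1.5 or later is on the PATH.

```shell
# Hypothetical stand-in for example.json, trimmed to the fields that matter here.
sample='{"version":"1.0.0","in_network":[{"billing_code":"99285","name":"ER VISIT"},{"billing_code":"99284","name":"OTHER"}]}'

# 1. Raw events: [path, leaf] pairs plus closing [path] events.
#    Paths under the array look like ["in_network", <index>, <key>], so
#    .[0][0] is the top-level key and the first two path elements are
#    what truncating at depth 2 strips off.
printf '%s\n' "$sample" | jq -c --stream .

# 2. The same filter as above, with the billing code passed in via --arg
#    (jq binds --arg values as strings, which matches the string field).
find_by_code() {
  jq -c --stream -n --arg code "$1" '
    fromstream(2 | truncate_stream(inputs | select(.[0][0] == "in_network")))
    | select(.billing_code == $code)
  '
}

printf '%s\n' "$sample" | find_by_code 99285
# → {"billing_code":"99285","name":"ER VISIT"}
```

Each array item is rebuilt on its own by fromstream and discarded if it does not match, so only one item at a time needs to fit in memory.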
CodePudding user response:
You will find jq --stream excruciatingly slow. Since jq is intended to complement other shell tools, I would recommend using jstream (https://github.com/bcicen/jstream), or my own jm or jm.py (https://github.com/pkoppstein/jm), to "splat" the array, and pipe the result to jq.
E.g. to achieve the same effect as your jq filter:
jm --pointer /in_network example.json |
jq 'select(.billing_code == "99285")'