How to use `select` within a jq --stream command?

I have a very large JSON document (~100 GB) from which I am trying to extract, using jq, the objects that meet a given criterion. Because it is so large, I can't read it into memory and will need to use the --stream option.

I know how to use select to extract what I need when I'm not streaming, but I could use some help writing the streaming version of the command.

Here's a sample of my document, named example.json:

{
  "reporting_entity_name" : "INSURANCE COMPANY",
  "reporting_entity_type" : "INSURER",
  "last_updated_on" : "2022-12-01",
  "version" : "1.0.0",
  "in_network" : [ {
    "negotiation_arrangement" : "ffs",
    "name" : "ER VISIT",
    "billing_code_type" : "CPT",
    "billing_code_type_version" : "2022",
    "billing_code" : "99285",
    "description" : "HIGHEST LEVEL ER VISIT",
    "negotiated_rates" : [ {
      "provider_groups" : [ {
        "npi" : [ 111111111, 222222222],
        "tin" : {
          "type" : "ein",
          "value" : "99-9999999"
        }
      } ],
      "negotiated_prices" : [ {
        "negotiated_type" : "negotiated",
        "negotiated_rate" : 550.50,
        "expiration_date" : "9999-12-31",
        "service_code" : [ "23" ],
        "billing_class" : "institutional"
      } ]
    } ]
  }
]
}

I am trying to grab each object in the in_network array whose billing_code is equal to "99285".

If I were able to do this without streaming, here's how I would approach it:

jq '.in_network[] | select(.billing_code == "99285")' example.json

Expected output:

{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}

Any help on how I could configure this with the --stream option would be greatly appreciated!

CodePudding user response:

If each individual object in the .in_network array fits into memory, truncate the stream at the array items (two levels deep):

jq --stream -n '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
  | select(.billing_code == "99285")
' example.json

{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}
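
To see why the filter tests .[0][0] and truncates two levels deep, it can help to look at the raw events that --stream emits. The following is only a sketch on a hypothetical miniature of example.json (the second billing code, "11111", is made up for illustration), and it assumes a reasonably recent jq (1.6 or later) is on the PATH:

```shell
# Hypothetical miniature of example.json; "11111" is an invented second item.
doc='{"version":"1.0.0","in_network":[{"billing_code":"99285"},{"billing_code":"11111"}]}'

# 1. Dump the raw streaming events, one per line.
events=$(printf '%s\n' "$doc" | jq -c --stream .)
printf '%s\n' "$events"
# [["version"],"1.0.0"]
# [["in_network",0,"billing_code"],"99285"]
# [["in_network",0,"billing_code"]]
# [["in_network",1,"billing_code"],"11111"]
# [["in_network",1,"billing_code"]]
# [["in_network",1]]
# [["in_network"]]

# 2. Keep only events under "in_network", drop the first two path
#    components ("in_network" and the array index), rebuild each item,
#    then apply the usual select.
match=$(printf '%s\n' "$doc" | jq -c --stream -n '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
  | select(.billing_code == "99285")
')
printf '%s\n' "$match"
# {"billing_code":"99285"}
```

Each event is a [path, leaf] pair, or a bare [path] when an array or object closes. Every item of in_network lives under a path that starts with "in_network" followed by an index, so stripping those two components lets fromstream rebuild one item at a time instead of the whole document.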

CodePudding user response:

You will find jq --stream excruciatingly slow. Since jq is intended to complement other shell tools, I would recommend using jstream (https://github.com/bcicen/jstream), or my own jm or jm.py (https://github.com/pkoppstein/jm), to "splat" the array, and pipe the result to jq.

E.g. to achieve the same effect as your jq filter:

jm --pointer /in_network example.json |
  jq 'select(.billing_code == "99285")'
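
The second stage of that pipeline is ordinary, non-streaming jq: it receives a sequence of separate JSON values and runs the filter on each one in turn, so only one array item is held in memory at a time. As a rough sketch, assuming the splatting tool writes one array item per line, the stage can be tried on its own with a hand-written stand-in for that output (the two records below are invented and are not real jm output):

```shell
# Stand-in for the splatted stream: one in_network item per line.
# Both records are hypothetical, shortened versions of the real objects.
splat='{"billing_code":"99285","name":"ER VISIT"}
{"billing_code":"11111","name":"OTHER"}'

# jq reads the values one by one and keeps only the matching item.
hits=$(printf '%s\n' "$splat" | jq -c 'select(.billing_code == "99285")')
printf '%s\n' "$hits"
# {"billing_code":"99285","name":"ER VISIT"}
```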