Parse enormous json file, output objects to their own files


I have json files of arbitrary size, some small, some enormous (>40GB). I'm trying to use jq to stream objects from an object_array to their own files for later processing.

object_array has N objects.

The json has the following structure:

{
    "top_level_name": "Some Name Here",
    "top_level_type": "Some Type",
    "last_updated_on": "2022-07-09",
    "version": "1.0.0",
    "object_array": [
        {
            "arrangement": "abcd",
            "name": "another name",
            "type": "efgh",
            "type_ver": "2021",
            "code": "12345",
            "desc": "some description",
            "rate": [
                {
                    "groups": [
                        {
                            "IDs": [
                                "123654890",
                                "012365485"
                            ],
                            "id_type": {
                                "type": "xyz",
                                "value": "8527419630"
                            }
                        }
                    ],
                    "prices": [
                        {
                            "price_type": "priceType",
                            "rate": "00.00",
                            "date": "2023-01-01",
                            "svc_code": [
                                "89"
                            ],
                            "class": "some-class",
                            "modifier": [
                                "78"
                            ],
                            "additional_information": "null"
                        }
                    ]
                }
            ]
        }
    ]
}

The current path I'm on has me trying

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' test.json | awk '{print > "onif00" NR ".json"}'

but my results vary: sometimes I get a single file with all the data, sometimes a ton of files each containing some fragment of it, e.g. one file holding [ followed by another file holding { and so on.

Specifically, I'd like to capture each object like the one below, from object_array, and place it in its own file

{
    "arrangement": "abcd",
    "name": "another name",
    "type": "efgh",
    "type_ver": "2021",
    "code": "12345",
    "desc": "some description",
    "rate": [{
        "groups": [{
            "IDs": [
                "123654890",
                "012365485"
            ],
            "id_type": {
                "type": "xyz",
                "value": "8527419630"
            }
        }],
        "prices": [{
            "price_type": "priceType",
            "rate": "00.00",
            "date": "2023-01-01",
            "svc_code": [
                "89"
            ],
            "class": "some-class",
            "modifier": [
                "78"
            ],
            "additional_information": "null"
        }]
    }]
}

CodePudding user response:

As in the question, I'd avoid calling jq more than once. I also like using awk, so if for example you want .name as part of the file name, I'd go with something like:

< my.json jq -cnr --stream '
    fromstream(2|truncate_stream(inputs | select(.[0][0] == "object_array")) )
    | .name, .
' | awk 'fn=="" {fn=$0; next} {print > ("onif00_" fn ".json"); fn=""; }'

This has also been tested using gojq, the Go implementation of jq.

See @Jeff_Mercado's comment elsewhere on this page re adapting this for versions of jq which support --stream but for which fromstream is buggy.
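To see why the truncation depth is 2 here, it helps to look at the [path, leaf] event pairs that jq --stream emits. The following is a Python sketch (not jq itself) that mimics what truncate_stream does to a few hand-written events mirroring the sample document; the event list is illustrative, not exhaustive, and real jq output also includes closing events of the form [path].

```python
# Hypothetical [path, leaf] events, hand-written to mirror a few leaves of
# the sample document; jq --stream would emit many more.
events = [
    [["top_level_name"], "Some Name Here"],
    [["object_array", 0, "arrangement"], "abcd"],
    [["object_array", 0, "name"], "another name"],
    [["object_array", 0, "rate", 0, "prices", 0, "rate"], "00.00"],
]

def truncate_stream(depth, events):
    """Mimic jq's truncate_stream: chop the first `depth` path components
    and drop events whose path isn't deeper than `depth`."""
    for event in events:
        path = event[0]
        if len(path) > depth:
            yield [path[depth:]] + event[1:]

# Keep only the object_array branch, then truncate away "object_array" plus
# the array index -- the effect of `2|truncate_stream(... select ...)`:
inner = [e for e in events if e[0][0] == "object_array"]
truncated = list(truncate_stream(2, inner))
# The truncated paths are now relative to each inner object, which is why
# fromstream can reassemble each one as a standalone value.
```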

CodePudding user response:

In jq, select from the stream those parts that belong to the object_array, then truncate the paths by 2 (the field name and the array index). In the shell, read the lines produced by jq -c and redirect each one to a file. With another jq call you can extract a file name (here I took the .name value and appended .json, resulting in a file called "another name.json"):

# Example for bash using herestrings

< enormous.json jq -cn --stream '
  fromstream(2|truncate_stream(inputs| select(.[0][0] == "object_array")))
' | while read -r json; do
      cat <<< "$json" > "$(jq -r '.name' <<< "$json").json"
    done
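If jq isn't available, the same split can be sketched in pure Python using the standard library's incremental json.JSONDecoder.raw_decode, reading the file in chunks so a 40GB document never has to fit in memory. This is an assumption-laden sketch, not a hardened parser: it naively scans for the literal "object_array" key, which would misfire if that exact text appeared earlier inside a string value.

```python
import json

def _read_until(fp, buf, token, chunk_size):
    # Grow the buffer chunk by chunk until `token` appears in it.
    while token not in buf:
        more = fp.read(chunk_size)
        if not more:
            raise ValueError(f"{token!r} not found")
        buf += more
    return buf

def split_object_array(fp, chunk_size=1 << 16):
    """Yield each element of the top-level "object_array" as a Python value,
    holding only a small buffer in memory at a time."""
    dec = json.JSONDecoder()
    # Naive scan for the array: find the key, then its opening bracket.
    buf = _read_until(fp, "", '"object_array"', chunk_size)
    buf = buf[buf.index('"object_array"') + len('"object_array"'):]
    buf = _read_until(fp, buf, '[', chunk_size)
    buf = buf[buf.index('[') + 1:]
    while True:
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith(']'):
            return  # end of object_array
        try:
            obj, end = dec.raw_decode(buf)
            yield obj
            buf = buf[end:]
        except json.JSONDecodeError:
            # The next object is split across chunks; read more and retry.
            more = fp.read(chunk_size)
            if not more:
                raise
            buf += more
```

A hypothetical usage, writing files named like the question's onif00N.json:

```python
with open("test.json") as fp:
    for i, obj in enumerate(split_object_array(fp)):
        with open(f"onif00{i}.json", "w") as out:
            json.dump(obj, out)
```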