Unnest a huge JSON array into individual JSON objects

Time:08-09

I have a single JSON object like the one below:

{
    "a": [
        {
            "item1": "item1_value",
            "item2": "item2_value"
        },
        {
            "item1": "item1_value",
            "item2": "item2_value"
        },
        {
            ....
        },
        
        100 million more objects
    ]
}

I'm trying to turn each element of the array into a separate JSON object, as below:

{ "a": { "item1": "item1_value", "item2": "item2_value" } }
{ "a": { "item1": "item1_value", "item2": "item2_value" } }

The raw file has millions of nested objects in a single JSON array, which I want to split into individual JSON objects.

CodePudding user response:

To process a huge file, possibly larger than available memory, you can break it down into pieces using the --stream option. The resulting stream of events can then be read sequentially using inputs in combination with the --null-input (or -n) flag. To achieve the overall effect of

jq '.a[]' file.json

you need to truncate the streamed events by stripping off the first two levels of their structure information (essentially their location path: the outer object's a field, and the contained array's indices []). fromstream then reconstructs each entity once it has been read in completely.

jq --stream -n 'fromstream(2 | truncate_stream(inputs))' file.json
{
  "item1": "item1_value",
  "item2": "item2_value"
}
{
  "item1": "item1_value",
  "item2": "item2_value"
}
:
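
To see what the truncation operates on, it can help to look at the raw events on a tiny input. The following is only a sketch: the sample variable and its values are made-up stand-ins for the huge file, and it assumes a jq version with streaming support (1.5 or later).

```shell
# Tiny stand-in for the huge file (hypothetical sample data).
sample='{"a":[{"item1":"v1","item2":"v2"}]}'

# Raw events: each line is either [path, leaf_value] or a closing [path].
raw=$(printf '%s' "$sample" | jq -c --stream '.')
printf '%s\n' "$raw"
# [["a",0,"item1"],"v1"]
# [["a",0,"item2"],"v2"]
# [["a",0,"item2"]]
# [["a",0]]
# [["a"]]

# The same events with the first two path components ("a" and the index)
# stripped; events whose path has only two or fewer components are dropped.
truncated=$(printf '%s' "$sample" | jq -c --stream -n '2 | truncate_stream(inputs)')
printf '%s\n' "$truncated"
# [["item1"],"v1"]
# [["item2"],"v2"]
# [["item2"]]
```

The last truncated event has a path of length one and no value, which is what tells fromstream that a top-level entity is complete and can be emitted.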

To create your final structure, wrap the output of fromstream in a new object, and use the --compact-output (or -c) option to put each object on its own line:

jq --stream -nc '{a: fromstream(2 | truncate_stream(inputs))}' file.json
{"a":{"item1":"item1_value","item2":"item2_value"}}
{"a":{"item1":"item1_value","item2":"item2_value"}}
:
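
Before running this against the real file, one way to gain confidence is to compare the streaming command with the plain in-memory filter on a small sample. This is a sketch under assumptions: the sample data is invented, and '.a[] | {a: .}' is used here as the non-streaming equivalent.

```shell
# Small hypothetical sample with two array elements.
sample='{"a":[{"item1":"x1","item2":"y1"},{"item1":"x2","item2":"y2"}]}'

# Streaming variant, as in the command above.
streamed=$(printf '%s' "$sample" | jq --stream -nc '{a: fromstream(2 | truncate_stream(inputs))}')

# Non-streaming variant; only practical when the input fits into memory.
in_memory=$(printf '%s' "$sample" | jq -c '.a[] | {a: .}')

printf '%s\n' "$streamed"
# {"a":{"item1":"x1","item2":"y1"}}
# {"a":{"item1":"x2","item2":"y2"}}

if [ "$streamed" = "$in_memory" ]; then
  echo "outputs match"
else
  echo "outputs differ"
fi
```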

If you also want the top-level field name (here a) to be read in and re-created dynamically, you will have to construct your own stream truncation:

jq --stream -nc '
  fromstream(inputs | if first | has(2) then
    setpath([0]; first | del(.[1])),
    if has(1) then empty else map(.[:1]) end
  else empty end)
' file.json
{"a":{"item1":"item1_value","item2":"item2_value"}}
{"a":{"item1":"item1_value","item2":"item2_value"}}
:
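
To illustrate that the field name really is picked up from the input, the same filter can be tried on a sample with a different top-level key. This is a sketch only: the key records, the id fields, and the shell variable names are invented for the example, and the array elements are flat objects like those in the question.

```shell
# The custom truncation from above, kept in a variable for reuse.
filter='
  fromstream(inputs | if first | has(2) then
    setpath([0]; first | del(.[1])),
    if has(1) then empty else map(.[:1]) end
  else empty end)
'

# Hypothetical sample whose top-level field is "records" instead of "a".
sample='{"records":[{"id":1},{"id":2}]}'

dynamic=$(printf '%s' "$sample" | jq --stream -nc "$filter")
printf '%s\n' "$dynamic"
# {"records":{"id":1}}
# {"records":{"id":2}}
```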