I have a single JSON object as below:
{
"a": [
{
"item1": "item1_value",
"item2": "item2_value"
},
{
"item1": "item1_value",
"item2": "item2_value"
},
{
....
},
100 million more objects
]
}
I'm trying to turn each element of the array into a separate JSON object, as below:
{ "a": { "item1": "item1_value", "item2": "item2_value" } }
{ "a": { "item1": "item1_value", "item2": "item2_value" } }
The raw files have millions of nested objects in a single JSON array, which I want to split into multiple individual JSON objects.
CodePudding user response:
To process a huge file, possibly larger than what fits into memory, you can break it down into pieces using the --stream option. This stream can then be read sequentially using inputs in combination with the --null-input (or -n) flag. To achieve the overall effect of
jq '.a[]' file.json
you need to truncate the streamed parts by stripping off the first two levels of their structure information (essentially their location path: the outer object's a field, and the contained array's indices []). Using fromstream will then reconstruct each entity once it has been read in completely.
jq --stream -n 'fromstream(2 | truncate_stream(inputs))' file.json
{
"item1": "item1_value",
"item2": "item2_value"
}
{
"item1": "item1_value",
"item2": "item2_value"
}
:
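To see what the two-level truncation operates on, it can help to print the raw stream events for a tiny input. The following is a minimal sketch, assuming jq 1.6 or later; the sample document and the variable names (sample, raw_events, truncated_events) are made up for illustration.

```shell
# Hypothetical miniature of the real input: one array element under "a".
sample='{"a":[{"item1":"x","item2":"y"}]}'

# Raw stream events: [path, leaf_value] for each leaf, and [path] alone
# whenever an object or array closes.
raw_events=$(printf '%s' "$sample" | jq -c --stream .)
printf '%s\n' "$raw_events"

# The same events with the first two path components removed
# (the "a" key and the array index); events with shorter paths are dropped.
truncated_events=$(printf '%s' "$sample" | jq -c --stream -n '2 | truncate_stream(inputs)')
printf '%s\n' "$truncated_events"
```

The first command prints five events, with paths such as ["a",0,"item1"]. After truncation only the three events belonging to the element itself remain, with paths like ["item1"], which is exactly what fromstream needs to rebuild one element at a time.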
To create your final structure, re-create the resulting object with the output of fromstream, and use the --compact-output (or -c) option to have each object on its own line:
jq --stream -nc '{a: fromstream(2 | truncate_stream(inputs))}' file.json
{"a":{"item1":"item1_value","item2":"item2_value"}}
{"a":{"item1":"item1_value","item2":"item2_value"}}
:
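If "multiple individual JSON" is meant as separate files rather than separate lines, one possible follow-up is to pipe the compact output into the standard split utility. This is only a sketch under that assumption: the sample data, the temporary directory, the chunk_ prefix and the chunk size of 2 lines are all hypothetical and would need adjusting for a real file.

```shell
# Hypothetical three-element sample standing in for the huge file.
sample='{"a":[{"item1":"v1"},{"item1":"v2"},{"item1":"v3"}]}'
outdir=$(mktemp -d)

# Emit one compact {"a": ...} object per line, then let split(1) cut the
# line stream into files of at most 2 lines each (chunk_aa, chunk_ab, ...).
printf '%s' "$sample" \
  | jq --stream -nc '{a: fromstream(2 | truncate_stream(inputs))}' \
  | split -l 2 - "$outdir/chunk_"

ls "$outdir"
```

Because jq emits each object as soon as it is complete and split only counts lines, neither step should need to hold the whole input in memory.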
If you also want the top-level field name (here a) to be read in and re-created dynamically, you will have to construct your own stream truncation:
jq --stream -nc '
fromstream(inputs | if first | has(2) then
setpath([0]; first | del(.[1])),
if has(1) then empty else map(.[:1]) end
else empty end)
' file.json
{"a":{"item1":"item1_value","item2":"item2_value"}}
{"a":{"item1":"item1_value","item2":"item2_value"}}
:
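A quick way to check that the field name really is picked up dynamically is to run the same filter on a small input that uses a different key. The sketch below assumes jq 1.6 or later; the key rows, the field id and the variable names are invented for the test.

```shell
# Hypothetical sample whose top-level key is "rows" instead of "a".
sample='{"rows":[{"id":1},{"id":2}]}'

# Same filter as above: keep only events at least three levels deep, drop the
# array index from each path, and after every closing event add a synthetic
# top-level closing event so that fromstream emits the finished object.
result=$(printf '%s' "$sample" | jq --stream -nc '
  fromstream(inputs | if first | has(2) then
    setpath([0]; first | del(.[1])),
    if has(1) then empty else map(.[:1]) end
  else empty end)
')
printf '%s\n' "$result"
```

Each output line should carry the original key, here rows, wrapped around a single array element.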