I have JSON files of arbitrary size, some small, some enormous (>40GB). I'm trying to use jq to stream objects from an object_array to their own files for later processing. object_array has N objects. The JSON has the following structure:
{
  "top_level_name": "Some Name Here",
  "top_level_type": "Some Type",
  "last_updated_on": "2022-07-09",
  "version": "1.0.0",
  "object_array": [
    {
      "arrangement": "abcd",
      "name": "another name",
      "type": "efgh",
      "type_ver": "2021",
      "code": "12345",
      "desc": "some description",
      "rate": [
        {
          "groups": [
            {
              "IDs": [
                "123654890",
                "012365485"
              ],
              "id_type": {
                "type": "xyz",
                "value": "8527419630"
              }
            }
          ],
          "prices": [
            {
              "price_type": "priceType",
              "rate": "00.00",
              "date": "2023-01-01",
              "svc_code": [
                "89"
              ],
              "class": "some-class",
              "modifier": [
                "78"
              ],
              "additional_information": "null"
            }
          ]
        }
      ]
    }
  ]
}
The current path I'm on has me trying

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' test.json | awk '{print > "onif00" NR ".json"}'

but my results vary from a single file with all the data to a ton of files, each holding some fragment of it, e.g. one file containing [, followed by another file containing {, and so on.
Specifically, I'd like to capture each object like the one below, from object_array, and place it in its own file:
{
  "arrangement": "abcd",
  "name": "another name",
  "type": "efgh",
  "type_ver": "2021",
  "code": "12345",
  "desc": "some description",
  "rate": [{
    "groups": [{
      "IDs": [
        "123654890",
        "012365485"
      ],
      "id_type": {
        "type": "xyz",
        "value": "8527419630"
      }
    }],
    "prices": [{
      "price_type": "priceType",
      "rate": "00.00",
      "date": "2023-01-01",
      "svc_code": [
        "89"
      ],
      "class": "some-class",
      "modifier": [
        "78"
      ],
      "additional_information": "null"
    }]
  }]
}
CodePudding user response:
As in the Q, I'd avoid calling jq more than once. I also like using awk, so if, for example, you want .name as part of the file name, I'd go with something like:
< my.json jq -cnr --stream '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "object_array")))
  | .name, .
' | awk 'fn=="" {fn=$0; next} {out="onif00_" fn ".json"; print > out; close(out); fn=""}'
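The jq filter prints two lines per array element: the raw .name, then the compact object. The awk script pairs them up, using the first line as the file name ($0, so names containing spaces survive intact) and writing the second line into that file. For the sample document, the intermediate output between jq and awk would look roughly like this (object abbreviated):

another name
{"arrangement":"abcd","name":"another name","type":"efgh",...}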
This has also been tested with gojq, the Go implementation of jq.
See @Jeff_Mercado's comment elsewhere on this page re adapting this for versions of jq which support --stream but for which fromstream is buggy.
CodePudding user response:
In jq, select from the stream those parts that belong to the object_array, then truncate by 2 (the field name and the array index).
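To see why the truncation depth is 2, here is a sketch (abbreviated) of the value events that --stream emits for the sample document's object_array:

[["object_array",0,"arrangement"],"abcd"]
[["object_array",0,"name"],"another name"]
...

2|truncate_stream strips the first two path elements ("object_array" and the index 0), leaving

[["arrangement"],"abcd"]
[["name"],"another name"]
...

which fromstream can then reassemble into one standalone object per array index.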
In the shell, read the lines provided by jq -c and redirect each one to a file. With another jq call you can extract a file name (here I took the .name value and appended .json, resulting in a file called another name.json):
# Example for bash using here-strings
< enormous.json jq -cn --stream '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "object_array")))
' | while IFS= read -r json; do
  >"$(jq -r '.name' <<< "$json").json" cat <<< "$json"
done
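If spawning a second jq per object is too slow for very large inputs, and numbered files (as in the question's awk attempt) are acceptable, a minimal sketch using a shell counter instead; it assumes bash and reuses the onif00 prefix from the question:

n=0
< enormous.json jq -cn --stream '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "object_array")))
' | while IFS= read -r json; do
  n=$((n + 1))                              # one sequence number per object
  printf '%s\n' "$json" > "onif00${n}.json"
done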