I have a very large JSON file that I would like to stream (using --stream) and filter with jq, then save the result as a CSV.
This is the sample data with two objects:
[{"_id":"1","time":"2021-07-22","body":["text1"],"region":[{"percentage":"0.1","region":"region1"},{“percentage":"0.9","region":"region2"}],"reach":{"lower_bound":"100","upper_bound":"200"},"languages":["de"]},
{"_id":"2","time":"2021-07-23","body":["text2"],"region":[{"percentage":"0.3","region":"region1"},{“percentage":"0.7","region":"region2"}],"reach":{"lower_bound":"10","upper_bound":"20"},"languages":["en"]}]
I want to filter on the "languages" field in the jq stream so that I only retain objects where languages == ["de"], then save the result as a new CSV file titled largefile.csv that looks like the following:
_id,time,body,percentage_region1,percentage_region2,reach_lower_bound,reach_upper_bound,languages
"1","2021-07-22","text1","0.1","0.9","100","200","de"
I have the following code so far but it doesn’t seem to work:
cat largefile.json -r | jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.))) | with_entries(select(.value.languages==["de"])) | @csv
Any help would be much appreciated!
CodePudding user response:
If you happen to know that all items in your input array are structurally identical (i.e. no additional or missing fields), you can simply cut after a fixed number of streamed chunks (17 in the case of the sample input):
jq --stream -nr '
  label $eof
  | while(1; fromstream(1 | truncate_stream(range(17) | input? // break $eof)))  # rebuild one object per 17 stream events until input runs out
  | select(.languages == ["de"])   # keep only the "de" objects
  | [.. | scalars]                 # collect every leaf value in document order
  | del(.[4,6])                    # drop the two region name strings
  | @csv
' largefile.json
"1","2021-07-22","text1","0.1","0.9","100","200","de"
CodePudding user response:
There are several separate tasks involved here, and some are underspecified, but hopefully the following will help you through the thicket:
jq -rn --stream '
  fromstream(1 | truncate_stream(inputs))   # rebuild each top-level array element
  | select( .languages == ["de"] )          # keep only the "de" objects
  | [._id, .time, .body[0], .region[].percentage,
     .reach.lower_bound, .reach.upper_bound, .languages[0]]
  | @csv
' largefile.json
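Neither filter prints the header row from the question, and neither writes the file itself. A minimal sketch for producing the requested largefile.csv (assuming the column names given in the question) is to prepend the header in the shell and redirect the combined output:
{
  # header row requested in the question
  echo '_id,time,body,percentage_region1,percentage_region2,reach_lower_bound,reach_upper_bound,languages'
  # data rows from the filter above
  jq -rn --stream '
    fromstream(1 | truncate_stream(inputs))
    | select(.languages == ["de"])
    | [._id, .time, .body[0], .region[].percentage,
       .reach.lower_bound, .reach.upper_bound, .languages[0]]
    | @csv
  ' largefile.json
} > largefile.csv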