Using jq to stream and filter a large JSON file and save the output as CSV

I have a very large json file I would like to stream (using --stream) and filter with jq, then save it as a csv.

This is the sample data with two objects:

[{"_id":"1","time":"2021-07-22","body":["text1"],"region":[{"percentage":"0.1","region":"region1"},{"percentage":"0.9","region":"region2"}],"reach":{"lower_bound":"100","upper_bound":"200"},"languages":["de"]},
{"_id":"2","time":"2021-07-23","body":["text2"],"region":[{"percentage":"0.3","region":"region1"},{"percentage":"0.7","region":"region2"}],"reach":{"lower_bound":"10","upper_bound":"20"},"languages":["en"]}]

I want to filter on the "languages" field in the jq stream so that I only retain objects where languages == ["de"], then save the result as a new CSV file named largefile.csv that looks like the following:

_id,time,body,percentage_region1,percentage_region2,reach_lower_bound,reach_upper_bound,languages
"1","2021-07-22","text1","0.1","0.9","100","200","de"

I have the following code so far but it doesn’t seem to work:

cat largefile.json -r | jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.))) | with_entries(select(.value.languages==["de"])) | @csv

Any help would be much appreciated!

CodePudding user response:

If you happen to know that all items in your input array are structurally identical (i.e. no item has additional or missing fields or array elements), you could simply cut the stream after a fixed number of events per item (17 in the case of the sample input):

jq --stream -nr '
  label $eof
  | while(1; fromstream(1 | truncate_stream(range(17) | input? // break $eof)))
  | select(.languages == ["de"]) | [..|scalars] | del(.[4,6]) | @csv
' large.json

"1","2021-07-22","text1","0.1","0.9","100","200","de"
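
The figure of 17 can be checked on a small sample before running against the full file. The following is only a sketch: sample.json is a hypothetical scratch file holding the two objects from the question, and the filter counts the stream events whose path starts at array index 0, i.e. those belonging to the first item.

```shell
# Write the two-object sample from the question to a scratch file
# (sample.json is a hypothetical name, not part of the original post).
cat > sample.json <<'EOF'
[{"_id":"1","time":"2021-07-22","body":["text1"],"region":[{"percentage":"0.1","region":"region1"},{"percentage":"0.9","region":"region2"}],"reach":{"lower_bound":"100","upper_bound":"200"},"languages":["de"]},
{"_id":"2","time":"2021-07-23","body":["text2"],"region":[{"percentage":"0.3","region":"region1"},{"percentage":"0.7","region":"region2"}],"reach":{"lower_bound":"10","upper_bound":"20"},"languages":["en"]}]
EOF

# Each stream event is [path, value] or a closing [path].
# Keep only events whose path begins with index 0 (the first item) and count them.
count=$(jq -n --stream 'reduce (inputs | select(.[0][0] == 0)) as $e (0; . + 1)' sample.json)
echo "$count"   # prints 17 for this sample
```

This reads the whole input, so it is meant for a small extract rather than the large file itself. If items differ in shape (for example a third region entry), the count changes and the fixed-count approach no longer lines up with item boundaries.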


CodePudding user response:

There are several separate tasks involved here, and some are underspecified, but hopefully the following will help you through the thicket:

jq -rn --stream '
  fromstream(1|truncate_stream(inputs))
  | select( .languages == ["de"] )
  | [._id, .time, .body[0], .region[].percentage,
     .reach.lower_bound, .reach.upper_bound, .languages[0]]
  | @csv
' largefile.json > largefile.csv
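
The filter above emits data rows only, while the question also asks for a header row. One possible way to add it is sketched below. It assumes that every object lists its regions in the order region1, region2, and it uses hypothetical scratch files sample.json and sample.csv in place of the real ones. The header is joined with plain commas because @csv would wrap each column name in quotes.

```shell
# Two-object sample from the question (sample.json is a hypothetical name).
cat > sample.json <<'EOF'
[{"_id":"1","time":"2021-07-22","body":["text1"],"region":[{"percentage":"0.1","region":"region1"},{"percentage":"0.9","region":"region2"}],"reach":{"lower_bound":"100","upper_bound":"200"},"languages":["de"]},
{"_id":"2","time":"2021-07-23","body":["text2"],"region":[{"percentage":"0.3","region":"region1"},{"percentage":"0.7","region":"region2"}],"reach":{"lower_bound":"10","upper_bound":"20"},"languages":["en"]}]
EOF

# -n: do not read input implicitly; `inputs` pulls the stream events instead.
# -r: print strings raw, so the CSV lines are not JSON-escaped.
jq -rn --stream '
  # Header row, printed once and unquoted.
  ( ["_id","time","body","percentage_region1","percentage_region2",
     "reach_lower_bound","reach_upper_bound","languages"] | join(",") ),
  # Data rows: rebuild each top-level item, keep German ones, flatten to CSV.
  ( fromstream(1|truncate_stream(inputs))
    | select(.languages == ["de"])
    | [._id, .time, .body[0], .region[].percentage,
       .reach.lower_bound, .reach.upper_bound, .languages[0]]
    | @csv )
' sample.json > sample.csv

cat sample.csv
```

If the region order is not guaranteed, the percentages would need to be looked up by region name instead of by position.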