Using jq and GNU parallel together


I have a jq command that I am trying to parallelise with GNU parallel, but I cannot get it to work.

The vanilla jq query is:

jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv' results.json > test.out

I have tried to use it with parallel like so:

parallel -j0 --keep-order --spreadstdin "jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'" < results.json > test.json

but I get a bizarre compile error:

jq: error: syntax error, unexpected '|', expecting '$' or '[' or '{' (Unix shell quoting issues?) at <top-level>, line 1:
._id as  | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ , .[0:rindex( Electronic address:)] ] | @csv         
jq: 1 compile error

I think it does not like something about the quoting in the string, but the error is a bit unhelpful.
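(For what it's worth, the mangled filter in the error message suggests the shell, not jq, is eating the program: the outer double quotes let the shell expand `$id` — which is unset, so empty — and strip the inner double quotes before jq ever sees them. A minimal reproduction of the `$id` part, using `.foo` as a stand-in filter:)

```shell
# $id is not a defined shell variable, so inside double quotes the
# shell expands it to an empty string -- exactly the gap shown in
# the error message's "._id as  |":
unset id
echo "._id as $id | .foo"
# prints: ._id as  | .foo
```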

UPDATE

Looking at other threads, I managed to construct this:

parallel -a results.json --results test.json -q jq -r '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'

but now it complains:

parallel: Error: Command line too long (76224 >= 63664) at input 0:

:(

An example (first line) of the JSON file:

{
  "_index": "corpuspm",
  "_type": "_doc",
  "_id": "6786777",
  "_score": 1,
  "_source": {
    "CitationTextHeader": {
      "Article": {
        "AuthorList": [
          {
            "Affiliation": {
              "Affiliation": "title, society, American Pediatric Society. [email protected]."
            }
          }
        ]
      }
    }
  }
}

CodePudding user response:

results.json is a large file with one JSON object per line

You could use --spreadstdin and -n1 to spread the input line-wise into your jq filter. Without knowing more about the structure of your input JSON, I have just copied over your "vanilla" filter:

< results.json > test.out parallel -j0 -n1 -k --spreadstdin 'jq -r '\''
  ._id as $id | ._source.CitationTextHeader.Article.AuthorList[]?
  | .Affiliation.Affiliation | [$id, .[0:rindex(" Electronic address:")]]
  | @csv
'\'
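(The `'\''` sequence in the command above is the standard idiom for embedding a single quote inside a single-quoted shell string: close the quote, add an escaped literal quote, reopen. A tiny demonstration:)

```shell
# 'a'\''b' is parsed as 'a' + \' + 'b', i.e. the shell concatenates
# three pieces into the three-character string a'b:
printf '%s\n' 'a'\''b'
# prints: a'b
```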

CodePudding user response:

Without more info this will be a guess:

doit() {
  jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'
}
export -f doit
cat results.json | parallel --pipe doit > test.out

It reads blocks of ~1 MB from results.json and passes them to doit.

If that works, you may be able to speed up the processing with:

parallel --block -1 -a results.json --pipepart doit > test.out

It will split results.json on the fly into n parts (where n = the number of CPU threads). Each part will be piped into doit. The overhead of this is quite small.
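(Conceptually — this is a rough sketch using GNU coreutils `split`, not what parallel does internally — the chunking works like splitting the file at line boundaries into one part per thread and running the command once per part. `doit` here is a hypothetical stand-in filter, not the jq program above:)

```shell
# Stand-in for the real filter, for illustration only:
doit() { tr 'a-z' 'A-Z'; }

printf '%s\n' alpha beta gamma delta > input.txt

# Split at line boundaries into 2 roughly equal chunks
# (split -n l/N is GNU coreutils), then one job per chunk,
# which is roughly what --pipepart does per CPU thread:
split -n l/2 input.txt part.
for p in part.aa part.ab; do
  doit < "$p" > "$p.out" &
done
wait
cat part.aa.out part.ab.out > test.out
```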

Add --keep-order if you need the output to be in the same order as input.

If your disks are slow and your CPU is fast, this may be even faster:

parallel --lb --block -1 -a results.json --pipepart doit > test.out

It will buffer in RAM instead of in tempfiles. --keep-order will, however, not be useful here because the output from job 2 will only be read after job 1 is done.
