I have a jq command which I am trying to parallelise using GNU parallel, but for some reason I am not able to get it to work. The vanilla jq query is:
jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv' results.json > test.out
I have tried to use it with parallel like so:
parallel -j0 --keep-order --spreadstdin "jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'" < results.json > test.json
but I get a bizarre compile error:
jq: error: syntax error, unexpected '|', expecting '$' or '[' or '{' (Unix shell quoting issues?) at <top-level>, line 1:
._id as | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ , .[0:rindex( Electronic address:)] ] | @csv
jq: 1 compile error
I think it does not like something about the quoting inside the string (the $id and the inner double quotes appear to have been swallowed by the shell), but the error is a bit unhelpful.
UPDATE
Looking at other threads, I managed to construct this:
parallel -a results.json --results test.json -q jq -r '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'
but now it complains:
parallel: Error: Command line too long (76224 >= 63664) at input 0:
:(
An example (the first line) of the JSON file:
{
  "_index": "corpuspm",
  "_type": "_doc",
  "_id": "6786777",
  "_score": 1,
  "_source": {
    "CitationTextHeader": {
      "Article": {
        "AuthorList": [
          {
            "Affiliation": {
              "Affiliation": "title, society, American Pediatric Society. [email protected]."
            }
          }
        ]
      }
    }
  }
}
CodePudding user response:
results.json is a large file containing one JSON document per line. You could use --spreadstdin and -n1 to spread the input line by line into your jq filter. Without knowing more about the structure of your input JSONs, I have just copied over your "vanilla" filter:
< results.json > test.out parallel -j0 -n1 -k --spreadstdin 'jq -r '\''
._id as $id | ._source.CitationTextHeader.Article.AuthorList[]?
| .Affiliation.Affiliation | [$id, .[0:rindex(" Electronic address:")]]
| @csv
'\'
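As an aside, if the nested quoting ever gets unwieldy, one way to sidestep it entirely (a sketch, not part of the approach above) is to keep the filter in a file and load it with jq's -f/--from-file option, so the parallel command line contains no embedded quotes:

# Write the jq program to a file (filter.jq is a made-up name) so that
# no shell quoting is needed around the filter itself.
cat > filter.jq <<'EOF'
._id as $id | ._source.CitationTextHeader.Article.AuthorList[]?
| .Affiliation.Affiliation | [$id, .[0:rindex(" Electronic address:")]]
| @csv
EOF

# Same line-wise spreading as above, but jq loads the filter from the file.
< results.json parallel -j0 -n1 -k --spreadstdin 'jq -r -f filter.jq' > test.out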
CodePudding user response:
Without more info this will be a guess:
doit() {
  jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'
}
export -f doit
cat results.json | parallel --pipe doit > test.out
It reads blocks of ~1 MB from results.json, which it passes to doit.
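If the ~1 MB default turns out to be a poor fit for your data (that is only an assumption to illustrate the option), --pipe accepts an explicit block size via --block, e.g.:

# Hand ~10 MB chunks to each jq instance instead of the ~1 MB default.
cat results.json | parallel --pipe --block 10M doit > test.out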
If that works, you may be able to speed up the processing with:
parallel --block -1 -a results.json --pipepart doit > test.out
It will on-the-fly split up results.json into n parts (where n = number of CPU threads). Each part will be piped into doit. The overhead of this is quite small.
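If you want a different number of parts than the number of CPU threads, the jobslot count can be set explicitly with -j (the value 8 below is only an illustration):

# Split results.json into roughly 8 parts, one per jobslot.
parallel -j8 --block -1 -a results.json --pipepart doit > test.out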
Add --keep-order if you need the output to be in the same order as the input.
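For example, the same command with the flag added:

parallel --keep-order --block -1 -a results.json --pipepart doit > test.out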
If your disks are slow and your CPU is fast, this may be even faster:
parallel --lb --block -1 -a results.json --pipepart doit > test.out
It will buffer in RAM instead of in tempfiles. --keep-order will, however, not be useful here, because the output from job 2 will only be read after job 1 is done.