How do I write a batch process command using gnu parallel?

Time:10-19

I'm trying to do some batch processing using a package called ocrmypdf.

Here is a command that processes one PDF file:

ocrmypdf input.pdf output.pdf

And here is a command that processes all PDF files in the directory it is run in:

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

Now I want to run a slightly different command on every PDF file in the directory. This one takes one more parameter:

ocrmypdf --sidecar txt/input.txt input.pdf out/output.pdf

I tried rewriting the earlier parallel command like this:

parallel --tag -j 2 ocrmypdf --sidecar txt/{}.txt {}.pdf out/{}.pdf ::: *.pdf

But I get the error:

ocrmypdf: error: the following arguments are required: output_pdf

Can someone help me understand what I'm doing wrong? Thanks!

CodePudding user response:

Try:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

The .pdf suffixes after the curly brackets are extraneous: {} already expands to the full filename, so {}.pdf becomes something like input.pdf.pdf and the input file cannot be found. For the sidecar text file, putting a period inside the brackets ({.}) removes the extension, so you end up with .txt files instead of .pdf.txt files (with otherwise identical names to the inputs).
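To make the expansion concrete, here is a minimal sketch that imitates {} and {.} with plain shell parameter expansion and only prints the resulting command. The function name build_cmd is made up for illustration, and ${f%.*} is only an approximation of how parallel strips the extension.

```shell
# Print the ocrmypdf command that the replacement strings would produce
# for one filename. Nothing is executed; build_cmd is a hypothetical helper.
build_cmd() {
  f=$1            # what {} expands to, e.g. report.pdf
  stem=${f%.*}    # roughly what {.} expands to, e.g. report
  printf 'ocrmypdf --sidecar txt/%s.txt %s out/%s\n' "$stem" "$f" "$f"
}

build_cmd report.pdf
# prints: ocrmypdf --sidecar txt/report.txt report.pdf out/report.pdf
```

With the original {}.pdf pattern the middle argument would have come out as report.pdf.pdf instead.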

UPDATE: No, this doesn't quite work either; I'm getting the same error as you. GNU parallel can be so tricky sometimes!

Potential Solution

I believe this should work. To avoid the fuss with quotes, I first created a file with the names of all the pdfs (full relative paths from cwd):

ls --color=none *.pdf | parallel -q printf '%s'\\n {} > ocrmypdf.list

Then, I ran the parallel ocrmypdf like so:

parallel -j 2 ocrmypdf --sidecar txt/{.} {} out/{} :::: ocrmypdf.list

I got an error saying that my PDFs already contain text, but I think it would have worked if they didn't. The txt and out directories had to exist before the run. Note the four colons (::::) instead of three: that makes parallel read its arguments from a file. It reads one argument per line, so spaces and similar characters in the PDF filenames are not a problem.
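The preparation steps can also be done without parallel. This is a rough sketch, assuming the PDFs sit in the current directory and that at least one exists (otherwise the literal string *.pdf ends up in the list). The function name prepare_batch is hypothetical.

```shell
# Create the output directories and write one PDF name per line.
prepare_batch() {
  mkdir -p txt out                      # both must exist before the run
  printf '%s\n' *.pdf > ocrmypdf.list   # one filename per line, spaces kept intact
}

# Afterwards (not run here):
#   parallel -j 2 ocrmypdf --sidecar txt/{.} {} out/{} :::: ocrmypdf.list
```

Using printf with a glob sidesteps the colour and quoting issues that come with parsing ls output.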

CodePudding user response:

This works for me:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

If it does not work for you:

  • Identify a failing file
  • Run ocrmypdf on the failing file by hand to check whether it works
  • Edit your question to include a link to the failing file
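One way to find a failing file is to run the command serially and print the names for which it exits non-zero. This is only a sketch: ocr_one and report_failures are hypothetical helper names, and the ocrmypdf arguments simply mirror the command above.

```shell
# Run ocrmypdf on a single file, with the same arguments as the parallel command.
ocr_one() {
  ocrmypdf --sidecar "txt/${1%.*}.txt" "$1" "out/$1"
}

# Run a command on each file in turn and print the names that fail.
report_failures() {
  cmd=$1
  shift
  for f in "$@"; do
    "$cmd" "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
  done
}

# Usage (not run here): report_failures ocr_one *.pdf
```

Any file it prints can then be run by hand to see the full error message.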

(Also be aware of this bug when running multiple tesseracts: https://github.com/tesseract-ocr/tesseract/issues/3109#issuecomment-703845274)
