I'm trying to process PDFs using PyMuPDF, and I'm running this Python file, process_pdf.py, in the terminal.
> import sys, fitz
> fname = sys.argv[1]  # get document filename
> doc = fitz.open(fname)  # open document
> out = open(fname + ".txt", "wb")  # open text output
> for page in doc:  # iterate the document pages
>     text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
>     out.write(text)  # write text of page
> out.close()
Then I feed in a PDF from the terminal, for example python process_pdf.py 1.pdf. This produces 1.pdf.txt (the text version of 1.pdf).
My question: can I make a simple program in the terminal that runs python process_pdf.py document_name.pdf multiple times, the way a for-loop works? I ask because the file names are sequential numbers.
I thought about making a for-loop such as
> for i in range(1,101):
>     python process_pdf.py i.pdf
But that isn't how Python works. P.S. Sorry if this doesn't make any sense; I'm very new to coding :(
CodePudding user response:
Well, yes. You can execute any process from Python, including python.exe (or /usr/bin/python3 on Linux), and give it any parameters you want, using subprocess.Popen, os.system, etc.
There are some better ways mentioned here for specifically running Python scripts from Python (runpy).
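As a rough sketch of the subprocess route: the loop below starts one child Python process per file. It assumes process_pdf.py sits in the current directory and that the PDFs are named 1.pdf through 100.pdf, as in the question. build_command and run_all are made-up helper names, and skipping missing files is my own choice.

```python
import subprocess
import sys
from pathlib import Path


def build_command(number, script="process_pdf.py"):
    """Return the argument list for one run, e.g. python process_pdf.py 7.pdf."""
    # sys.executable is the interpreter running this script, so the child
    # process uses the same Python installation (and the same packages).
    return [sys.executable, script, f"{number}.pdf"]


def run_all(first=1, last=100):
    """Run the script once for each numbered PDF that exists."""
    for number in range(first, last + 1):
        if not Path(f"{number}.pdf").exists():
            continue  # assumption: silently skip gaps in the numbering
        # check=True raises CalledProcessError if the child exits with an error
        subprocess.run(build_command(number), check=True)


if __name__ == "__main__":
    run_all()
```

This works, but it pays the cost of starting a new interpreter 100 times, which is one reason to prefer a loop inside a single script.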
But... this feels like an XY problem.
How about simply generating the file names in the code?
import fitz
for i in range(1,101):
    fname = f"{i}.pdf"  # get document filename
    doc = fitz.open(fname)  # open document
    out = open(fname + ".txt", "wb")  # open text output
    for page in doc:  # iterate the document pages
        text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
        out.write(text)  # write text of page
    out.close()
Also, I'm unfamiliar with "fitz", but you may need to close the "doc" file. Check out the "with" statement.
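Here is a minimal sketch of what the "with" suggestion could look like. It assumes the installed PyMuPDF version lets the object returned by fitz.open be used as a context manager (I believe recent versions do). text_path and convert are hypothetical helper names, and naming the output 1.txt instead of 1.pdf.txt is a choice made here, not something the original script does.

```python
from pathlib import Path


def text_path(pdf_name):
    """Map '1.pdf' to '1.txt' (hypothetical helper, not part of PyMuPDF)."""
    return Path(pdf_name).with_suffix(".txt")


def convert(pdf_name):
    """Write the plain text of every page of pdf_name to a .txt file."""
    # Imported here so this module still loads on a machine without PyMuPDF.
    import fitz

    # Both the document and the output file are closed when the with block
    # exits, even if an exception is raised part-way through.
    with fitz.open(pdf_name) as doc, open(text_path(pdf_name), "wb") as out:
        for page in doc:
            out.write(page.get_text().encode("utf8"))


if __name__ == "__main__":
    for i in range(1, 101):
        name = f"{i}.pdf"
        if Path(name).exists():  # assumption: skip numbers with no file
            convert(name)
```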
CodePudding user response:
If you want to run the for loop from the Python shell and you don't want to use subprocess, rewrite the module and put the instructions in a function.
process_pdf.py
import fitz
def func(fname):
    doc = fitz.open(fname)  # open document
    with open(fname + ".txt", "wb") as out:  # open text output
        for page in doc:  # iterate the document pages
            # get plain text (is in UTF-8)
            text = page.get_text().encode("utf8")
            # write text of page
            out.write(text)
Import the function in the Python shell and call it in the for loop.
>>> from process_pdf import func
>>> for i in range(1,101):
...     func('{}.pdf'.format(i))
...     # func(f'{i}.pdf')
...
Or import the module and call the function using dot notation.
>>> import process_pdf
>>> for i in range(1,101):
...     process_pdf.func('{}.pdf'.format(i))
...     # process_pdf.func(f'{i}.pdf')
...
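One thing to watch with the import approach: any code left at the top level of process_pdf.py (for example, reading sys.argv[1]) runs at import time. A common way around that is a main guard, so the file works both as an importable module and as a command-line script. The sketch below is one possible layout, not the only one; main and its process parameter are hypothetical names added for illustration.

```python
import sys


def func(fname):
    """Extract the plain text of fname into fname + '.txt'."""
    import fitz  # PyMuPDF; imported lazily so the module loads without it

    doc = fitz.open(fname)
    with open(fname + ".txt", "wb") as out:
        for page in doc:
            out.write(page.get_text().encode("utf8"))
    doc.close()


def main(filenames, process=func):
    """Apply process to each file name and return how many were handled."""
    for fname in filenames:
        process(fname)
    return len(filenames)


if __name__ == "__main__":
    # Runs only when started as a script, not when imported.
    main(sys.argv[1:])
```

With this layout, python process_pdf.py 1.pdf 2.pdf 3.pdf handles several files in one run, and from process_pdf import func still works in the shell.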