I need to extract text from a .pdf file uploaded by a user. There are many ways to get text from a PDF, but as far as I know they all expect a file path as the argument, which they open before extracting the text. Flask, on the other hand, gives me a file object for the upload. To get a path I would have to save the upload to a directory and read it back, but other files may already have been uploaded there, so picking the right one becomes a problem. With BytesIO I can create an in-memory stream, but I cannot find a way to extract text from such a stream. How can I extract text from the uploaded .pdf file?
CodePudding user response:
There's a package called PyMuPDF (imported as fitz) that I think will do what you want. An example of the code:
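import fitz  # PyMuPDF

# input_bytes: the raw bytes of the uploaded PDF, e.g. from calling .read() on the upload object
doc = fitz.open(stream=input_bytes, filetype="pdf")
all_text = ""
for page in doc:
    # append each page's text instead of overwriting the previous page
    all_text += page.get_text("text")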