In the code below I'm converting multiple one-page PDFs to PNG format. The conversion itself works well with cv2, but many of the PDF file names contain German umlauts (ä, ö, ü), and the resulting PNG names end up with garbled special characters.
Example: after converting the PDF (lösung_122.pdf) to PNG, the file name looks like this: "lÃ¶sung_122.png". It should be loesung_122.png.
I would like to replace all these characters (ä,ö,ü) in the document titles with ae, oe, ue.
How can I adjust my code to achieve this? What options do I have? Maybe there's a way to rename the documents (PDFs) before converting them?
from pdf2image import convert_from_path
import os
import cv2

if __name__ == '__main__':
    # Init
    dir_name = os.getcwd()
    path_pdf = dir_name + '/data/doc/October'  # Folder containing all documents (PDF)
    save_path = dir_name + '/data/blanko/'  # Folder with all converted doc (PNG)
    # Loop sub Folders:
    files = os.listdir(path_pdf)
    for pdf_file in files:
        # Check if PDF file
        if pdf_file[-3:] == 'pdf':
            images = convert_from_path(path_pdf + '/' + pdf_file, dpi=300, poppler_path='C:/Develop/poppler-0.68.0_x86/poppler-0.68.0/bin')
            # Save Images
            images[0].save(save_path + 'tmp.png', 'PNG')
            img = cv2.imread(save_path + 'tmp.png')
            cv2.imwrite(save_path + pdf_file[:-4] + '.png', img)
Any help appreciated
Regards
CodePudding user response:
It's a bug in cv2.imwrite(): it mangles the file name you give it. You can try this to unmangle the name:
result = os.path.join(save_path, os.path.splitext(pdf_file)[0] + '.png')
cv2.imwrite(result, img)
os.rename(result.encode().decode('mbcs'), result)
This renames the file from the mangled form back to the original. Note that this doesn't remove the umlauts, since Windows can handle those characters in file names.
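To make the mangling visible, here is a small sketch of what the encode/decode pair does. It assumes the local code page is Windows-1252 and spells it out as 'cp1252', because 'mbcs' only exists on Windows. The function names mangle and unmangle are made up for illustration.

```python
# Assumption: the local ANSI code page is Windows-1252 ('cp1252').
# On Windows, the 'mbcs' codec stands for whatever the local code page is.

def mangle(name):
    """Name as it ends up on disk: UTF-8 bytes read as if they were cp1252."""
    return name.encode('utf-8').decode('cp1252')

def unmangle(name):
    """Reverse of mangle(): recover the intended name from the mangled one."""
    return name.encode('cp1252').decode('utf-8')

original = 'lösung_122.png'
mangled = mangle(original)     # 'lÃ¶sung_122.png'
restored = unmangle(mangled)   # 'lösung_122.png' again
```

The os.rename() call uses the mangled form as the source name and the intended name as the target.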
Note, though, that it can only restore characters represented in your local encoding, which is probably Windows-1252.
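If you still want ae, oe, ue instead of the umlauts, as asked in the question, you could transliterate the name before building the output path. This is only a sketch: ascii_stem and UMLAUT_MAP are hypothetical names, and the mapping for the capital letters and ß is my assumption about what you'd want.

```python
import os
import unicodedata

# Hypothetical mapping: German umlauts (and ß) to their ASCII spellings.
UMLAUT_MAP = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
})

def ascii_stem(pdf_file):
    """Return the file name without its extension, umlauts transliterated."""
    stem = os.path.splitext(pdf_file)[0]
    # Normalize to composed form so 'o' + combining diaeresis is also matched.
    stem = unicodedata.normalize('NFC', stem)
    return stem.translate(UMLAUT_MAP)
```

In the loop you would then write cv2.imwrite(os.path.join(save_path, ascii_stem(pdf_file) + '.png'), img). As long as the rest of the path is plain ASCII, there is nothing left for imwrite to mangle, so the rename step should not be needed.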