cv2 rename ä ö ü to ae oe ue

In my code I'm converting multiple one-page PDFs to PNG format. The conversion itself works well with cv2, but many of the PDF file names contain German umlauts (ä, ö, ü), and the resulting PNG file names end up with garbled special characters.

Example: after converting the PDF (lösung_122.pdf) to PNG, the file name looks like this: "lÃ¶sung_122.png". It should be loesung_122.png.

I would like to replace all these characters (ä,ö,ü) in the document titles with ae, oe, ue.

How can I adjust my code to achieve this? What options do I have? Maybe there's a way to rename the documents (PDFs) before converting them?

from pdf2image import convert_from_path
import os
import cv2


if __name__ == '__main__':

    # Init
    dir_name = os.getcwd()
    path_pdf = dir_name + '/data/doc/October' #Folder containing all documents (PDF)
    save_path = dir_name + '/data/blanko/' #Folder with all converted doc (PNG)

    # Loop sub Folders:
    files = os.listdir(path_pdf)
    for pdf_file in files:

        # Check if PDF file
        if pdf_file[-3:] == 'pdf':
            images = convert_from_path(path_pdf + '/' + pdf_file, dpi=300, poppler_path='C:/Develop/poppler-0.68.0_x86/poppler-0.68.0/bin')

            # Save Images
            images[0].save(save_path + 'tmp.png', 'PNG')
            img = cv2.imread(save_path + 'tmp.png')
            cv2.imwrite(save_path + pdf_file[:-4] + '.png', img)

Any help appreciated

Regards

CodePudding user response：

It's a bug in cv2.imwrite(): it mangles the file name you give it. You can try this to unmangle the name:

    result = os.path.join(save_path, os.path.splitext(pdf_file)[0] + '.png')
    cv2.imwrite(result, img)
    os.rename(result.encode().decode('mbcs'), result)

This renames the file from the mangled form back to the original. Note that this doesn't remove the umlauts, since Windows can handle those characters in file names.
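
To see where the mangled name comes from, here is a small sketch that reproduces it without cv2. It assumes the local ANSI code page is Windows-1252 (that is what 'mbcs' would resolve to on a typical German Windows setup); the helper names mangle and unmangle are made up for this illustration.

```python
# Assumption: the ANSI code page is Windows-1252 ("cp1252").
# On Windows, 'mbcs' stands for whatever the local ANSI code page is.

def mangle(name: str, ansi_codepage: str = "cp1252") -> str:
    """Encode the name as UTF-8, then misread those bytes in the ANSI code page."""
    return name.encode("utf-8").decode(ansi_codepage)


def unmangle(name: str, ansi_codepage: str = "cp1252") -> str:
    """Reverse the step above: back to bytes, then decode them as UTF-8."""
    return name.encode(ansi_codepage).decode("utf-8")


if __name__ == "__main__":
    print(mangle("lösung_122.png"))            # lÃ¶sung_122.png
    print(unmangle(mangle("lösung_122.png")))  # lösung_122.png
```

The os.rename() call above relies on the first of these two steps to predict the name that ended up on disk.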

Note, though, that it can only restore characters that are representable in your local encoding, which is probably Windows-1252.
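
If you actually want ae, oe, ue in the PNG names, as asked in the question, you could transliterate the name before handing it to cv2.imwrite(). This is only a rough sketch: UMLAUT_MAP and ascii_png_name are names I made up, and the extra entries for capital umlauts and ß are my own assumption about what you would want.

```python
import os
import unicodedata

# Hypothetical mapping; extend it if other characters show up in your file names.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def ascii_png_name(pdf_file: str) -> str:
    """Return the PNG file name for a PDF file name, with umlauts transliterated."""
    stem = os.path.splitext(pdf_file)[0]
    # Normalize first so that "o" + combining diaeresis becomes a single "ö".
    stem = unicodedata.normalize("NFC", stem)
    return stem.translate(UMLAUT_MAP) + ".png"
```

In the loop from the question this would be used as cv2.imwrite(os.path.join(save_path, ascii_png_name(pdf_file)), img). As long as the resulting name and save_path contain only ASCII characters, there is nothing left to mangle and the os.rename() step is not needed.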