Retaining the font when converting PDF to EPUB

I'm currently working on a project to convert PDF to EPUB using Python. During conversion, the styling, such as font family and font size, needs to be exactly the same in the EPUB as in the PDF. Is there a way to achieve this using Python? I don't want to rely on any external software to do it. So far I have used Aspose.

# Code I used

import aspose.words as aw

doc = aw.Document("Input.pdf")
doc.save("Output.epub")

The input is a simple, text-only PDF.

CodePudding user response：

You are going to get a variety of answers and comments asking you to show the code you tried, post sample documents, and so on.

Let me save you the trouble. Your question seems straightforward: you want to convert a PDF to EPUB and retain the style information.

Good luck.

It will all depend on your PDF file. Does it have embedded fonts, or does it rely on system fonts? Complicated layout? Headers and footers? What about images? Dingbats characters? What if there is no text in the PDF, just PostScript drawings of text characters? What if the PDF consists only of scanned pages in a PDF container? Is everything in English? Any Unicode characters? Are you looking to get the styles right at the page level? Paragraph? Sentence? Word? Or character level?

Basically, this is a hard problem. PDF was designed as an end-use format, not an interchange format. Most things get converted to PDF because someone wanted to control how the final product looked. You can look at text extraction tools for PDF, but there is no easy solution with either open-source or commercial tools.
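
To get a feel for what such extraction tools expose, here is a minimal sketch that tallies the font name and size of every text span in a PDF. It assumes the third-party PyMuPDF package (imported as fitz), which nobody in this thread has mentioned; the helper names count_styles and survey_pdf_styles are made up for this example. A scanned PDF with no text layer would simply produce an empty tally, which is a useful warning sign in itself.

```python
from collections import Counter


def count_styles(page_dict):
    """Tally characters per (font, size) pair in one page's text dictionary.

    `page_dict` is expected to have the shape returned by PyMuPDF's
    page.get_text("dict"): blocks -> lines -> spans.
    """
    tally = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                key = (span["font"], round(span["size"], 1))
                tally[key] += len(span["text"])
    return tally


def survey_pdf_styles(pdf_path):
    """Sum the tallies over all pages. Assumption: PyMuPDF is installed."""
    import fitz  # imported lazily so count_styles works without PyMuPDF

    total = Counter()
    with fitz.open(pdf_path) as doc:
        for page in doc:
            total += count_styles(page.get_text("dict"))
    return total
```

Running survey_pdf_styles("Input.pdf") on a simple text PDF should give a short list of font and size pairs. That list only tells you what the PDF claims to use; it does not by itself carry those styles into an EPUB.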

CodePudding user response：

You can easily convert PDF to EPUB using Aspose.Words for Python. The code is pretty simple:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.pdf")
doc.save("C:\\Temp\\out.epub")

However, when a PDF is loaded into the Aspose.Words Document Object Model, it is converted from a fixed-page layout to a flow document, and it is saved to EPUB as a flow document as well. I am afraid this might lead to layout and formatting losses during conversion.
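
One way to see how much formatting survived is to look inside the resulting file. An EPUB is a ZIP archive containing XHTML and CSS, so the standard library is enough. The following is a rough sketch, not part of Aspose.Words: the function name epub_font_declarations is hypothetical, and the regular expression only catches plain font-family and font-size declarations, so treat the result as a quick sanity check rather than a full CSS analysis.

```python
import re
import zipfile
from collections import Counter

# Matches CSS declarations such as "font-family: Arial" or "font-size:12pt".
FONT_DECL = re.compile(r"(font-family|font-size)\s*:\s*([^;}\"]+)", re.IGNORECASE)
TEXT_SUFFIXES = (".css", ".xhtml", ".html", ".htm")


def epub_font_declarations(epub_path):
    """Count font-family / font-size values found in an EPUB's CSS and XHTML."""
    found = Counter()
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            # Skip images, fonts, and other members that are not markup or CSS.
            if not name.lower().endswith(TEXT_SUFFIXES):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            for prop, value in FONT_DECL.findall(text):
                found[(prop.lower(), value.strip())] += 1
    return found
```

Calling epub_font_declarations("C:\\Temp\\out.epub") after the conversion lists the font families and sizes the EPUB actually declares, which you can then compare by eye with what the PDF shows.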