Convert .doc/.docx to .pdf from URL, on-the-fly, with Python, on Linux

Time:06-24

I need to fetch .doc or .docx files from external sites, convert them to PDF, and return the content. I then add a Content-Type header, publish through my CMS, cache via a CDN, and display the result within HTML using the Adobe PDF Embed API. I'm using Python 3.7.

As a test, this works:

import os
import subprocess
from time import sleep

def generate_pdf():
    subprocess.call(['soffice', '--convert-to', 'pdf',
                     'https://arbitrary.othersite.com/anyfilename.docx'])
    sleep(1)
    # soffice writes the PDF into the current working directory
    with open('anyfilename.pdf', 'rb') as myfile:
        content = myfile.read()
    os.remove('anyfilename.pdf')
    return content

This would be nice:

def generate_pdf(url):
    result = subprocess.call(['soffice', '--convert-to', 'pdf', url])
    content = result
    return content

The URLs could include arbitrary query parameters or characters that are illegal in file names, which makes it hard to predict the resulting file name. In any case, I'd prefer not to have to sleep, save, read, and delete the converted file.

Is this possible?

Answer:

I don't think soffice supports writing to stdout, so you don't have many choices. If you write the output to a temporary directory, though, you can use os.listdir to get the file name:

import os
import subprocess
import tempfile

url = "https://www.usariem.army.mil/assets/docs/journal/Lieberman_DS_survey_and_guidelines.docx"
with tempfile.TemporaryDirectory() as tmpdirname:
    # Convert into an empty temporary directory so the only file in it is the PDF
    subprocess.run(["soffice", "--convert-to", "pdf", "--outdir", tmpdirname, url], cwd="/")
    files = os.listdir(tmpdirname)
    if files:
        print(files[0])
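
To get closer to the generate_pdf(url) shape from the question, the same temporary-directory approach could be wrapped in a function that returns the PDF bytes. This is only a sketch, not a tested production version: the run parameter is a hypothetical hook (not part of any library) that lets the function be tried without LibreOffice installed, and the --headless flag and check=True are additions that were not in the snippet above.

```python
import os
import subprocess
import tempfile


def generate_pdf(url, run=subprocess.run):
    """Convert the document at `url` to PDF and return the PDF bytes.

    `run` defaults to subprocess.run; it is a hypothetical hook so a
    stand-in can be passed when soffice is not available.
    """
    with tempfile.TemporaryDirectory() as tmpdirname:
        # check=True raises CalledProcessError on a non-zero exit status.
        run(
            ["soffice", "--headless", "--convert-to", "pdf",
             "--outdir", tmpdirname, url],
            check=True,
        )
        # The directory was empty before the call, so any PDF in it is ours,
        # whatever file name soffice derived from the URL.
        pdfs = [name for name in os.listdir(tmpdirname)
                if name.lower().endswith(".pdf")]
        if not pdfs:
            raise RuntimeError("soffice produced no PDF for " + url)
        with open(os.path.join(tmpdirname, pdfs[0]), "rb") as pdf_file:
            # The bytes are read before the temporary directory is deleted.
            return pdf_file.read()
```

The temporary directory and the file inside it are removed automatically when the with block exits, so there is no explicit sleep or os.remove, and the file name never has to be guessed from the URL.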