Home > database >  Compress html as gzip instead of (pk)zip
Compress html as gzip instead of (pk)zip

Time:06-19

I am trying to adapt a Python 2.7 routine for creating SQLite db dictionaries for an e-reader from TSV files.

The routine shown creates a (pk)zip file of an html string in lines 71-74 (# compress and save). I'd like the compressed file to be in gzip format instead. I'm really, really shaky on Python and I have been looking at examples of using gzip compression, trying to fit them into this routine (which I have already adapted in other ways for my own use from the original). I know that changes need to be made in line 9 (import) also, but am not sure which.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# by Jiri Orsag, 2014
# https://github.com/geoRG77/nook-dictionary
# Many thanks to Homeless Ghost for his script 'createRenateNSTdictionaryfromnookdictionarydb.py'
# which was a great source of ideas for my work

import sqlite3, sys, zipfile, zlib, os

# config
DICTIONARY_FILE = 'test.txt' # input file (needed)
OUTPUT_DB = 'ox_en_GB.db' # output file
TEMP_DIRECTORY = './temp/' # will be deleted after successful run
STEP = 10000 # for print message

########################################################

print 'Converting dictionary...'

con = sqlite3.connect(OUTPUT_DB)
con.text_factory = str
cur = con.cursor()

index = 0
duplicateCount = 1
prevTerm = ''

try:
  if not os.path.exists(TEMP_DIRECTORY):
    os.makedirs(TEMP_DIRECTORY)

  # open dict file
  dict = open(DICTIONARY_FILE, 'r')

  # delete previous tables
  cur.execute('DROP TABLE IF EXISTS android_metadata')
  cur.execute('DROP TABLE IF EXISTS tblWords')

  # create tables
  cur.execute('CREATE TABLE "android_metadata"("locale" TEXT)')
  cur.execute('CREATE TABLE "tblWords"("_id" INTEGER PRIMARY KEY AUTOINCREMENT, "term" TEXT COLLATE nocase, "description" BLOB)')

  # convert dict to sql
  for line in dict:
    index  = 1
    # uncomment next line to debug
    # print '# current line = %d' % index

    # split line
    data = line.split('\t')
    term = data.pop(0)

    # create HTML
    html = '<b>'   term   '</b>'   data[0].strip()

    # check for duplicates
    if term == prevTerm:
      duplicateCount  = 1
      termEdited = term   '['   str(duplicateCount)   ']'
    else:
      termEdited = term
      duplicateCount = 1

    # create html file
    term_stripped = termEdited.replace('/', '')
    temp_html = open(TEMP_DIRECTORY   term_stripped, 'wb')
    temp_html.write(html)
    temp_html.close()

    # compress & save
    zf = zipfile.ZipFile('_temp', mode='w')
    zf.write(TEMP_DIRECTORY   term_stripped)
    zf.close()

    # read & insert compressed data
    temp_compressed = open('_temp', 'rb')
    compressed = temp_compressed.read()
    cur.execute('INSERT INTO tblWords (_id, term, description) VALUES(?, ?, ?)', (index, termEdited, sqlite3.Binary(compressed)))

    # if duplicate then update previous row with [1]
    if duplicateCount == 2:
      cur.execute('UPDATE tblWords SET term="'   str(term   "[1]")   '" WHERE _id='   str(index - 1)   '')

    os.remove(TEMP_DIRECTORY   term_stripped)
    prevTerm = term

    # print _id, term, description
    if ((index % STEP) == 0):
      print '# current line = %d' % index

    #if index == 100:
    #  break;

  # create term_index
  cur.execute('CREATE INDEX term_index on tblWords (term ASC)')
  cur.execute('SELECT * FROM tblWords order by _id LIMIT 10')

  dict.close
  # os.remove('_temp')
  # os.rmdir(TEMP_DIRECTORY)

except Exception, e:
  raise
else:
  pass
finally:
  pass

print 'Done. '   str(index)   ' lines converted.'

Any hints or good references?

CodePudding user response:

import gzip and replace zipfile.ZipFile with gzip.open. Write to the gzip file what you wrote to the temporary file, which is no longer necessary. E.g.

gz = gzip.open('_temp', mode='wb')
gz.write(html)
gz.close()

and get rid of the temp_html lines and the os.remove() of it. I recommend the b in the mode, which was missing for the zip file for some reason.

  • Related