Home > Enterprise >  UTF-16 representation of arbitrarily long binary string
UTF-16 representation of arbitrarily long binary string

Time:11-17

I have binary strings and want to store them as compactly (in terms of disk space) as possible. They can be between 1 and ~1000 bits of the form '011010010110100110010101'. Storing these as "TEXT" is wasteful. I'd like to retrieve and convert them back to the original binary string. SQLite's TEXT type can be UTF-8, UTF-16BE, or UTF-16LE.

Example:

'0110101011011000001101010010111001101010011101101010100011001101001001010110010010010101010100'

Convert to :

'櫘㔮橶꣍╤锔'

for storage in the database and convert back to original binary string.

I tried many solutions but either the binary string is in the wrong format or the characters are output in some format like '\xff\xfe0\x001' which is still wasteful or only works with multiples of 8 or 16 bits.

CodePudding user response:

Convert the binary string to an integer and track the number of bits. A BLOB of byte data can be generated for storage as SQLITE INTEGER size would be exceeded if you have thousands of bits.

Example:

import sqlite3
import os
import math

# store string of binary data packed into bytes
def insert(cur, binary):
    bit_count = len(binary)
    byte_count = math.ceil(bit_count / 8)
    byte_data = int(binary, 2).to_bytes(byte_count, 'little')
    cur.execute("insert into test values(?, ?)", (bit_count, byte_data))

# set up database
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('create table test(bit_count, binary)')
con.commit()

# insert some strings
for binary in ['101', '000011', '101010111011011011110001', '0110' * 250]:
    insert(cur, binary)

# retrieve, unpack and display
res = cur.execute('select * from test')
for bit_count, byte_data in res:
    integer = int.from_bytes(byte_data, 'little')
    binary = f'{integer:0{bit_count}b}'
    print(binary)

Output:

101
000011
101010111011011011110001
0110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110
  • Related