How do I compress a rather long binary string in Python so that I will be able to access it later?

Time:03-26

I have a long array of items (4700 of them) whose entries will ultimately be 1 or 0 after comparison with settings in another list. I want to construct a single integer/string that I can store in some of the metadata, so that it can be accessed later to uniquely identify the combination of items that went into it.

I am writing this all in Python. I am thinking of doing something like zlib compression plus a hex conversion, but I am getting confused about how to do the inverse transformation. Assuming bin_string is the string of 1s and 0s, it would look something like this:

import zlib
# example bin_string; the real one is much longer
bin_string = "1001010010100101010010100101010010101010000010100101010"
compressed = zlib.compress(bin_string.encode())
this_hex = compressed.hex()

where I can then save this_hex to the metadata. The question is, how do I get the original bin_string back from my hex value? I have lots of Python experience with numerical methods and such but little with compression, so any basic insights would be very valuable.

CodePudding user response:

You should try using numpy's savez_compressed() function.

Convert your plain list into a numpy array and then save it with:

numpy.savez_compressed("filename.npz", bits=your_array)

Use

numpy.load("filename.npz")

to read the .npz archive back; it returns a dict-like object keyed by the names you passed when saving.
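A minimal round-trip sketch of that approach (the array, key name, and filename here are illustrative, not from the question):

```python
import numpy as np

# Example bit array; the real one would have 4700 entries.
bits = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0], dtype=np.uint8)

# Save the array under a named key in a compressed .npz archive.
np.savez_compressed("bits.npz", bits=bits)

# Load it back; the key matches the keyword used when saving.
restored = np.load("bits.npz")["bits"]
assert (restored == bits).all()
```

Note this writes a file rather than producing a string, so it fits metadata stored as a separate file, not an inline field.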

CodePudding user response:

Just do the inverse of each operation. This:

zlib.decompress(bytearray.fromhex(this_hex)).decode()

will return your original string.
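Putting both directions together, the full round trip for the question's example looks like:

```python
import zlib

bin_string = "1001010010100101010010100101010010101010000010100101010"

# Forward: compress the ASCII bytes, then render as hex for the metadata.
this_hex = zlib.compress(bin_string.encode()).hex()

# Inverse: parse the hex back to bytes, decompress, decode to str.
recovered = zlib.decompress(bytes.fromhex(this_hex)).decode()
assert recovered == bin_string
```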

It would be faster, and might even result in better compression, to simply encode your bits as bits in a byte string, along with the number of bits unused in the last byte. For your example string, that would be eight bytes instead of the 22 you're getting from zlib.compress(). zlib would do better only if there is a strong bias toward 0s or 1s, and/or there are repeating patterns in them.
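One way to sketch that bit-packing idea (the helper names are mine, not part of the answer): pad the string to a multiple of 8, pack each group of 8 characters into a byte, and prepend one byte recording the pad length.

```python
def pack_bits(bin_string: str) -> bytes:
    # Pad to a multiple of 8 and remember how many pad bits were added.
    pad = (-len(bin_string)) % 8
    padded = bin_string + "0" * pad
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    # First byte records the pad length so unpacking is exact.
    return bytes([pad]) + body

def unpack_bits(data: bytes) -> str:
    pad = data[0]
    bits = "".join(f"{b:08b}" for b in data[1:])
    return bits[:len(bits) - pad] if pad else bits

s = "1001010010100101010010100101010010101010000010100101010"
# 55 bits pack into 7 payload bytes + 1 pad-count byte = 8 bytes total.
assert len(pack_bits(s)) == 8
assert unpack_bits(pack_bits(s)) == s
```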

As for encoding for the metadata, Base64 would be more compact than hexadecimal.
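For comparison, hex doubles the byte count while Base64 adds roughly a third. A quick sketch using the question's example:

```python
import base64
import zlib

compressed = zlib.compress(
    "1001010010100101010010100101010010101010000010100101010".encode()
)
as_hex = compressed.hex()
as_b64 = base64.b64encode(compressed).decode("ascii")

# Base64 is shorter than hex for the same payload.
assert len(as_b64) < len(as_hex)
# Round trip back to the compressed bytes.
assert base64.b64decode(as_b64) == compressed
```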
