How to decompress a very large zipped file (.zip ~10 GBs)?

Time:05-26

How can I decompress a very large zipped file (.zip, ~10 GB) using a Python library? It is a compressed CSV file of about 50 GB. I used the following code:

import zipfile
import zlib
import os

src = open(r"..\data.zip", "rb")

zf = zipfile.ZipFile( src )

for m in  zf.infolist():
    # Examine the header
    print ("Info ::",m.filename, m.header_offset)
    src.seek( m.header_offset )
    src.read( 30 ) # Good to use struct to unpack this.
    nm= src.read( len(m.filename) )
    if len(m.extra) > 0: ex= src.read( len(m.extra) )
    if len(m.comment) > 0: cm= src.read( len(m.comment) )
    # Build a decompression object
    decomp= zlib.decompressobj(-15)
    # This can be done with a loop reading blocks
    out= open( m.filename, "wb " )
    print("Out ::",out )
    result= decomp.decompress(src.read( m.compress_size ), )
    out.write( result )
    result = decomp.flush()
    out.write( result )
    # end of the loop
    out.close()
zf.close()
src.close()

I get the following error:

Info :: data.csv 0 2853497750 b'\x01\x00\x08\x009\xd7\xb3T\x05\x00\x00\x00' b''
Out :: <_io.BufferedRandom name='Sample_big.csv'>
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
Input In [7], in <cell line: 5>()
     16 out= open( m.filename, "wb " )
     17 print("Out ::",out )
---> 18 result= decomp.decompress(src.read( m.compress_size ), )
     19 out.write( result )
     20 result = decomp.flush()

error: Error -3 while decompressing data: invalid block type

I need to convert the zipped file to HDF5 in order to manipulate the data using the vaex library.

CodePudding user response:

There is no point in attempting (and failing) to interpret and act on the details of the zip file data structures yourself, not to mention creating and writing to the subdirectories specified therein, when the whole point of Python's ZipFile is to handle all of that for you.

If you want to extract the contents, just use zf.extractall(). If you want to extract just one entry, use zf.extract(one entry from the infolist). If you want to read the entry like a file, use f = zf.open(one entry from the infolist), and f.read(some amount).
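For an entry this large, the zf.open() route can be combined with a chunked copy so that neither the compressed nor the decompressed data has to fit in memory. The following is only a minimal sketch of that idea: the helper name extract_member_streaming, its parameters, and the 1 MiB chunk size are assumptions made for illustration and are not part of the zipfile module.

```python
import shutil
import zipfile
from pathlib import Path


def extract_member_streaming(zip_path, member_name, dest_dir, chunk_size=1024 * 1024):
    """Copy one archive entry into dest_dir in fixed-size chunks.

    Hypothetical helper: the name, parameters and default chunk size are
    illustrative assumptions, not a zipfile API.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the base name so the entry is always written inside dest_dir.
    target = dest_dir / Path(member_name).name
    with zipfile.ZipFile(zip_path) as zf:
        # zf.open() returns a file-like object that decompresses on the fly.
        with zf.open(member_name) as src, open(target, "wb") as dst:
            # copyfileobj reads and writes chunk_size bytes at a time.
            shutil.copyfileobj(src, dst, chunk_size)
    return target
```

With the names from the question, a call might look like extract_member_streaming(r"..\data.zip", "data.csv", "out"). ZipFile locates the entry and interprets its headers itself, so no manual seek() or zlib handling is needed.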

CodePudding user response:

I could not solve the problem using the zipfile library, so I used another approach. The py7zr library works for this type of problem. Here is the solution using py7zr:

import py7zr
with py7zr.SevenZipFile("file.7z", 'r') as archive:
    archive.extract(path=r"...\tempfolder")