Uncompress tar.bz2 from s3 and move the files back to s3 from Python


Suppose I have a bzip2-compressed tar archive x.tar.bz2 stored in s3. I would like to decompress it and place the extracted files back in s3. This can be achieved by:

import tarfile

from s3fs import S3FileSystem

fs = S3FileSystem()

# download the archive from s3 to the local disk
with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    r = f.read()
with open('x.tar.bz2', mode='wb') as localfile:
    localfile.write(r)

# extract it locally, then upload the extracted tree back to s3
with tarfile.open('x.tar.bz2', 'r:bz2') as tar:
    tar.extractall(path='extraction/path')

fs.put('extraction/path', 's3://path_to_destination/x', recursive=True)

With the solution above, I am saving the file content twice on my local disk. I have the following questions (the solution is expected to be done in Python):

  1. Is it possible (using the tarfile module) to load the data directly from s3 and also extract it there, avoiding storing the data on the local drive?
  2. Is it possible to do this job in a streaming mode, without needing to have the whole x.tar.bz2 file (or at least the uncompressed archive x.tar) in memory?

CodePudding user response:

tarfile.open accepts a file-like object as the fileobj argument, so you can pass it the file object you get from S3FileSystem.open. You can then iterate over the TarInfo objects in the archive and open the corresponding path in S3 for writing:

with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r:bz2') as tar:
        for entry in tar:
            if not entry.isfile():
                continue  # skip directories and other non-file members
            # extractfile returns a file-like object for the member
            with fs.open(f's3://path_to_destination/{entry.name}', mode='wb') as writer:
                writer.write(tar.extractfile(entry).read())
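Regarding the second question, here is a minimal sketch of a more memory-friendly variant (the paths and the chunk size are illustrative): opening the tar in streaming mode ('r|bz2') reads the archive sequentially from s3 and decompresses it on the fly, and shutil.copyfileobj copies each member to s3 in fixed-size chunks instead of loading it fully into memory with .read(). Note that s3fs still buffers its multipart uploads in blocks internally, so memory use is bounded rather than zero.

import shutil
import tarfile

from s3fs import S3FileSystem

fs = S3FileSystem()

# 'r|bz2' opens the tar in streaming mode: the archive is read
# sequentially, with no seeking and no local copy of x.tar.bz2 or x.tar
with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r|bz2') as tar:
        for entry in tar:
            if not entry.isfile():
                continue  # skip directories and other non-file members
            source = tar.extractfile(entry)
            with fs.open(f's3://path_to_destination/{entry.name}', mode='wb') as writer:
                # copy in fixed-size chunks instead of reading the whole
                # member into memory at once
                shutil.copyfileobj(source, writer, length=1024 * 1024)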