Suppose I have a bzip2-compressed tar archive x.tar.bz2 stored in S3. I would like to decompress it and place the result back in S3. This can be achieved by:
from s3fs import S3FileSystem
import tarfile

fs = S3FileSystem()

# Download the archive from S3 into memory, then write it to local disk
with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    r = f.read()
with open('x.tar.bz2', mode='wb') as localfile:
    localfile.write(r)

# Extract the local archive, then upload the extracted tree back to S3
tar = tarfile.open('x.tar.bz2', "r:bz2")
tar.extractall(path='extraction/path')
tar.close()
fs.put('extraction/path', f's3://path_to_destination/x', recursive=True)
Within the solution above, I am saving the file content twice to my local disk. I have the following questions (the solution is expected to be done in Python):
- Is it possible (using the tarfile module) to load the data directly from S3 and also extract it there, avoiding storing data on the local drive?
- Is it possible to do this job in streaming mode, without needing to have the whole x.tar.bz2 (or at least the uncompressed archive x.tar) file in memory?
CodePudding user response:
tarfile.open accepts a file-like object as the fileobj argument, so you can pass it the file object you get from S3FileSystem.open. You can then iterate over the TarInfo objects in the tar object and open the corresponding path in S3 for writing:
with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r:bz2') as tar:
        for entry in tar:
            if entry.isfile():  # extractfile() returns None for directories
                with fs.open(f'path_to_destination/{entry.name}', mode='wb') as writer:
                    writer.write(tar.extractfile(entry).read())
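Note that extractfile(entry).read() still loads each extracted member fully into memory before writing it out. If individual members can be large, you can keep memory usage bounded by copying in fixed-size chunks with shutil.copyfileobj. A minimal sketch, assuming the same source and destination paths as above (the 1 MiB chunk size is an arbitrary choice):

import shutil
import tarfile
from s3fs import S3FileSystem

fs = S3FileSystem()

# tarfile decompresses the bz2 stream on the fly as it reads from the
# S3 file object, so the archive is consumed sequentially.
with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r:bz2') as tar:
        for entry in tar:
            if not entry.isfile():
                continue  # skip directories, symlinks, etc.
            with tar.extractfile(entry) as reader, \
                    fs.open(f'path_to_destination/{entry.name}', mode='wb') as writer:
                # Copy one 1 MiB chunk at a time instead of read() on the whole member
                shutil.copyfileobj(reader, writer, length=1024 * 1024)

This way neither the whole x.tar.bz2 nor any single extracted member has to exist on local disk or in memory at once.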