Read .gz files inside .tar files without extracting

Time:12-10

I have a .tar file that contains many .gz files inside a folder. Each of these .gz files contains a .txt file. Other Stack Overflow questions related to this problem are aimed at extracting the files.

I am trying to iteratively read the content of each .txt file without extracting them, because the .tar is large.

First I read the contents of the .tar file:

import tarfile
tar = tarfile.open("FILE.tar")
tar.getmembers()

Or in Unix:

tar xvf file.tar -O

Then I tried using the tarfile extractfile method, but I'm getting an error: "module 'tarfile' has no attribute 'extractfile'". Besides, I'm not even sure that is the right method.

import gzip
for member in tar.getmembers():
    m = tarfile.extractfile(member)
    file_contents = gzip.GzipFile(fileobj=m).read()

If you want to create an example file to simulate the original file:

$ mkdir directory
$ touch directory/file1.txt.gz directory/file2.txt.gz directory/file3.txt.gz
$ tar -c -f file.tar directory
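
The `touch` commands above create empty files, so the members hold no text to read back. As a rough alternative, the sketch below builds a similar archive from Python with real gzip payloads. The helper name `build_sample_tar` and the `directory/` prefix are assumptions made for illustration, not part of the original question.

```python
import gzip
import io
import tarfile


def build_sample_tar(path, texts):
    """Write an uncompressed .tar at `path` with one gzip member per entry.

    `texts` maps a base name such as "file1.txt" to its text content.
    (Hypothetical helper, used only to produce test data.)
    """
    with tarfile.open(path, "w") as tar:
        for name, text in texts.items():
            payload = gzip.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=f"directory/{name}.gz")
            info.size = len(payload)  # addfile() reads exactly info.size bytes
            tar.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    build_sample_tar("file.tar", {"file1.txt": "one\n", "file2.txt": "two\n"})
```

Each member is compressed in memory and added with `addfile`, so no intermediate .gz files are written to disk.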

This is the final version that worked for me after using Mark Adler's suggestion:

import gzip
import tarfile

tar = tarfile.open("file.tar")
members = tar.getmembers()

# Here I append the names to a list, because I wasn't able to
# parse the type returned by .getmembers():
tar_name = []
for elem in members:
    tar_name.append(elem.name)

# Then I changed tarfile.extractfile to tar.extractfile as suggested:
for member in tar_name:
    # I'm using this check because I have other non-gz files in the directory
    if member.endswith(".gz"):
        m = tar.extractfile(member)
        file_contents = gzip.GzipFile(fileobj=m).read()

CodePudding user response:

You need to use tar.extractfile(member) instead of tarfile.extractfile(member). tarfile is the module, and it doesn't know about the tar file you opened. tar is the TarFile object returned by tarfile.open(), which references the .tar file you opened.
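
A minimal sketch of that fix, assuming the archive layout described in the question (gzip members holding UTF-8 text). The function name `iter_gz_texts` and the `encoding` parameter are made up for this example.

```python
import gzip
import tarfile


def iter_gz_texts(tar_path, encoding="utf-8"):
    """Yield (member name, decompressed text) for every .gz member of a tar.

    Nothing is written to disk; each member is read through a file-like
    object obtained from the opened TarFile instance.
    """
    with tarfile.open(tar_path) as tar:      # `tar` is a TarFile object
        for member in tar:                   # iterating yields TarInfo objects
            if not (member.isfile() and member.name.endswith(".gz")):
                continue                     # skip directories and non-gz files
            raw = tar.extractfile(member)    # method of the object, not the module
            with gzip.open(raw, "rt", encoding=encoding) as text:
                yield member.name, text.read()
```

Usage would look like `for name, text in iter_gz_texts("file.tar"): ...`. Iterating over the TarFile object directly avoids building the full member list first, which may help with a large archive.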

CodePudding user response:

Here's how to do it from the Unix command line (bash):

To prepare the files:

$ git clone https://github.com/githubtraining/hellogitworld.git
$ cd hellogitworld
$ gzip *
$ ls
build.gradle.gz  fix.txt.gz  pom.xml.gz  README.txt.gz  resources  runme.sh.gz  src
$ cd ..
$ tar -cf hellogitworld.tar hellogitworld/

Here is how to view its readme:

$ tar -Oxf hellogitworld.tar hellogitworld/README.txt.gz | zcat

Result:

This is a sample project students can use during Matthew's Git class.

Here is an addition by me

We can have a bit of fun with this repo, knowing that we can always reset it to a known good state.  We can apply labels, and branch, then add new code and merge it in to the master branch.

As a quick reminder, this came from one of three locations in either SSH, Git, or HTTPS format:

* git@github.com:matthewmccullough/hellogitworld.git
* git://github.com/matthewmccullough/hellogitworld.git
* https://matthewmccullough@github.com/matthewmccullough/hellogitworld.git

We can, as an example effort, even modify this README and change it as if it were source code for the purposes of the class.

This demo also includes an image with changes on a branch for examination of image diff on GitHub.

Note that I'm not associated with that Git repository.

Explanation for tar:

  • flag -x = extract
  • flag -O = don't write the file to filesystem but write to STDOUT
  • flag -f = specify a file

Then the rest is just piping the result to zcat, which prints the uncompressed plaintext to STDOUT.
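
For comparison, a rough Python counterpart of that pipeline for a single named member might look like the sketch below. The function name `read_gz_member` is hypothetical, and the example assumes the member is an ordinary gzip file.

```python
import gzip
import tarfile


def read_gz_member(tar_path, member_name):
    """Return the decompressed bytes of one gzip member of a tar archive.

    Roughly what `tar -Oxf ARCHIVE MEMBER | zcat` prints, but returned
    as bytes instead of written to STDOUT.
    """
    with tarfile.open(tar_path) as tar:
        raw = tar.extractfile(member_name)   # KeyError if the name is absent
        if raw is None:                      # e.g. the member is a directory
            raise ValueError(f"{member_name} is not a regular file")
        with raw:
            return gzip.decompress(raw.read())
```

With the archive built above, `read_gz_member("hellogitworld.tar", "hellogitworld/README.txt.gz").decode()` should give the same text as the shell command.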
