In Python, I would like to read a huge CSV file stored in a zip file, but only a fixed-size chunk at a time

I am working with a 468 MB zip file that contains a single file, which is a CSV text file. I don't want to extract the entire text file, so I read the zip file a binary chunk at a time. The chunk size is something like 65536 bytes.

I know I can read the file with Python's csv module, but in this case the chunks that I feed it will not necessarily end on a line boundary.

How can I do this? (p.s., I do not want to have to use Pandas)

Thanks.

CodePudding user response：

You just need to do something like:

import zipfile
import io
import csv


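with zipfile.ZipFile("test.zip") as zipf:
    # zipf.open() returns a binary stream that decompresses on demand
    with zipf.open("test.csv", "r") as f:
        # wrap it as text; newline='' lets the csv module handle line endings
        reader = csv.reader(io.TextIOWrapper(f, newline=''))
        for row in reader:
            do_something(row)  # placeholder for your own per-row processing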

Assuming you have a zip archive like:

jarrivillaga$ unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
1308888890  04-01-2022 16:23   test.csv
---------                     -------
1308888890                     1 file

Note that zipf.open returns a binary stream, so you can wrap it in an io.TextIOWrapper to get a text stream, which works with either csv.reader or csv.DictReader.
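
For the csv.DictReader variant, a minimal sketch might look like the following. The archive is built in memory only so the example runs on its own; the column names, the sample rows, and the UTF-8 encoding are assumptions for illustration, not something taken from the question.

```python
import csv
import io
import zipfile

# Build a tiny archive in memory so the sketch runs without files on disk.
# The member name matches the answer; the contents are made up.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.writestr("test.csv", "name,qty\r\nwidget,3\r\ngadget,5\r\n")
archive.seek(0)

with zipfile.ZipFile(archive) as zipf:
    with zipf.open("test.csv") as f:
        # Binary stream -> text stream; the encoding is an assumption here.
        text = io.TextIOWrapper(f, encoding="utf-8", newline="")
        # DictReader takes the first row as the header and yields one
        # mapping per remaining row.
        records = [dict(row) for row in csv.DictReader(text)]

print(records)
```

With a real archive you would pass the path to zipfile.ZipFile instead of the in-memory buffer, and handle each row inside the loop rather than collecting them all in a list.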

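This should read the file in reasonably sized chunks by default rather than loading it all at once: zipfile.ZipExtFile inherits from io.BufferedIOBase and decompresses data only as it is requested, and io.TextIOWrapper pulls from it a few kilobytes at a time. The exact read size is an implementation detail and is not necessarily io.DEFAULT_BUFFER_SIZE.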
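
If you do want to control the chunk size yourself, for example the 65536 bytes mentioned in the question, one possible approach is a small generator that reads fixed-size binary chunks, decodes them incrementally, and yields only complete lines. csv.reader accepts any iterator of strings, so it can consume that generator directly. This is a sketch under assumptions: iter_lines is a hypothetical helper, not part of the standard library, the data is assumed to be UTF-8, and rows are assumed to end in "\n" or "\r\n" (bare "\r" line endings are not handled).

```python
import codecs
import csv
import io
import zipfile


def iter_lines(binary_stream, chunk_size=65536, encoding="utf-8"):
    """Yield complete text lines from a binary stream read in fixed-size chunks.

    Hypothetical helper: the partial line at the end of each chunk is held
    back until the next chunk completes it.
    """
    # The incremental decoder copes with multi-byte characters that are
    # split across two chunks.
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            break
        pending += decoder.decode(chunk)
        lines = pending.split("\n")
        pending = lines.pop()  # incomplete last line, or "" if none
        for line in lines:
            yield line + "\n"
    # Flush the decoder and emit a final line that has no trailing newline.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending


# Demo with an in-memory archive; a tiny chunk size forces awkward splits
# (inside "\r\n", inside a quoted field, and inside a multi-byte character).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.writestr("demo.csv", 'id,note\r\n1,"two\nlines"\r\n2,caf\u00e9\r\n')
archive.seek(0)

with zipfile.ZipFile(archive) as zipf:
    with zipf.open("demo.csv") as f:
        rows = list(csv.reader(iter_lines(f, chunk_size=7)))

print(rows)
```

Because the generator only splits on "\n" and keeps the line ending attached, a quoted field that contains a newline is still reassembled by csv.reader, just as it would be when iterating over a file opened with newline=''.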