I am working with a 468 MB zip file that contains a single file, which is a CSV text file. I don't want to extract the entire text file, so I read the zip file a binary chunk at a time. The chunk size is something like 65536 bytes.
I know I can read the file with Python's csv module, but in this case the chunks that I feed it will not necessarily fall on line boundaries.
How can I parse the CSV this way? (P.S. I do not want to use Pandas.)
Thanks.
CodePudding user response:
You just need to do something like:
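import csv
import io
import zipfile

with zipfile.ZipFile("test.zip") as zipf:
    # zipf.open() decompresses lazily; nothing is extracted to disk
    with zipf.open("test.csv", "r") as f:
        # wrap the binary stream as text; newline='' is what the csv module expects
        reader = csv.reader(io.TextIOWrapper(f, newline=''))
        for row in reader:
            do_something(row)  # placeholder for your own per-row handling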
Assuming you have a zip archive like:
jarrivillaga$ unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
1308888890  04-01-2022 16:23   test.csv
---------                     -------
1308888890                     1 file
Note that zipf.open returns a binary stream, so you can just wrap it in an io.TextIOWrapper to make it a text stream, which works with either csv.reader or csv.DictReader objects.
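For the csv.DictReader variant, here is a minimal sketch. The helper name iter_dict_rows and the utf-8 default are my own assumptions, not anything from the question, so adjust the encoding to whatever your file actually uses.

```python
import csv
import io
import zipfile


def iter_dict_rows(zip_source, member, encoding="utf-8"):
    """Yield each CSV row of `member` as a dict keyed by the header row.

    `zip_source` may be a path or a seekable binary file object.
    The encoding default is an assumption; pass the real one if it differs.
    """
    with zipfile.ZipFile(zip_source) as zipf:
        with zipf.open(member) as f:
            # newline="" lets the csv module handle line endings itself
            text = io.TextIOWrapper(f, encoding=encoding, newline="")
            yield from csv.DictReader(text)
```

Because it is a generator, rows are produced one at a time, so something like `for row in iter_dict_rows("test.zip", "test.csv"): ...` should never hold the whole file in memory.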
This should read the data in reasonably sized chunks by default, probably whatever io.DEFAULT_BUFFER_SIZE is, because looking at the zipfile.ZipExtFile source code, it inherits from io.BufferedIOBase.
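If you really do want to control the chunk size yourself, as described in the question, something like the following sketch might work. It relies on the fact that csv.reader accepts any iterable of strings, so the only job is to turn arbitrary byte chunks back into whole lines. The names iter_chunks, iter_lines and read_rows are hypothetical helpers, and I am assuming utf-8 text with \n or \r\n line endings.

```python
import codecs
import csv
import io
import zipfile


def iter_chunks(binary_file, chunk_size=65536):
    """Yield successive byte chunks until the stream is exhausted."""
    while True:
        chunk = binary_file.read(chunk_size)
        if not chunk:
            break
        yield chunk


def iter_lines(chunks, encoding="utf-8"):
    """Reassemble byte chunks into text lines, line endings kept.

    The incremental decoder holds back a multi-byte character that is
    split across two chunks; `pending` holds back a partial line.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    for chunk in chunks:
        pending += decoder.decode(chunk)
        # everything before the last "\n" is complete; the rest waits
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line + "\n"
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending  # last line without a trailing newline


def read_rows(zip_source, member, chunk_size=65536):
    """Yield parsed CSV rows from `member`, reading chunk_size bytes at a time."""
    with zipfile.ZipFile(zip_source) as zipf:
        with zipf.open(member) as f:
            yield from csv.reader(iter_lines(iter_chunks(f, chunk_size)))
```

Quoted fields that contain embedded newlines should still parse, since csv.reader simply asks for the next line when a quote is left open. In most cases the io.TextIOWrapper approach above is simpler and does the same buffering for you.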