How to write a table encoded as a list of dictionaries directly to a zipped archive containing a CSV

Time:07-15

Suppose you have data in the form of a list of dictionaries like d here:

d = [{'a' : 1, 'b' : 2}, {'a' : 3, 'c' : 5}]

and you want to save it as a comma-separated table to a CSV inside a zipped archive (a .zip archive, not gzip), without going through, e.g., pandas.DataFrame.from_dict().

Why not via pandas? Because in real practice d may correspond to a very large but very sparse DataFrame, i.e. a table with many more columns than non-NA values per row, which for some reason occupies a huge amount of memory once densified. (BTW this is not just a theory: it crashed our scripts several times, hence our need to work around it.)

d is a sort of unpivoted-in-disguise version of the data: each dictionary contains only the relevant values, not a long sequence of useless NAs.
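To make the "unpivoted" idea concrete, here is a hypothetical illustration (not part of the original workflow): the same records viewed as (row, column, value) triples, which is all the information d actually carries; a dense table would pad every missing cell with an NA.

```python
# Hypothetical illustration: the sparse records store only the values
# that exist, i.e. the equivalent of (row, column, value) triples.
d = [{'a': 1, 'b': 2}, {'a': 3, 'c': 5}]
triples = [(i, k, v) for i, row in enumerate(d) for k, v in row.items()]
print(triples)  # [(0, 'a', 1), (0, 'b', 2), (1, 'a', 3), (1, 'c', 5)]
```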

From the csv module's documentation I learned how to write d directly to a CSV:

import csv

# newline='' stops the text layer from translating the csv module's
# own '\r\n' line endings a second time
with open('test.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['a', 'b', 'c'])
    writer.writeheader()
    writer.writerows(d)

but I don't see any option to write to a zipped CSV.

I consulted the documentation of zipfile, but I could not make it work, due to the usual text-versus-bytes mismatch: ZipFile.open() returns a binary file handle, while csv.DictWriter writes str.

import os
import csv
import zipfile

if os.path.exists('test.csv.zip'):
    os.remove('test.csv.zip')
with zipfile.ZipFile('test.csv.zip', mode='a') as zf:
    with zf.open('test.csv', 'w') as csvfile:  # binary handle
        writer = csv.DictWriter(csvfile, fieldnames=['a', 'b', 'c'])
        writer.writeheader()
        writer.writerows(d)

# TypeError: a bytes-like object is required, not 'str'

Can anyone think of a workaround, or maybe a radically different approach that I am not seeing?

The fundamental constraints are:

  1. d is always going to be generated: this we cannot decide or change
  2. avoid generating very large intermediate objects that consume as much memory or disk space as the dense DataFrame produced by pandas.DataFrame.from_dict()
  3. the data must be written to a csv.zip archive.

Otherwise we would simply write to a plain CSV, hoping it is not too huge (but yeah, that was the initial issue, so...), and zip it afterwards.
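For reference, one stdlib-only sketch of such a workaround (assuming Python 3.6+, where ZipFile.open() supports mode 'w') is to bridge the text/bytes gap with io.TextIOWrapper:

```python
import csv
import io
import zipfile

d = [{'a': 1, 'b': 2}, {'a': 3, 'c': 5}]

with zipfile.ZipFile('test.csv.zip', mode='w') as zf:
    with zf.open('test.csv', 'w') as raw:
        # TextIOWrapper turns the binary zip entry into a text stream, so
        # csv.DictWriter can write str to it; newline='' leaves the csv
        # module's own '\r\n' line endings untouched.
        with io.TextIOWrapper(raw, encoding='utf-8', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=['a', 'b', 'c'])
            writer.writeheader()
            writer.writerows(d)
```

Opening the archive with mode='w' also makes the os.path.exists/os.remove dance unnecessary, since 'w' truncates any existing archive.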


EDIT: posting the implementation from Daweo's answer, for completeness.

import os
import zipfile
import csv
import codecs

utf8 = codecs.getwriter('utf_8')  # or another encoding dictated by requirements

output_zip_file = 'test.csv.zip'

if os.path.exists(output_zip_file):
    os.remove(output_zip_file)
with zipfile.ZipFile(output_zip_file, mode='a') as zf:
    with zf.open('out.csv', 'w') as csvfile:
        writer = csv.DictWriter(utf8(csvfile), fieldnames=['a', 'b', 'c'])
        writer.writeheader()
        writer.writerows(d)
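A sketch of the same scheme with compression enabled and a read-back check. Note that ZipFile defaults to ZIP_STORED, i.e. no compression at all, so passing compression=zipfile.ZIP_DEFLATED is worth considering when the whole point is saving space:

```python
import codecs
import csv
import io
import zipfile

d = [{'a': 1, 'b': 2}, {'a': 3, 'c': 5}]
utf8 = codecs.getwriter('utf_8')

# Same writing scheme as above, but with ZIP_DEFLATED: the ZipFile
# default is ZIP_STORED, which stores the CSV uncompressed.
with zipfile.ZipFile('test.csv.zip', mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    with zf.open('out.csv', 'w') as csvfile:
        writer = csv.DictWriter(utf8(csvfile), fieldnames=['a', 'b', 'c'])
        writer.writeheader()
        writer.writerows(d)

# Read back through the archive to verify the round trip.
with zipfile.ZipFile('test.csv.zip') as zf:
    with zf.open('out.csv') as raw:
        rows = list(csv.DictReader(io.TextIOWrapper(raw, encoding='utf-8')))
print(rows)
# [{'a': '1', 'b': '2', 'c': ''}, {'a': '3', 'b': '', 'c': '5'}]
```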

CodePudding user response:

You might use codecs.StreamWriter if you want to use csv.DictWriter with a binary file handle; consider the following simple example:

import csv
import codecs

utf8 = codecs.getwriter('utf_8')  # or another encoding dictated by requirements

with open("file.csv", "wb") as f:
    writer = csv.DictWriter(utf8(f), fieldnames=['a', 'b', 'c'])
    writer.writeheader()
    writer.writerows([{'a': 1}, {'b': 2}, {'c': 3}])

creates file.csv holding

a,b,c
1,,
,2,
,,3
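One practical note on fieldnames: with a large sparse d the column set is rarely known up front. A sketch of collecting it from the data itself, at the cost of one extra pass over d but with no dense intermediate object:

```python
# Collect the union of keys across all records, so fieldnames need not
# be hard-coded; sorted() just gives a deterministic column order.
d = [{'a': 1, 'b': 2}, {'a': 3, 'c': 5}]
fieldnames = sorted(set().union(*d))
print(fieldnames)  # ['a', 'b', 'c']
```

As a reminder of DictWriter's defaults: keys missing from a row are filled with restval (default ''), while keys absent from fieldnames raise ValueError unless extrasaction='ignore' is passed.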