Peak memory usage much larger when using pandas read_csv with StringIO instead of a file object


I have a 600 MB CSV file and I load it with pandas' read_csv using one of the two methods below.

import io
import pandas as pd

def read_my_csv1():
    # Parse directly from the file path: pandas reads the file itself.
    df = pd.read_csv('my_data.csv')
    print(len(df))

def read_my_csv2():
    # Read the whole file into a string, then parse via a StringIO buffer.
    with open('my_data.csv') as f:
        file_contents = f.read()
    data_frame = pd.read_csv(io.StringIO(file_contents))
    print(len(data_frame))

The first method gives a peak memory usage of 1 GB.

The second method gives a peak memory usage of 4 GB.

I measure the peak memory usage with fil-profile.

How can the difference be so large? Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

CodePudding user response:

How can the difference be so large?

StringIO stores its buffer as Py_UCS4 [source], i.e. 4 bytes per character, while the CSV file is probably ASCII or UTF-8 at roughly 1 byte per character. That is an overhead factor of 3, accounting for an additional ~1.8 GB. On top of that, the StringIO buffer may overallocate by 12.5% [source].
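A quick way to see the expansion on CPython is tracemalloc: constructing a StringIO from a 10 MB ASCII string allocates roughly four times that. This is a rough sketch, and the exact numbers vary by Python build and version:

import io
import tracemalloc

text = "x" * 10_000_000  # 10 MB of ASCII; CPython stores it at 1 byte/char

tracemalloc.start()
buf = io.StringIO(text)  # buffer is realized as Py_UCS4: 4 bytes/char
current, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"StringIO allocated ~{current / 1e6:.0f} MB for a 10 MB string")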

Best case:

file_contents    600 MB
io.StringIO     2400 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3600 MB (at least)

Case with 12.5% overallocation:

file_contents    600 MB
io.StringIO     2700 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3900 MB (at least)

Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

  • del the temporary objects as soon as they are no longer needed.
  • Don't use StringIO. If the data must pass through memory, hand read_csv UTF-8 bytes via io.BytesIO instead, as in the sketch below.
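A hedged sketch combining both points, assuming the CSV is UTF-8 (read_my_csv3 is just an illustrative name): keeping the data as bytes costs 1 byte per character instead of StringIO's 4, and read_csv accepts binary file-like objects, so BytesIO works directly.

import io
import pandas as pd

def read_my_csv3():
    with open('my_data.csv', 'rb') as f:
        raw = f.read()        # one copy, 1 byte per character
    buf = io.BytesIO(raw)     # on CPython this may share raw's buffer rather than copy it
    del raw                   # release the name; buf keeps the data alive
    df = pd.read_csv(buf)
    print(len(df))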

CodePudding user response:

It looks like StringIO maintains its own copy of the string data, so at least temporarily you have three copies of your data in memory: one in file_contents, one in the StringIO object, and one in the final data frame. Meanwhile, when reading directly from the file, read_csv can at least in principle parse the input incrementally and thereby hold only one full copy of the data, in the final data frame.

You could try deleting file_contents after creating the StringIO object and see if that improves things.
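Concretely, a minimal version of that suggestion looks like this (hedged: it removes the file_contents copy, but the StringIO buffer itself still holds the ~4-bytes-per-character expansion):

import io
import pandas as pd

with open('my_data.csv') as f:
    file_contents = f.read()
buf = io.StringIO(file_contents)   # copies the string into StringIO's internal buffer
del file_contents                  # drop the plain-string copy before parsing
data_frame = pd.read_csv(buf)
print(len(data_frame))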
