I have a 600 MB CSV file that I load with pandas' read_csv using one of the two methods below.
import io

import pandas as pd

def read_my_csv1():
    df = pd.read_csv('my_data.csv')
    print(len(df))

def read_my_csv2():
    with open('my_data.csv') as f:
        file_contents = f.read()
    data_frame = pd.read_csv(io.StringIO(file_contents))
    print(len(data_frame))
The first method gives a peak memory usage of 1 GB. The second method gives a peak memory usage of 4 GB. I measure the peak memory usage with fil-profile.
How can the difference be so large? Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?
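For a quick self-contained check of where the blow-up comes from, here is a scaled-down repro using the stdlib's tracemalloc instead of fil-profile (sizes and exact numbers are illustrative and depend on the CPython version):

import io
import tracemalloc

text = "x," * 5_000_000  # ~10 MB of ASCII text standing in for the CSV contents

tracemalloc.start()
buf = io.StringIO(text)  # StringIO copies the string into its own internal buffer
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# ~40 MB here, i.e. ~4x the text size, on CPython versions where
# StringIO stores its data in a Py_UCS4 buffer
print(f"peak while creating StringIO: {peak / 1e6:.0f} MB")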
CodePudding user response:
How can the difference be so large?
StringIO uses a buffer of type Py_UCS4 [source]. That is a 32-bit datatype, while the CSV file is probably ASCII or UTF-8, so each character takes 4 bytes instead of 1. That overhead of factor 3 accounts for an additional ~1.8 GB. Also, the StringIO buffer may overallocate by 12.5% [source].
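You can see the 4-bytes-per-character cost directly on plain strings (a sketch relying on CPython's flexible string representation; exact overheads vary by version):

import sys

ascii_text = "a" * 1_000_000           # pure ASCII: CPython stores 1 byte/char
ucs4_text = "\U0001F600" + ascii_text  # one astral character forces 4 bytes/char

print(sys.getsizeof(ascii_text))  # roughly 1 MB
print(sys.getsizeof(ucs4_text))   # roughly 4 MB -- the width of a Py_UCS4 buffer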
Best case:

file_contents      600 MB
io.StringIO       2400 MB
data_frame         600 MB (at least)
DLLs, EXEs, ...      ? MB
-------------------------
                  3600 MB (at least)
Case with 12.5% overallocation:

file_contents      600 MB
io.StringIO       2700 MB
data_frame         600 MB (at least)
DLLs, EXEs, ...      ? MB
-------------------------
                  3900 MB (at least)
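The totals above fall out of simple arithmetic (a sanity check, not a measurement):

csv_mb = 600                         # file size, assumed ASCII/UTF-8 (1 byte/char)
stringio_mb = 4 * csv_mb             # Py_UCS4 buffer: 4 bytes/char -> 2400 MB
overalloc_mb = stringio_mb * 1.125   # +12.5% overallocation -> 2700 MB

print(csv_mb + stringio_mb + csv_mb)   # 3600 (best case, at least)
print(csv_mb + overalloc_mb + csv_mb)  # 3900.0 (with overallocation, at least)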
Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

- del the temporary objects.
- Don't use StringIO; a sketch of an alternative follows.
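One way to avoid StringIO, assuming the data can be handled as raw bytes: io.BytesIO stores the bytes as-is, so no Py_UCS4 widening ever happens (function name is just illustrative; pandas accepts any binary file-like object):

import io

import pandas as pd

def read_my_csv_bytes():
    # Read raw bytes instead of decoding to str; read_csv decodes incrementally.
    with open('my_data.csv', 'rb') as f:
        raw = f.read()
    data_frame = pd.read_csv(io.BytesIO(raw))
    print(len(data_frame))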
CodePudding user response:
It looks like StringIO maintains its own copy of the string data, so at least temporarily you have three copies of your data in memory: one in file_contents, one in the StringIO object, and one in the final dataframe. Meanwhile, it is at least theoretically possible for read_csv to read the input file line by line, and thereby keep only one copy of the full data, in the final dataframe, when reading directly from the file.

You could try deleting file_contents after creating the StringIO object and see if that improves things; a sketch follows.
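A minimal sketch of that suggestion (whether it helps depends on whether StringIO's internal copy and the growing dataframe still coexist at the peak):

import io

import pandas as pd

def read_my_csv2_del():
    with open('my_data.csv') as f:
        file_contents = f.read()
    buf = io.StringIO(file_contents)
    del file_contents  # drop the plain-str copy before parsing starts
    data_frame = pd.read_csv(buf)
    print(len(data_frame))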