Python Pandas: read_csv with chunksize and concat still throws MemoryError


I am trying to extract certain rows from a 10 GB, ~35 million row CSV file into a new CSV, based on the value of one column (Geography = Ontario). The script runs for a few minutes, I can watch my free hard drive space drain from 14 GB to basically zero, and then I get a MemoryError. I thought chunksize would help here, but it did not :( Please advise.

import pandas as pd
df = pd.read_csv("Data.csv", chunksize = 10000)
result = pd.concat(df)
output=result[result['Geography']=='Ontario']
rowcount=len(output)
print(output)
print(rowcount)
output.to_csv('data2.csv')

CodePudding user response:

You can filter and write each chunk as you read it, instead of concatenating everything into one DataFrame first. Roughly:

import pandas as pd

outfilename = 'data2.csv'   # the output file from your question
df = pd.read_csv("Data.csv", chunksize=10000)
header = True
for chunk in df:
    # keep only the matching rows from this chunk
    chunk = chunk[chunk['Geography'] == 'Ontario']
    # append the filtered chunk to the output file; write the header only once
    chunk.to_csv(outfilename, header=header, mode='a')
    header = False

Idea from here.
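If you also need the row count your original code computes (rowcount = len(output)), you can keep a running total while streaming. A minimal sketch, assuming the same file names and 'Geography' column as in your question (index=False is an extra choice here to avoid writing the pandas index column):

import pandas as pd

outfilename = "data2.csv"
rowcount = 0
header = True
for chunk in pd.read_csv("Data.csv", chunksize=10000):
    matches = chunk[chunk['Geography'] == 'Ontario']
    rowcount += len(matches)                                         # count matching rows as we go
    matches.to_csv(outfilename, header=header, mode='a', index=False)
    header = False                                                   # write the header only once

print(rowcount)

Only one filtered chunk is in memory at a time, so neither the full 10 GB file nor the full result set ever has to fit in RAM.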
