I am trying to extract certain rows from a 10 GB (~35 million row) CSV file into a new CSV, based on the value of one column (Geography == 'Ontario'). It runs for a few minutes, I can see my free hard-drive space drain from 14 GB to basically zero, and then I get a MemoryError. I thought chunksize would help here, but it did not :( Please advise.
import pandas as pd
df = pd.read_csv("Data.csv", chunksize = 10000)
result = pd.concat(df)
output=result[result['Geography']=='Ontario']
rowcount=len(output)
print(output)
print(rowcount)
output.to_csv('data2.csv')
CodePudding user response:
You can try filtering and writing chunk by chunk instead of concatenating everything first (your pd.concat(df) pulls all 35 million rows back into memory, which defeats the purpose of chunksize). Roughly:
import pandas as pd

outfilename = 'data2.csv'
header = True
for chunk in pd.read_csv("Data.csv", chunksize=10000):
    chunk = chunk[chunk['Geography'] == 'Ontario']
    # Write the header only once, then append; index=False avoids
    # writing each chunk's own (restarting) row index into the file.
    chunk.to_csv(outfilename, header=header, mode='a', index=False)
    header = False
Idea from here.
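Since your original script also printed the row count, you can tally it while streaming, so that nothing beyond one chunk is ever held in memory. A minimal sketch, wrapped in a (hypothetical) helper function; the file paths and the Geography column name are taken from your question:

```python
import pandas as pd

def filter_csv(infile, outfile, column, value, chunksize=10_000):
    """Stream infile in chunks, write rows where column == value
    to outfile, and return how many rows matched."""
    rowcount = 0
    header = True
    for chunk in pd.read_csv(infile, chunksize=chunksize):
        matched = chunk[chunk[column] == value]
        rowcount += len(matched)
        # mode='w' truncates any stale output file on the first chunk;
        # 'a' appends afterwards without repeating the header row.
        matched.to_csv(outfile, mode="w" if header else "a",
                       header=header, index=False)
        header = False
    return rowcount

# Usage for the files in the question:
# n = filter_csv("Data.csv", "data2.csv", "Geography", "Ontario")
# print(n)
```

If the file is wide, passing usecols= to read_csv so that only the columns you actually need are parsed cuts memory use further.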