I have a huge CSV (around 30 GB), and I need to break it into smaller CSV files. The traditional approach of using the skiprows
argument seems to take a lot of time. I think the process could be much faster if, after reading the initial block of rows
(say 1000), we deleted those rows from the CSV file, so that after every iteration we wouldn't have to skip rows, which essentially means re-reading that many lines every time.
Is there any way to implement this?
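For context, my current approach looks roughly like this (the file names and the chunk size are just placeholders):

import pandas as pd

# Rough sketch of the slow approach: each iteration re-reads the file and
# skips all previously processed rows, so the skipped work grows every time.
chunk_size = 1000
chunk_number = 0
while True:
    df = pd.read_csv(
        "large.csv",
        skiprows=range(1, chunk_number * chunk_size + 1),  # keep the header row
        nrows=chunk_size,
    )
    if df.empty:
        break
    df.to_csv(f"chunk_{chunk_number:03}.csv", index=False)
    chunk_number += 1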
CodePudding user response:
To conserve memory, it would be better to read your large CSV file in chunks rather than attempting to load the whole file at once. Each chunk can then fit comfortably in memory. This is done with the chunksize parameter of read_csv().
Each chunk is returned as its own dataframe, which can then be written out to a separate CSV file as required. For example:
import pandas as pd

# Read the large file in chunks of 1000 rows and write each chunk
# to its own numbered CSV file.
with pd.read_csv("large.csv", chunksize=1000) as reader:
    for chunk_number, df_chunk in enumerate(reader, start=1):
        print(chunk_number)  # simple progress indicator
        df_chunk.to_csv(f"large_chunk{chunk_number:03}.csv", index=False)
This would create multiple output CSV files named large_chunk001.csv and so on. Each chunk would contain 1000 rows (I would suggest using a much larger number).
Each output CSV file also automatically gets the same header row as the source file.
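As a quick sanity check (assuming the chunk files are written to the current directory, and using the hypothetical names above), you could read just the headers back and confirm they match the source file:

import pandas as pd

# nrows=0 parses only the header row, so this is cheap even for a 30 GB file.
original_columns = list(pd.read_csv("large.csv", nrows=0).columns)
chunk_columns = list(pd.read_csv("large_chunk001.csv", nrows=0).columns)
assert chunk_columns == original_columns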