I understand that generators in Python can help when reading and processing large files if only specific transformations or outputs are needed from the file (e.g. reading a specific column or computing an aggregation).
However, it's not clear to me whether there is any benefit to using generators in Python when the only purpose is to read the entire file.
Edit: Assuming your dataset fits in memory.
Related: Lazy Method for Reading Big File in Python?
pd.read_csv('sample_file.csv', chunksize=chunksize)
vs.
pd.read_csv('sample_file.csv')
Are generators useful when the goal is just to read all of the data, without any data processing?
CodePudding user response:
The DataFrame you get from pd.read_csv('sample_file.csv') might fit into memory; however, pd.read_csv itself is a memory-intensive function, so while reading a file that will end up consuming 10 gigabytes of memory, your actual peak memory usage may exceed 30-40 gigabytes. In cases like this, reading the file in smaller chunks might be the only option.
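As a rough sketch of the chunked approach (the helper name read_csv_in_chunks, the tiny generated file, and the chunk size of 3 are my own choices for illustration, not something from the question): passing chunksize makes pd.read_csv return an iterator of DataFrames, and pd.concat can stitch those pieces back into a single DataFrame.

```python
import os
import tempfile

import pandas as pd


def read_csv_in_chunks(path, chunksize):
    """Read an entire CSV by parsing `chunksize` rows at a time.

    Hypothetical helper for illustration only.
    """
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of parsing the whole file in a single call.
    chunks = pd.read_csv(path, chunksize=chunksize)
    # concat consumes the iterator and builds one combined DataFrame.
    return pd.concat(chunks, ignore_index=True)


# Tiny stand-in file; a real case would be a multi-gigabyte CSV.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_file.csv")
    pd.DataFrame({"a": range(10), "b": range(10, 20)}).to_csv(path, index=False)

    whole = pd.read_csv(path)                        # one-shot read
    chunked = read_csv_in_chunks(path, chunksize=3)  # chunked read, same data
```

Both calls should produce the same DataFrame here. Keep in mind that the concatenation step still needs room for the chunks plus the combined result, so how much peak memory this actually saves depends on the data and is worth measuring on your own file.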