Read in very large csv in chunks and append to one dataframe to be used for computation


I have an 8 GB CSV file and 16 GB of RAM. When I try to read the file in directly with pd.read_csv, I get a memory error. So I tried to read it in chunks using the chunksize parameter, like this:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

complete_data = pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                            dtype=dtypes, decimal=",", chunksize=1000000)
dfcompl = pd.concat(complete_data, ignore_index=True)

Again, I get a memory error. Following this solution, I tried:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        process(chunk)

dfcompl = pd.concat(chunk)

But I get the error NameError: name 'process' is not defined. Obviously I have to replace 'process' with something, but I don't know what. This looks like a simple task, and I hoped that simply adding chunksize would solve it, but I don't know how to get past this issue.

So how can I read in a large CSV file and append the chunks into one dataframe that I can then work with in pandas?

I do not want to use dask.

My problem is also that even if I process the file chunk by chunk and export each chunk, for example to pkl files, I still hit the same wall at the end: when I try to concatenate these pkl files, I get a memory error again.

CodePudding user response:

There is no real point in reading a CSV file in chunks if you want to collect all the chunks into a single dataframe afterwards - it will require ~8 GB of memory anyway. The whole idea behind chunking is to process your data in parts, so that you never need the full file in memory (imagine if your CSV were not 8 GB, but 80!). That is exactly what happens in your second snippet:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        process(chunk)

dfcompl = pd.concat(chunk)

where process is some function that performs computations over your data - you have to implement it yourself. You get a chunk of data, process it, and throw it away immediately.

Try to look at your task from this perspective: do you really need all of the data at once, or could you do something incrementally, even line by line if you take it to the extreme?
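
For example, a process function could compute a small per-chunk summary and accumulate only those summaries, so the raw 8 GB never has to sit in memory. A minimal sketch (the groupby on Column1 is just an illustrative assumption, not something from your post):

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

# Accumulate only small per-chunk results, never the chunks themselves.
partial_counts = []

def process(chunk):
    # Hypothetical aggregation: number of rows per Column1 value in this chunk.
    partial_counts.append(chunk.groupby("Column1").size())

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        process(chunk)

# Combine the small partial results into one final aggregate.
counts = pd.concat(partial_counts).groupby(level=0).sum()

This way the only thing that grows is the list of small summaries, not the data itself.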

UPDATE

So if you want to collect only lines where Column3 < 900, you could do the following:

output = None

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        # Keep only the rows of this chunk where Column3 < 900.
        filtered = chunk[chunk['Column3'] < 900]

        if output is None:
            output = filtered
            continue

        output = pd.concat([output, filtered])

output.to_csv(some_output_path)

It's not optimal and a bit ugly, but it illustrates the idea.
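
If even the filtered rows might not fit in memory, another option is to append each filtered chunk directly to a CSV on disk instead of concatenating in memory. A sketch using the same imports and dtypes as above (the output path is just a placeholder):

output_path = r'C:\folder\filtered.csv'

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for i, chunk in enumerate(reader):
        filtered = chunk[chunk['Column3'] < 900]
        # Write the header only for the first chunk, then append the rest.
        filtered.to_csv(output_path, sep=";", index=False,
                        mode='w' if i == 0 else 'a',
                        header=(i == 0))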

CodePudding user response:

I think the original post may be referring to the asker's own function 'process'.

Try this:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

dfcompl = None
with pd.read_csv(r'C:\folder\test.csv',
                 sep=";", encoding="utf-8",
                 dtype=dtypes,
                 decimal=",",
                 chunksize=1000000) as reader:
    for chunk in reader:
        if dfcompl is None:
            dfcompl = chunk
        else:
            dfcompl = pd.concat([dfcompl, chunk], axis=0)

dfcompl

dfcompl is the final dataframe you need. I initialize it with None and then, for each chunk, concatenate the previous result with the new chunk using pd.concat([dfcompl, chunk], axis=0). Note that axis=0 concatenates along the rows, i.e. appends each chunk at the end of the dataframe built so far.
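
Note that calling pd.concat inside the loop copies the growing dataframe on every iteration. If the full result fits in memory at all, it is usually faster to collect the chunks in a list and concatenate once at the end. A sketch with the same imports, dtypes and read parameters as above:

chunks = []
with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        chunks.append(chunk)

# A single concat at the end avoids repeatedly copying the growing dataframe,
# but it still needs enough RAM to hold the complete result.
dfcompl = pd.concat(chunks, ignore_index=True)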

I hope this helps.
