How to read and write large processed data to a CSV file in chunks using the csv module in Python


I am reading a CSV file with the csv module (reader), processing it into a much larger dataset (roughly 5 GB, more than a million records), and writing the result to another CSV with the csv module (writer). Reading the data is fine, but once the data is processed it grows so large that the code hits a memory error, so I'm trying to write it out in chunks. I'm not using pandas because it is too slow; I tried it before and it took days to complete, so I'm using the csv module to read and write. I don't know how to write to the CSV file in chunks to avoid the memory issue.

I want to write to the CSV file in chunks. I first read the CSV header and write it, then loop over the body rows and write them to the same CSV file. How do I write this body data in chunks using the csv module (not pandas)?

import csv

def getmodified(make):
    # expand a ["lo", "hi"] pair into a list of zero-padded strings
    initialize = int(make[0])
    ending = int(make[1])
    m = len(make[0])
    e = "{0:0" + str(m) + "}"   # e.g. "{0:02}" for two-digit zero-padding
    ret = []
    for l in range(initialize, ending + 1):
        ret += [e.format(l)]
    return ret
     

with open("input.csv", "r") as f:
    reader = csv.reader(f)
    writer = csv.writer(open("out.csv.csv","a", newline=""))
    header = next(reader)
    writer.writerow(header ["column_modified"])
    for r in reader:
        to_change = r[0]
        if "(" not in to_change:
            writer.writerow(r [to_change])
        else:    
         static = to_change[:to_change.index("(")]
         make = to_change[len(static) 1:-1].split("-") 
         modify = getmodified(make)
         for each in modify:
            modified = r [static each]
            writer.writerow(modified)


In the getmodified(make) function I'm expanding the ranges and appending everything. I get the error after the results start being written to the CSV: it manages to write about 2.7 GB of data, then it shows a memory error.

CodePudding user response:

You are on the right path: reading a row, processing it, and writing the processed results out immediately using the csv module. I don't know of a less memory-intensive approach to processing a (large) CSV file in Python.
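
For reference, the bare skeleton of that pattern looks like this (the per-row processing here is just a placeholder, not your exact logic):

import csv

with open("input.csv", "r", newline="") as f_in, open("out.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader) + ["column_modified"])
    for row in reader:
        # process one row and write the result(s) immediately;
        # nothing accumulates in memory between rows
        writer.writerow(row + [row[0]])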

I think the single biggest problem your program has is that it isn't producing correct output.

Given this simple input:

column
1234(00-01)
1234(00-01)

I believe you expect this output:

column column_modified
1234(00-01) 123400
1234(00-01) 123401

But when I run your program on that input I get:

column column_modified
1234(00-01) 123400
1234(00-01) 123401
1234(00-01) 123400
1234(00-01) 123401

Why?

Your getmodified() function is correct, and the loop that applies the modifications is correct, but you run that function and the loop for every row, when you should run them once per set of rows that represents a range you need to expand.

This has the effect of squaring every set of rows that represents a range to expand:

column
1234(00-02)
1234(00-02)
1234(00-02)
5678(101-110)
5678(101-110)
5678(101-110)
5678(101-110)
5678(101-110)
5678(101-110)
5678(101-110)
5678(101-110)
5678(101-110)
5678(101-110)

Your program, run on those 13 rows, will produce 109 rows (3x3 + 10x10):

  • 3 rows for every row of 1234(00-02)
  • 10 rows for every row of 5678(101-110)

That is adding to your runtime and slowing things down; it also magnifies a small memory problem in getmodified().

I created a small test setup that measures the memory consumption of your program run against larger and larger ranges. The test files look like this:

test-0_9.csv
============
column
1234(0-9)

test-00_99.csv
==============
column
1234(00-99)

and so on, up to test-000000_999999.csv (1 row expanded into 1M rows).
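
If you want to take similar measurements yourself, the standard library's tracemalloc module is one way to do it (just a sketch; any memory profiler will give you comparable numbers):

import tracemalloc

tracemalloc.start()

# ... run the CSV-processing code here ...

current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()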

Using 0_9 as a baseline for memory usage (100%), here's how much more memory your program consumes when run on larger and larger ranges:

0_9-result.txt            100.00%
00_99-result.txt          100.54%
000_999-result.txt        104.29%
0000_9999-result.txt      109.00%
00000_99999-result.txt    192.68%
000000_999999-result.txt  643.30%
Most mem used: 82.16MB

Given the nature of the problem, this should be flat—processing a range of 1M should consume about as much memory as processing a range of 10.

I tweaked your program:

  • I got rid of getmodified() in favor of doing everything in the else-clause
  • the else-clause now splits the range into lo and hi, converts lo and hi to the ints lower and upper, loops over the range lower to upper (plus 1), and converts each number to a zero-padded string
...

else:
    static = to_change[:to_change.index("(")]
    lo, hi = to_change[len(static) + 1:-1].split("-")
    lower = int(lo)
    upper = int(hi)
    for num in range(lower, upper + 1):
        padded_num = "{0:0{pad_len}}".format(num, pad_len=len(lo))
        modified = r + [static + padded_num]
        writer.writerow(modified)
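
(For example, "{0:0{pad_len}}".format(7, pad_len=3) produces "007", the same zero-padding your original format string applied.)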

From my tests this produces the same output CSV, and now my memory usage curve looks flat:

0_9-result.txt            100.00%
00_99-result.txt          100.27%
000_999-result.txt        101.08%
0000_9999-result.txt      102.13%
00000_99999-result.txt     99.74%
000000_999999-result.txt  101.31%
Most mem used: 6.03MB

I didn't actually profile this, so I cannot say for certain what was happening with your program, but passing the list of padded strings back from getmodified() required creating those lists, which take up memory and then have to be garbage collected. You had lists of tens, hundreds, thousands (or more?) of strings sitting in memory, and your program was building n of those n-element lists for every n-row range. I imagine this is what ultimately exhausted your memory.
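
If you do want to keep a helper function, making it a generator that yields each padded string, instead of building and returning a whole list, avoids the same problem. A sketch:

def genmodified(make):
    # yield padded strings one at a time instead of
    # materializing the whole list in memory
    lower, upper = int(make[0]), int(make[1])
    width = len(make[0])
    for num in range(lower, upper + 1):
        yield "{0:0{width}}".format(num, width=width)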

Note that my tweak for the memory usage doesn't fix the correctness of your program. You still need to figure this out. The easiest thing I can think of would be to unique-ify the input in a first, pre-processing pass (sketched after the examples below), so that:

column
1234(01-03)
1234(01-03)
1234(01-03)
32156
543(000-004)
543(000-004)
543(000-004)
543(000-004)
543(000-004)

becomes:

column
1234(01-03)
32156
543(000-004)

then feed that to your program... it will still duplicate rows for each range it processes, but I believe you'll end up with the correct output:

column,column_modified
1234(01-03),123401
1234(01-03),123402
1234(01-03),123403
32156,32156
543(000-004),543000
543(000-004),543001
543(000-004),543002
543(000-004),543003
543(000-004),543004
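
Here is one way to do that pre-processing pass (just a sketch; it only collapses consecutive duplicates, which is what your samples show, and writes the result to a new file I'm calling input-unique.csv):

import csv

with open("input.csv", "r", newline="") as f_in, open("input-unique.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))      # header
    previous = None
    for row in reader:
        # skip consecutive duplicate rows
        if row != previous:
            writer.writerow(row)
            previous = row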

Final thoughts

I don't see the need to open the output CSV in "append" mode. I read in a previous version of this post that you were using append because you wanted to see the results as they were processing, and you wanted to see them sooner rather than later. Do I remember correctly? Append doesn't control how quickly the csv module/Python writes to a file. That's a buffering issue, and Python will do the best it can to reduce the number of writes (and thus the latency of the entire process). In my opinion, append is only for logging: where writes are intermittent, and the process making the writes will open and close the file numerous times. Append only means "start at the end".

And, finally, I know you've been trying to get this correct and have been having issues. Throughout the past three posts on this problem I get the sense that you're very anxious and have been a bit overwhelmed by how long this has taken. That said, I kindly urge you to slow your thinking down and approach this in measured steps. I especially feel compelled to say this because you missed a very big issue in your logic that was compounding your problems. Did you run your program on the samples you provided us? It was the first thing I did. It took me about a minute to understand why the output file was so big... but when I did, suddenly everything else started making sense.

Good luck :)

CodePudding user response:

Use the csv module in combination with the enumerate function.

See the code below for the general idea.

import csv

# Number of rows to buffer before each write (the "chunk" size)
CHUNK_SIZE = 1000

# Open the input and output files
with open("input.csv", "r", newline="") as input_file, open("output.csv", "w", newline="") as output_file:
    # Create a CSV reader and writer
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)

    # Process the data in chunks
    processed_rows = []
    for i, row in enumerate(reader, start=1):
        # Process the row here, then add it to the current chunk
        processed_rows.append(row)

        # Write out the chunk and print a message every CHUNK_SIZE rows
        if i % CHUNK_SIZE == 0:
            writer.writerows(processed_rows)
            processed_rows = []
            print(f"Processed {i} rows")

    # Write whatever is left over at the end
    if processed_rows:
        writer.writerows(processed_rows)

It reads the input file row by row, buffers CHUNK_SIZE rows at a time, writes each chunk to the output file, and prints a message every CHUNK_SIZE rows to give you an idea of its progress.

You can adjust the chunk size to suit your needs. A larger chunk size may be more efficient, but it also holds more rows in memory at once, so you may run into the memory errors you described above.
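
If you'd rather pull the rows in explicit batches instead of counting them with enumerate, itertools.islice also works well with csv.reader (an alternative sketch, not a drop-in for the code above):

import csv
import itertools

CHUNK_SIZE = 1000

with open("input.csv", "r", newline="") as input_file, open("output.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)

    while True:
        # Take the next CHUNK_SIZE rows from the reader (fewer at the end of the file)
        chunk = list(itertools.islice(reader, CHUNK_SIZE))
        if not chunk:
            break
        # Process the chunk here, then write it out in a single call
        writer.writerows(chunk)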
