I have a large CSV (hundreds of millions of rows) and I need to sum the Value column, grouped by the ID, Location, and Date columns.
My CSV is similar to:
ID Location Date Value
1 1 Loc1 2022-01-27 5
2 1 Loc1 2022-01-27 4
3 1 Loc1 2022-01-28 7
4 1 Loc2 2022-01-29 8
5 2 Loc1 2022-01-27 11
6 2 Loc2 2022-01-28 4
7 2 Loc2 2022-01-29 6
8 3 Loc1 2022-01-28 9
9 3 Loc1 2022-01-28 9
10 3 Loc2 2022-01-29 1
{ID: 1, Location: Loc1, Date: 2022-01-27} is one such group, and its values 5 and 4 should be summed to 9. {ID: 3, Location: Loc1, Date: 2022-01-28} is another group, and its sum should be 18.
Here's what that sample input should look like, processed/summed, and written to a new CSV:
ID Location Date Value
1 Loc1 2022-01-27 9
1 Loc1 2022-01-28 7
1 Loc2 2022-01-29 8
2 Loc1 2022-01-27 11
2 Loc2 2022-01-28 4
2 Loc2 2022-01-29 6
3 Loc1 2022-01-28 18
3 Loc2 2022-01-29 1
I know using df.groupby([columns]).sum() would give the desired result, but the CSV is so big I keep getting memory errors. I've tried other ways to read/manipulate the CSV data but still haven't been successful, so if anyone knows a way I can do this in Python without maxing out my memory, that would be great!
NB: I know there is an unnamed first column in my initial CSV; it is irrelevant and doesn't need to be in the output, but it doesn't matter if it is :)
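For reference, when rows with the same key are consecutive (i.e. the file is sorted on the key columns), the grouping itself needs nothing beyond the standard library; this is the idea several of the answers build on. A minimal sketch, using a few hard-coded rows as a stand-in for the real file:

```python
from itertools import groupby

# Sum Value per (ID, Location, Date) key; assumes rows with the
# same key are consecutive, as they would be in a sorted file.
rows = [
    ['1', 'Loc1', '2022-01-27', '5'],
    ['1', 'Loc1', '2022-01-27', '4'],
    ['1', 'Loc1', '2022-01-28', '7'],
]
summed = [
    [*key, sum(int(r[3]) for r in grp)]
    for key, grp in groupby(rows, key=lambda r: r[:3])
]
# summed == [['1', 'Loc1', '2022-01-27', 9], ['1', 'Loc1', '2022-01-28', 7]]
```

Because groupby only ever looks at consecutive rows, the same pattern works when `rows` is a csv.reader streaming the file, which is why memory use stays flat.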
CodePudding user response:
If the rows to be combined are consecutive, the good old csv module lets you process huge files one line at a time, hence with a minimal memory footprint. Here you could use:
import csv

with open('input.csv') as fd, open('output.csv', 'w', newline='') as fdout:
    rd, wr = csv.reader(fd), csv.writer(fdout)
    _ = wr.writerow(next(rd))      # header line
    old = [None] * 4
    for row in rd:
        row[3] = int(row[3])       # convert the value field to an integer
        if row[:3] == old[:3]:
            old[3] += row[3]       # accumulate values of consecutive similar rows
        else:
            if old[0]:             # ...and write the accumulated row
                _ = wr.writerow(old)
            old = row
    if old[0]:                     # do not forget the last row...
        _ = wr.writerow(old)
With the shown input data, it gives as expected:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
Not as clean and neat as Pandas code, but it should process files larger than the available memory without any problem.
CodePudding user response:
You could use the built-in csv library and build up the output line by line. A Counter can be used to combine and sum rows with the same key:
from collections import Counter
import csv

data = Counter()

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for row in csv_input:
        data[tuple(row[:3])] += int(row[3])   # sum values per (ID, Location, Date) key

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for key, value in data.items():
        csv_output.writerow([*key, value])
Giving the output:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
This avoids storing the input CSV in memory; only the output CSV data is held. If that is also too large, a slight variation would be to flush the data whenever the ID column changes. This does assume, though, that the input is sorted by ID:
from collections import Counter
import csv

def write_id(csv_output, data):
    # Flush the accumulated sums for the current ID and reset the counter
    for key, value in data.items():
        csv_output.writerow([*key, value])
    data.clear()

data = Counter()
current_id = None

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    header = next(csv_input)
    csv_output.writerow(header)
    for row in csv_input:
        if current_id and row[0] != current_id:
            write_id(csv_output, data)
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
    write_id(csv_output, data)
For the given example, this would give the same output.
CodePudding user response:
Have you tried:
output = []

for key, group in df.groupby(['ID', 'Location', 'Date']):
    output.append((*key, group['Value'].sum()))

pd.DataFrame(output).to_csv("....csv")
source: https://stackoverflow.com/a/54244289/7132906
CodePudding user response:
There are a number of answers already that may suffice: @MartinEvans and @Corralien both recommend chunking the input/output. I'm especially curious whether @MartinEvans's answer works within your memory constraints: it's the simplest still-correct solution so far (as I see it).
If neither of those works, I think you'll be faced with the question: what makes a chunk such that every ID/Loc/Date group it needs to count is fully contained in it, so no group straddles a chunk boundary and gets counted multiple times (ending up with several partial sums instead of a single, true sum)?
In a comment on the OP you said the input was sorted by "week number". I think that is the single deciding factor for when you have all the counts you'll get for a group of ID/Loc/Date. As the reader crosses a week-group boundary, it knows it's "safe" to stop counting any of the groups encountered so far, and to flush those counts to disk (to avoid holding on to too many counts in memory).
This solution relies on the pre-sortedness of your input CSV. Though, if your input is a bit out of sorts, you could run this, test for duplicate groups, re-sort, and re-run it (I see this problem as one big, memory-constrained reduce):
import csv
from collections import Counter
from datetime import datetime

# Get the data out...
out_csv = open('output.csv', 'w', newline='')
writer = csv.writer(out_csv)

def write_row(row):
    writer.writerow(row)

# Don't let the counter get too big (for memory)
def flush_counter(counter):
    for key, sum_ in counter.items():
        id_, loc, date = key
        write_row([id_, loc, date, sum_])

# You said "already grouped by week-number", so:
# - read and sum your input CSV in chunks of "week (number) groups"
# - once the reader reads past a week-group, it concludes that week-group is
#   finished and flushes its counts

last_wk_group = None
counter = Counter()

# Open input
with open('input.csv', newline='') as f:
    reader = csv.reader(f)

    # Copy header
    header = next(reader)
    write_row(header)

    for row in reader:
        # Get "base" values
        id_, loc, date = row[0:3]
        value = int(row[3])

        # 2022-01-27 -> 2022-04
        wk_group = datetime.strptime(date, r'%Y-%m-%d').strftime(r'%Y-%U')

        # Decide if the last week-group has passed
        if wk_group != last_wk_group:
            flush_counter(counter)
            counter = Counter()
            last_wk_group = wk_group

        # Count/sum this week-group's rows
        key = (id_, loc, date)
        counter[key] += value

# Flush remaining week-group counts
flush_counter(counter)
out_csv.close()
As a basic test, I moved the first row of your sample input to the last row, like @Corralien was asking:
ID,Location,Date,Value
1,Loc1,2022-01-27,5
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,9
3,Loc1,2022-01-28,9
3,Loc2,2022-01-29,1
1,Loc1,2022-01-27,4
and I still get the correct output (even in the correct order, because 1,Loc1,2022-01-27
appeared first in the input):
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
CodePudding user response:
The appropriate answer is probably to use Dask, but you can also do it with Pandas and chunks. The last_row variable holds the last row of the previous chunk, in case the first row of the current chunk has the same ID, Location, and Date.
import pandas as pd

chunksize = 4             # Number of rows per chunk
last_row = pd.DataFrame() # Last row of the previous chunk

with open('data.csv') as reader, open('output.csv', 'w') as writer:
    # Write headers
    writer.write(reader.readline())
    reader.seek(0)

    for chunk in pd.read_csv(reader, chunksize=chunksize):
        df = pd.concat([last_row, chunk])
        df = df.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum()
        df, last_row = df.iloc[:-1], df.iloc[-1:]
        df.to_csv(writer, header=False, index=False)

    # Don't forget the last row!
    last_row.to_csv(writer, header=False, index=False)
Content of output.csv:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1