Read CSV group by 1 column and apply sum, without pandas

Time:10-23

As the title says, I would like to read a CSV, group it by one column, apply a sum, and then replace the old CSV with the new values, using as few libraries as possible (and avoiding pandas). I have come this far:

from csv import reader, writer

index = {}
with open('event.csv') as f:
    cr = reader(f)
    for row in cr:
        index.setdefault(row[0], []).append(int(row[1]))
f.close()

with open('event.csv', 'w', newline='\n') as csv_file:
    writer = writer(csv_file)
    for key, value in index.items():
        writer.writerow([key, value[0]])
csv_file.close()

But this way I only end up writing the first value, not the sum, and I also have to open the file twice, which doesn't seem smart to me. Here is a CSV similar to event.csv:

work1,100
work2,200
work3,200
work1,50
work3,20

Desired output:

work1,150
work2,200
work3,220

CodePudding user response:

You're actually very close. Just sum the values as you rewrite the file. Note that when using with on a file, you don't have to close it explicitly; that is done for you automatically. Also note that CSV files should be opened with newline='' (for both reading and writing), as per the csv module documentation.

import csv


index = {}

with open('event.csv', newline='') as f:
    cr = csv.reader(f)
    for row in cr:
        index.setdefault(row[0], []).append(int(row[1]))

with open('event2.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for key, values in index.items():
        value = sum(values)
        writer.writerow([key, value])

print('-fini-')
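The same logic can be checked without touching the filesystem by feeding the question's sample data through an in-memory buffer (the StringIO stand-in and the names sample/totals are mine, not part of the answer above):

```python
import csv
import io

# In-memory stand-in for event.csv, using the sample data from the question
sample = "work1,100\nwork2,200\nwork3,200\nwork1,50\nwork3,20\n"

index = {}
for row in csv.reader(io.StringIO(sample)):
    index.setdefault(row[0], []).append(int(row[1]))

# Collapse each list of values into its sum
totals = {key: sum(values) for key, values in index.items()}
print(totals)  # {'work1': 150, 'work2': 200, 'work3': 220}
```

This matches the desired output in the question: 150 for work1, 200 for work2, 220 for work3.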

CodePudding user response:

Here's another way to think of it.

Instead of storing arrays of ints during reading and then "compressing" them into the desired value during writing, show up-front that you're summing something during the read:

import csv

from collections import defaultdict

summed_work = defaultdict(int)

with open('event_input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        work_id = row[0]
        work_value = int(row[1])
        summed_work[work_id] += work_value

with open('event_processed.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for work_id, summed_value in summed_work.items():
        writer.writerow([work_id, summed_value])

This is functionally equivalent to what you were aiming for and what martineau helped you with, but, I argue, shows you and your reader sooner and more clearly what the intent is.

It technically uses one more import, defaultdict, but that comes from the standard library, and I'm not sure how much weight you're placing on the number of libraries used.
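For what it's worth, collections.Counter (also standard library) expresses the same accumulation; here is a minimal in-memory sketch using the question's sample data (the buffer and variable names are mine):

```python
import csv
import io
from collections import Counter

rows = "work1,100\nwork2,200\nwork3,200\nwork1,50\nwork3,20\n"

# Counter behaves like defaultdict(int) for this purpose
totals = Counter()
for work_id, value in csv.reader(io.StringIO(rows)):
    totals[work_id] += int(value)

print(dict(totals))  # {'work1': 150, 'work2': 200, 'work3': 220}
```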

CodePudding user response:

Another simplification of solutions already shown, without additional libraries:

import csv

index = {}

with open('event.csv', newline='') as f:
    cr = csv.reader(f)
    for item,value in cr:
        index[item] = index.get(item, 0) + int(value)  # sum as you go

with open('event2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(index.items())  # write all the items in one shot

print('-fini-')
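writerows accepts any iterable of rows, so dict.items() works directly. A small sketch writing to an in-memory buffer (names are mine) also shows why the files are opened with newline='': csv.writer terminates rows with \r\n by default, and newline='' stops the file object from translating those line endings again.

```python
import csv
import io

index = {'work1': 150, 'work2': 200, 'work3': 220}  # totals from the sample data

buf = io.StringIO()
csv.writer(buf).writerows(index.items())  # one shot, no explicit loop

print(repr(buf.getvalue()))  # 'work1,150\r\nwork2,200\r\nwork3,220\r\n'
```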

CodePudding user response:

With an additional library: convtools, which provides a lot of functionality so you don't have to write the same code by hand every time.

from convtools import conversion as c
from convtools.contrib.tables import Table


rows = Table.from_csv("event.csv", header=False).into_iter_rows(list)
converter = (
    c.group_by(c.item(0))
    .aggregate(
        (
            c.item(0),
            c.ReduceFuncs.Sum(c.item(1).as_type(int)),
        )
    )
    .gen_converter()
)
processed_rows = converter(rows)
Table.from_rows(processed_rows, header=False).into_csv(
    "event2.csv", include_header=False
)