I have a very large CSV file of around 35 million lines.
I have a simple Python script that counts the total number of lines:
import time

file_name = 'extract.csv'
start = time.time()
with open(file_name) as f:
    line_count = sum(1 for line in f)
print(line_count)
print(time.time() - start)
The script is very fast and runs without memory problems.
Now I need to count the rows whose value in column Y is greater or less than some value X.
How can I do this without crashing and without slowing down the execution too much?
CodePudding user response:
Assuming your column is called col_y and its values are integers, you can use Python's csv module.
import csv

threshold = 10
file_name = "extract.csv"
with open(file_name, newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    line_count = sum(1 for row in reader if int(row.get("col_y", 0)) > threshold)
print(line_count)
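Note that the bare int() call will raise a ValueError if a row contains a missing or non-numeric value. A minimal sketch that skips such rows instead of crashing (the inline sample data stands in for your extract.csv, and the column name col_y is still an assumption):

```python
import csv
import io

# Hypothetical sample data standing in for extract.csv; "col_y" matches
# the column name assumed above.
sample = io.StringIO(
    "col_x,col_y\n"
    "a,15\n"
    "b,notanumber\n"
    "c,5\n"
)

threshold = 10
above = 0
skipped = 0
reader = csv.DictReader(sample)
for row in reader:
    try:
        val = int(row["col_y"])
    except (ValueError, KeyError):
        skipped += 1  # row has a missing or non-integer value in col_y
        continue
    if val > threshold:
        above += 1

print(above, skipped)
```

Counting the skipped rows separately lets you verify afterwards that you didn't silently drop a large fraction of the file.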
Or, for counting multiple things at once:
import csv

threshold = 10
above = 0
below = 0
equal = 0
file_name = "extract.csv"
with open(file_name, newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        val = int(row.get("col_y", 0))
        if val > threshold:
            above += 1
        elif val < threshold:
            below += 1
        else:
            equal += 1
print(f"{above:,} above")
print(f"{below:,} below")
print(f"{equal:,} equal")
These snippets use a csv.DictReader. Take a look at the docs to get an idea of how to adapt these examples to your needs.
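If DictReader turns out to be too slow at 35 million rows, a plain csv.reader with a precomputed column index avoids building a dict for every row, which is usually noticeably faster. A hedged sketch (count_above is a made-up helper name, and col_y is still the assumed column):

```python
import csv

def count_above(path, column, threshold):
    """Count rows whose integer value in `column` exceeds `threshold`."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        idx = header.index(column)  # locate the target column once, up front
        return sum(1 for row in reader if int(row[idx]) > threshold)
```

Because the file is streamed row by row, memory use stays flat regardless of file size, the same property your line-counting script already has.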