I have a very large CSV file of around 35 million lines.
I have a simple Python script that counts the total number of lines:
import time

file_name = 'extract.csv'
start = time.time()
with open(file_name) as f:
    line_count = sum(1 for line in f)
print(line_count)
print(time.time() - start)
The script is very fast and runs without memory problems.
Now I need to count the rows whose value in column Y is greater or less than some value X.
How can I do this without crashing and without slowing down the execution too much?
CodePudding user response:
Assuming your column is called col_y and its values are integers, you can use Python's csv module.
import csv

threshold = 10
file_name = "extract.csv"
with open(file_name, newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    line_count = sum(1 for row in reader if int(row.get("col_y", 0)) > threshold)
print(line_count)
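Note that the bare int() call will raise a ValueError if a row contains a missing or non-numeric value. A minimal sketch that skips such rows instead of crashing (the inline sample data stands in for your extract.csv, and the column name col_y is still an assumption):

```python
import csv
import io

# Hypothetical sample data standing in for extract.csv; "col_y" matches
# the column name assumed above.
sample = io.StringIO(
    "col_x,col_y\n"
    "a,15\n"
    "b,notanumber\n"
    "c,5\n"
)

threshold = 10
above = 0
skipped = 0
reader = csv.DictReader(sample)
for row in reader:
    try:
        val = int(row["col_y"])
    except (ValueError, KeyError):
        skipped += 1  # row has a missing or non-integer value in col_y
        continue
    if val > threshold:
        above += 1

print(above, skipped)
```

Counting the skipped rows separately lets you verify afterwards that you didn't silently drop a large fraction of the file.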
Or, for counting multiple things at once:
import csv

threshold = 10
above = 0
below = 0
equal = 0
file_name = "extract.csv"
with open(file_name, newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        val = int(row.get("col_y", 0))
        if val > threshold:
            above += 1
        elif val < threshold:
            below += 1
        else:
            equal += 1
print(f"{above:,} above")
print(f"{below:,} below")
print(f"{equal:,} equal")
These snippets use a csv.DictReader. Take a look at the docs to get an idea of how to adapt these examples to your needs.
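If DictReader turns out to be too slow at 35 million rows, a plain csv.reader with a precomputed column index avoids building a dict for every row, which is usually noticeably faster. A hedged sketch (count_above is a made-up helper name, and col_y is still the assumed column):

```python
import csv

def count_above(path, column, threshold):
    """Count rows whose integer value in `column` exceeds `threshold`."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        idx = header.index(column)  # locate the target column once, up front
        return sum(1 for row in reader if int(row[idx]) > threshold)
```

Because the file is streamed row by row, memory use stays flat regardless of file size, the same property your line-counting script already has.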