How to count the changes done in new csv file compared to the previous-CodePudding

We have two csv files - new.csv and old.csv.

old.csv contains with four rows:

    abc done
    xyz done
    pqr done
    rst pending

The new.csv contains four new rows:

    abc pending
    xyz not_done
    pqr pending
    rst done

I need to use count two things without using pandas

count1 = number of entries changed from done to pending = 2 (abc, pqr)
count2 = number of entries changed from done to not_done = 1 (xyz)

CodePudding user response：

CASE 1: CSV Files are in the same order

Firstly import the two files into python lists:

oldcsv = []
with open("old.csv") as f:
    for line in f:
        oldcsv.append(line.strip().split(","))
newcsv = []
with open("new.csv") as f:
    for line in f:
        newcsv.append(line.strip().split(","))

Now you would simply iterate through both lists simultaneously, using zip(). I am assuming that both CSV files list the entries in the same order.

count1 = 0
count2 = 0
for oldentry, newentry in zip(oldcsv, newcsv):
    assert(oldentry[0] == newentry[0]) # Throw error if entry names do not match
    if oldentry[1] == "done":
        if newentry[1] == "pending":
            count1  = 1
        elif newentry[1] == "not_done":
            count2  = 1

CASE 2: CSV Files are in arbitrary order

Here, given you are going to be needing to look up entries by their names, I would use a dictionary rather than a list to store the old.csv data, mapping the entry names to their values:


# Load old.csv data into a dictionary mapping entry_name: entry_value
old_values = {}
with open("old.csv") as f:
    for line in f:
        old_entry = line.strip().split(",")
        entry_name, old_entry_value = old_entry[0], old_entry[1] 
        old_values[entry_name] = old_entry_value

count1 = 0
count2 = 0

with open("new.csv") as f:
    for line in f:
        # For each entry in new_entry, look up the corresponding old entry in old_entries, and compare their values.
        new_entry = line.strip().split(",")
        entry_name, new_entry_value = new_entry[0], new_entry[1]
        old_entry_value = old_values.get(entry_name) # Get the old value for this entry (will be None if there is no old entry)
        # Essentially same code as before:
        print(f"{entry_name}: old entry status is {old_entry_value} and new entry status is {new_entry_value}")
        if old_entry_value == "done":
            if new_entry_value == "pending":
                print("Incrementing count1")
                count1  = 1
            elif new_entry_value == "not_done":
                print("Incrementing count2")
                count2  = 1
print(count1)
print(count2)

This should work, as long as the input data is properly formatted. I am assuming each .csv file has one entry per line, and each line begins with the entry name (e.g. "abc"), then a comma, then the entry value (e.g. "done","not_done").

CodePudding user response：

Here is a pure python straightforward implementation:

import csv

with open("old.csv") as old_fl:
    with open("new.csv") as new_fl:
        old = csv.reader(old_fl)
        new = csv.reader(new_fl)
        
        old_rows = [row for row in old]
        new_rows = [row for row in new]

        # see if this is really needed
        assert len(old_rows) == len(new_rows)
        n = len(old_rows)

        # assume that left key is identical,
        # and in the same order in both files
        assert all(old_rows[i][0] == new_rows[i][0] for i in range(n))

        # once the data is guaranteed to align,
        # just count what you want
        done_to_pending = [
            f"row[{i}]( {old_rows[i][0]} )"
            for i in range(n)
            if old_rows[i][1] == "done" and new_rows[i][1] == "pending"
        ]

        done_to_notdone = [
            f"row[{i}]( {old_rows[i][0]} )"
            for i in range(n)
            if old_rows[i][1] == "done" and new_rows[i][1] == "not_done"
        ]

It uses the python native csv reader so you don't need to parse the csv yourself. Note that there are various assumptions (assert statements) throughout the code - you might need to adjust the code to handle more cases.