How to find rows that don't exist in another CSV file using Python 3.7.5

I have a file ua.csv with 2 rows and another file pr.csv with 4 rows. I would like to find the rows that are present in pr.csv but not in ua.csv. The output should also include the count of extra rows in pr.csv.

ua.csv

Name|Address|City|Country|Pincode
Jim Smith|123 Any Street|Boston|US|02134 
Jane Lee|248 Another St.|Boston|US|02130

pr.csv

Name|Address|City|Country|Pincode
Jim Smith|123 Any Street|Boston|US|02134 
Smoet|coffee shop|finland|Europe|3453335
Jane Lee|248 Another St.|Boston|US|02130 
Jack|long street|malasiya|Asia|585858

Below is the expected output:

pr.csv has 2 rows extra

Name|Address|City|Country|Pincode
Smoet|coffee shop|finland|Europe|3453335
Jack|long street|malasiya|Asia|585858

CodePudding user response：

I guess you could use the set data structure:

ua_set = set()
pr_set = set()

# Code to populate the sets reading the csv files (use the `add` method of sets)
...

# Find the difference
diff = pr_set.difference(ua_set)

print(f"pr.csv has {len(diff)} rows extra")

# It would be better to not hardcode the name of the columns in the output 
# but getting the info depends on the package you use to read csv files
print("Name|Address|City|Country|Pincode")  

for row in diff:
    print(row)
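
The placeholder for populating the sets could be filled in with the standard csv module. The following is only a sketch under a few assumptions: the files are pipe-delimited as in the question, the first line is a header, and surrounding whitespace in a field should not count as a difference. The helper names read_rows and extra_rows are made up for illustration.

```python
import csv


def read_rows(path, delimiter="|"):
    """Return (header, rows); every field is stripped of surrounding whitespace."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # next(reader, []) yields an empty header for an empty file.
        header = [field.strip() for field in next(reader, [])]
        # Rows become tuples so they are hashable; blank lines are skipped.
        rows = [tuple(field.strip() for field in row) for row in reader if row]
    return header, rows


def extra_rows(base_path, other_path, delimiter="|"):
    """Return the header of other_path and its rows that are absent from base_path."""
    _, base_rows = read_rows(base_path, delimiter)
    header, other_rows = read_rows(other_path, delimiter)
    base_set = set(base_rows)
    # Filtering a list keeps the row order of other_path,
    # which set.difference would not guarantee.
    return header, [row for row in other_rows if row not in base_set]
```

Calling extra_rows("ua.csv", "pr.csv") on the sample files should return the shared header and the two rows for Smoet and Jack, in the order they appear in pr.csv.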

A better solution using the pandas module:

import pandas as pd

df_ua = pd.read_csv("ua.csv", sep="|") # Must modify path to ua.csv
df_pr = pd.read_csv("pr.csv", sep="|") # Must modify path to pr.csv

df_diff = df_pr.merge(df_ua, how="outer", indicator=True).loc[lambda x: x["_merge"] == "left_only"].drop("_merge", axis=1)

print(f"pr.csv has {len(df_diff)} rows extra")

print(df_diff)

CodePudding user response：

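import csv

# Collect every row of ua.csv as a tuple so it can be stored in a set.
ua_rows = set()
with open('ua.csv', newline='') as ua:
    # The sample files are pipe-delimited, not comma-delimited.
    for row in csv.reader(ua, delimiter='|'):
        ua_rows.add(tuple(row))

# Keep the rows of pr.csv that never appear in ua.csv.
# The header line is identical in both files, so it is filtered out as well.
output = []
with open('pr.csv', newline='') as pr:
    for row in csv.reader(pr, delimiter='|'):
        if tuple(row) not in ua_rows:
            output.append(row)

print(output)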
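
The code above prints a raw Python list, while the question asks for a count line followed by pipe-delimited rows. One possible way to produce that layout is sketched below. The function name format_report is hypothetical, the sample values are copied from the question, and the simple join assumes that no field contains the delimiter itself.

```python
def format_report(file_name, header, rows, delimiter="|"):
    """Build the report text: a count line, a blank line, the header, then the rows."""
    lines = [f"{file_name} has {len(rows)} rows extra", "", delimiter.join(header)]
    lines.extend(delimiter.join(row) for row in rows)
    return "\n".join(lines)


# Sample values taken from the question; in practice they would come
# from the comparison step.
header = ["Name", "Address", "City", "Country", "Pincode"]
output = [
    ["Smoet", "coffee shop", "finland", "Europe", "3453335"],
    ["Jack", "long street", "malasiya", "Asia", "585858"],
]
print(format_report("pr.csv", header, output))
```

If fields may contain the delimiter or quotes, csv.writer with delimiter='|' would be a safer way to write the rows.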