Home > Mobile >  How to compare 2 CSV files and output missing rows when order of rows are not organized
How to compare 2 CSV files and output missing rows when order of rows are not organized

Time:07-21

I have two CSV files: CSV1 CSV2

CSV2 has 32 rows, CSV1 has 25 rows, so I need to know the 7 exact rows that are in CSV2, but not CSV1. Currently, my set() code is removing the duplicates in both CSVs which I need to not happen. Additionally, because the additional rows found in CSV2 cause the order of the rows to not match the 25 rows in CSV1, it automatically outputs all additional rows following the first missing row as missing. Here is my current logic:

import csv
import pandas as pd
import numpy as np

data1 = pd.read_csv('CSV2.csv')
data2 = pd.read_csv('CSV1.csv')

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df = df2[~df2.eq(df1).all(axis=1)]

print(df)

Here is what I want the outcome to be:

Outcome

I need to ensure duplicates are not removed from either file so I can see all 7 rows present in CSV2 that are not in CSV1. Thank you for your help.

CodePudding user response:

See JOINING, it's a good terminology for communication.

Pandas offers similar operation to SQL join, see this.

And answer to your question: How to do "(df1 & not df2)" dataframe merge in pandas?

CodePudding user response:

(Assuming df2 is the larger dataset of which we need the 7 rows)

df = df2[~df2.isin(df1)].dropna().reset_index(drop=True)
print(df)

The dataframe "df" will have the 7 exact rows that are in df2 but not df1

  • Related