Diff two CSVs by specific columns, output matching rows-CodePudding

I am comparing two CSV files. I need to know which rows match by comparing specific columns. The output needs to be the rows that match.

Data:

CSV 1:

name, age, occupation
Alice,51,accountant
John,23,driver
Peter,32,plumber
Jose,50,doctor

CSV 2:

name, age, occupation
Alice,51,dentist
Ted,43,carpenter
Peter,32,plumber
Jose,50,doctor

Desired Ouput:

Rather than returning a boolean, I would like to output the matching rows when only comparing the name and age columns:

Alice,51,accountant 
Alice,51,dentist 
Peter,32,plumber
Jose,50,doctor

Code:

I am comparing the two CSVs to see if the columns in the 'columns_to_compare_test' list match. I return a boolean, true or false.

# read two csv files into dataframes
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')

# compare only the name and age columns
columns_to_compare_test = ['name', 'age']

# print true or false
print(df1[columns_to_compare_test].equals(df2[columns_to_compare_test]))

# output: false

Thank you.

CodePudding user response：

I'd suggest the following:

import pandas as pd

# load csv
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')

# look for matching rows
filter = ['name', 'age']
filter = df1[filter].eq(df2[filter]).all(axis=1)
df1 = df1[filter].append(df2[filter])

# remove duplicates e.g. Jose
df1 = df1.drop_duplicates()

print(df1.head())

You are using equals to check if the row is in the other df on an element basis. all() will make sure, that both age and name match.

You can finally use the resulting series to select the relevant rows in both dfs and append them. This will result in the following output:

    name  age  occupation
0  Alice   51  accountant
2  Peter   32     plumber
3   Jose   50      doctor
0  Alice   51     dentist

Please let me know, if you wanted to achieve a different effect.