I am comparing two CSV files. I need to know which rows match by comparing specific columns. The output needs to be the rows that match.
Data:
CSV 1:
name, age, occupation
Alice,51,accountant
John,23,driver
Peter,32,plumber
Jose,50,doctor
CSV 2:
name, age, occupation
Alice,51,dentist
Ted,43,carpenter
Peter,32,plumber
Jose,50,doctor
Desired Ouput:
Rather than returning a boolean, I would like to output the matching rows when only comparing the name and age columns:
Alice,51,accountant
Alice,51,dentist
Peter,32,plumber
Jose,50,doctor
Code:
I am comparing the two CSVs to see if the columns in the 'columns_to_compare_test' list match. I return a boolean, true or false.
# read two csv files into dataframes
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')
# compare only the name and age columns
columns_to_compare_test = ['name', 'age']
# print true or false
print(df1[columns_to_compare_test].equals(df2[columns_to_compare_test]))
# output: false
Thank you.
CodePudding user response:
I'd suggest the following:
import pandas as pd
# load csv
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')
# look for matching rows
filter = ['name', 'age']
filter = df1[filter].eq(df2[filter]).all(axis=1)
df1 = df1[filter].append(df2[filter])
# remove duplicates e.g. Jose
df1 = df1.drop_duplicates()
print(df1.head())
You are using equals to check if the row is in the other df on an element basis. all()
will make sure, that both age
and name
match.
You can finally use the resulting series to select the relevant rows in both dfs and append them. This will result in the following output:
name age occupation
0 Alice 51 accountant
2 Peter 32 plumber
3 Jose 50 doctor
0 Alice 51 dentist
Please let me know, if you wanted to achieve a different effect.