I have to compare two CSV files, drop the duplicate rows, and generate another file from the result.
# Here I'm comparing the CSV files: oldest_file and newest_file
different_data_type = newest_file.equals(other = oldest_file)
# If they differ, I concatenate them and drop the rows that are equal
merged_files = pd.concat([oldest_file, newest_file])
merged_files = merged_files.drop_duplicates()
print(merged_files)
Each CSV file has about 5,000 rows, and when I print merged_files I get a 10,000-row result. In other words, nothing is being dropped.
How can I get only the rows that differ?
CodePudding user response:
I think you need to specify the columns in drop_duplicates(). Try something like:
df.drop_duplicates(subset=['column1', 'column2'])
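As a minimal sketch of what subset and keep change, assuming two small made-up frames with hypothetical columns id and value standing in for the real CSV data: by default drop_duplicates() compares every column and keeps the first copy of each duplicated row, while keep=False discards every copy, which is closer to "only the rows that differ".

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (column names are made up).
oldest_file = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
newest_file = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

merged_files = pd.concat([oldest_file, newest_file], ignore_index=True)

# Default: all columns are compared and the first copy of each duplicate is kept.
deduplicated = merged_files.drop_duplicates()

# keep=False: every copy of a duplicated row is dropped, so only rows that
# occur once remain (assuming neither file repeats a row internally).
only_different = merged_files.drop_duplicates(keep=False)

# subset: rows count as duplicates when just these columns match.
unique_ids = merged_files.drop_duplicates(subset=["id"], keep=False)
```

If rows that look identical are still kept, it may be worth checking whether the two files were read with the same dtypes, since values such as 1 and "1" do not compare as equal.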
Another way is to find the duplicates in your merged file and then remove them from merged_files:
duplicate_rows = merged_files.duplicated(subset=['column1', 'column2'])
merged_files = merged_files[~duplicate_rows]
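A different technique that may fit the question better, sketched here with the same kind of made-up frames (the id and value columns and the output file name are assumptions, not taken from the question): an outer merge with indicator=True labels each row as coming from the left file, the right file, or both, so the rows that differ can be filtered directly.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (column names are made up).
oldest_file = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
newest_file = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

# Outer merge on all shared columns; the _merge column records each row's origin.
compared = oldest_file.merge(newest_file, how="outer", indicator=True)

# Rows present in only one of the two files.
differences = compared[compared["_merge"] != "both"]

# Rows that exist only in the newest file.
only_in_newest = compared.loc[compared["_merge"] == "right_only", ["id", "value"]]

# Writing the result out; "differences.csv" is a hypothetical file name.
# differences.drop(columns="_merge").to_csv("differences.csv", index=False)
```

Unlike the concat approach, this keeps track of which file each differing row came from.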