Home > OS >  Compare rows of two dataframes in pandas
Compare rows of two dataframes in pandas

Time:05-31

I have two dataframes, the first is the data I currently have in the database, the second would be a file that might have changed fields: name and/or cnpj and/or create_date

Based on that, I need to create a third dataframe with only the rows that have undergone some kind of change, as in the example of the expected output.

The key to making the comparisons needs to be: id_account

Dataframe 1:

id_account name cnpj create_date
10 Agency Criss 10203040 2022-05-30
20 Agency Angel 11213141 2022-05-30
30 Supermarket Mario Bros 12223242 2022-05-30
40 Agency Mister M 13233343 2022-05-30
50 Supermarket Pokemon 14243454 2022-05-30
60 Supermarket of Dreams 15253580 2022-05-30

Dataframe 2:

id_account name cnpj create_date
10 Supermarket Carol 80502030 2022-05-30
20 Agency Angel 11213141 2022-05-30
30 Supermarket Mario Bros 12223242 2022-05-30
40 Supermarket Magical 60304050 2022-05-30
50 Supermarket Pokemon 14243454 2022-05-30
60 Supermarket of Dreams 90804050 2022-05-30

Expected output:

id_account name cnpj create_date
10 Supermarket Carol 80502030 2022-05-30
40 Supermarket Magical 60304050 2022-05-30
60 Supermarket of Dreams 90804050 2022-05-30

How can I do this? I've looked for a few ways, but I'm confused by the index.

CodePudding user response:

If the data has same columns, but different number of rows, this is one possible solution:

res = (pd.concat([df1,df2])
       .drop_duplicates(keep=False)
       .drop_duplicates(subset='id_account', keep='last')
      )

Output:

   id_account                   name      cnpj create_date
0          10      Supermarket Carol  80502030  2022-05-30
3          40    Supermarket Magical  60304050  2022-05-30
5          60  Supermarket of Dreams  90804050  2022-05-30

CodePudding user response:

If your two dataframes have the same columns and the same number of rows, you can check equate them and use any(axis=1) to find rows where any columns changes:

new_df = df2[(df1 != df2).any(axis=1)]

Output:

>>> new_df
   id_account                   name      cnpj create_date
0          10      Supermarket Carol  80502030  2022-05-30
3          40    Supermarket Magical  60304050  2022-05-30
5          60  Supermarket of Dreams  90804050  2022-05-30
  • Related