I have two dataframes, the first is the data I currently have in the database, the second would be a file that might have changed fields:
name
and/or cnpj
and/or create_date
Based on that, I need to create a third dataframe with only the rows that have undergone some kind of change, as in the example of the expected output.
The key to making the comparisons needs to be:
id_account
Dataframe 1:
id_account | name | cnpj | create_date |
---|---|---|---|
10 | Agency Criss | 10203040 | 2022-05-30 |
20 | Agency Angel | 11213141 | 2022-05-30 |
30 | Supermarket Mario Bros | 12223242 | 2022-05-30 |
40 | Agency Mister M | 13233343 | 2022-05-30 |
50 | Supermarket Pokemon | 14243454 | 2022-05-30 |
60 | Supermarket of Dreams | 15253580 | 2022-05-30 |
Dataframe 2:
id_account | name | cnpj | create_date |
---|---|---|---|
10 | Supermarket Carol | 80502030 | 2022-05-30 |
20 | Agency Angel | 11213141 | 2022-05-30 |
30 | Supermarket Mario Bros | 12223242 | 2022-05-30 |
40 | Supermarket Magical | 60304050 | 2022-05-30 |
50 | Supermarket Pokemon | 14243454 | 2022-05-30 |
60 | Supermarket of Dreams | 90804050 | 2022-05-30 |
Expected output:
id_account | name | cnpj | create_date |
---|---|---|---|
10 | Supermarket Carol | 80502030 | 2022-05-30 |
40 | Supermarket Magical | 60304050 | 2022-05-30 |
60 | Supermarket of Dreams | 90804050 | 2022-05-30 |
How can I do this? I've looked for a few ways, but I'm confused by the index.
CodePudding user response:
If the data has same columns, but different number of rows, this is one possible solution:
res = (pd.concat([df1,df2])
.drop_duplicates(keep=False)
.drop_duplicates(subset='id_account', keep='last')
)
Output:
id_account name cnpj create_date
0 10 Supermarket Carol 80502030 2022-05-30
3 40 Supermarket Magical 60304050 2022-05-30
5 60 Supermarket of Dreams 90804050 2022-05-30
CodePudding user response:
If your two dataframes have the same columns and the same number of rows, you can check equate them and use any(axis=1)
to find rows where any columns changes:
new_df = df2[(df1 != df2).any(axis=1)]
Output:
>>> new_df
id_account name cnpj create_date
0 10 Supermarket Carol 80502030 2022-05-30
3 40 Supermarket Magical 60304050 2022-05-30
5 60 Supermarket of Dreams 90804050 2022-05-30