Python dataset - how to compare two excel files (one from previous day and other from today) to show-CodePudding

I want to compare two excel files that have three columns. One file is from previous day and other file is from today (both have identical column names). How do I compare both to show what has been added in and removed from previous day? Note: I just want to focus on the first column (Stack Name) E.g. if stack name xyz is in 08102022.csv and not in 08112022.csv then I want a table that outputs that and shows me it is a deleted stack. Vice versa if it wasn't there in yesterday's csv and is there today then I want it to output and show me it is a new stack.

newstack_additions = pd.concat([dfprevious,dftoday]).drop_duplicates(subset = ['Stack Name'], keep=False)
print(newstack_additions)
newstack_additions["Stack Change Type"] = "New Stack"
newstack_additions['Last Modified'] = pd.to_datetime(newstack_additions['Last Modified'], format= '%m/%d/%Y-%H:%M:%S')

The above works and shows me only the row that has been added in but logically this is only just returning what doesn't have a duplicate after both dataframes are concatenated. It isn't considering anything as 'new' or 'deleted'.

This is 08102022.csv

                                            Stack Name  ...        Last Modified
0       prod/account/cloudformation/AWSAccountBase.yaml  ...  03/15/2022-02:16:48
1     prod/account/cloudformation/AWSAccountBaseAddi...  ...  03/26/2022-02:57:56
2     prod/account/cloudformation/AWSAccountCloudCus...  ...  03/04/2022-02:14:01
3     prod/account/cloudformation/AWSAccountCloudCus...  ...  09/03/2020-02:11:29
4       prod/account/cloudformation/AWSAccountInfo.yaml  ...  09/03/2020-02:11:29
                                                ...  ...                  ...
3139  prod/v003/util/unix-engineering/SSMAutomationR...  ...  05/16/2020-00:51:32
3140  prod/v003/util/unix-engineering/SSMPetsBSCReme...  ...  05/16/2020-00:51:32
3141  prod/v003/util/unix-engineering/SSMSudoStateMa...  ...  05/16/2020-00:51:32
3142  prod/v003/util/unix-engineering/SudoLambdaDepl...  ...  05/16/2020-00:51:32
3143  prod/v003/util/unix-engineering/linux_bsc_reme...  ...  05/16/2020-00:51:32

[3144 rows x 3 columns]

This is 08112022.csv

           Stack Name  ...        Last Modified
0       prod/account/cloudformation/AWSAccountBase.yaml  ...  03/15/2022-02:16:48
1     prod/account/cloudformation/AWSAccountBaseAddi...  ...  03/26/2022-02:57:56
2     prod/account/cloudformation/AWSAccountCloudCus...  ...  03/04/2022-02:14:01
3     prod/account/cloudformation/AWSAccountCloudCus...  ...  09/03/2020-02:11:29
4       prod/account/cloudformation/AWSAccountInfo.yaml  ...  09/03/2020-02:11:29
                                                ...  ...                  ...
3140  prod/v003/util/unix-engineering/SSMAutomationR...  ...  05/16/2020-00:51:32
3141  prod/v003/util/unix-engineering/SSMPetsBSCReme...  ...  05/16/2020-00:51:32
3142  prod/v003/util/unix-engineering/SSMSudoStateMa...  ...  05/16/2020-00:51:32
3143  prod/v003/util/unix-engineering/SudoLambdaDepl...  ...  05/16/2020-00:51:32
3144  prod/v003/util/unix-engineering/linux_bsc_reme...  ...  05/16/2020-00:51:32

[3145 rows x 3 columns]

and I want (shows that below was not in 08102022.csv and has been added in 08112022.csv):

                                             Stack Name  ...  Stack Change Type
1700  prod/landing-zone/splunk/SplunkDataIngestion.yaml  ...          New Stack

[1 rows x 4 columns]

Similarly, I would like to show what was in 08102022.csv and is no longer in 08112022.csv.

CodePudding user response：

Try this:

df2[~df2.index.isin(df1.index)]

Where df2 is the last one

Other one:

x = pd.concat([df1, df2])
y = x.drop_duplicates(keep=False, inplace=False)

CodePudding user response：

It seems like you need to find a difference between two dfs. Possible answer: Python Pandas - Find difference between two data frames