I want to compare two excel files that have three columns. One file is from previous day and other file is from today (both have identical column names). How do I compare both to show what has been added in and removed from previous day? Note: I just want to focus on the first column (Stack Name) E.g. if stack name xyz is in 08102022.csv and not in 08112022.csv then I want a table that outputs that and shows me it is a deleted stack. Vice versa if it wasn't there in yesterday's csv and is there today then I want it to output and show me it is a new stack.
newstack_additions = pd.concat([dfprevious,dftoday]).drop_duplicates(subset = ['Stack Name'], keep=False)
print(newstack_additions)
newstack_additions["Stack Change Type"] = "New Stack"
newstack_additions['Last Modified'] = pd.to_datetime(newstack_additions['Last Modified'], format= '%m/%d/%Y-%H:%M:%S')
The above works and shows me only the row that has been added in but logically this is only just returning what doesn't have a duplicate after both dataframes are concatenated. It isn't considering anything as 'new' or 'deleted'.
This is 08102022.csv
Stack Name ... Last Modified
0 prod/account/cloudformation/AWSAccountBase.yaml ... 03/15/2022-02:16:48
1 prod/account/cloudformation/AWSAccountBaseAddi... ... 03/26/2022-02:57:56
2 prod/account/cloudformation/AWSAccountCloudCus... ... 03/04/2022-02:14:01
3 prod/account/cloudformation/AWSAccountCloudCus... ... 09/03/2020-02:11:29
4 prod/account/cloudformation/AWSAccountInfo.yaml ... 09/03/2020-02:11:29
... ... ...
3139 prod/v003/util/unix-engineering/SSMAutomationR... ... 05/16/2020-00:51:32
3140 prod/v003/util/unix-engineering/SSMPetsBSCReme... ... 05/16/2020-00:51:32
3141 prod/v003/util/unix-engineering/SSMSudoStateMa... ... 05/16/2020-00:51:32
3142 prod/v003/util/unix-engineering/SudoLambdaDepl... ... 05/16/2020-00:51:32
3143 prod/v003/util/unix-engineering/linux_bsc_reme... ... 05/16/2020-00:51:32
[3144 rows x 3 columns]
This is 08112022.csv
Stack Name ... Last Modified
0 prod/account/cloudformation/AWSAccountBase.yaml ... 03/15/2022-02:16:48
1 prod/account/cloudformation/AWSAccountBaseAddi... ... 03/26/2022-02:57:56
2 prod/account/cloudformation/AWSAccountCloudCus... ... 03/04/2022-02:14:01
3 prod/account/cloudformation/AWSAccountCloudCus... ... 09/03/2020-02:11:29
4 prod/account/cloudformation/AWSAccountInfo.yaml ... 09/03/2020-02:11:29
... ... ...
3140 prod/v003/util/unix-engineering/SSMAutomationR... ... 05/16/2020-00:51:32
3141 prod/v003/util/unix-engineering/SSMPetsBSCReme... ... 05/16/2020-00:51:32
3142 prod/v003/util/unix-engineering/SSMSudoStateMa... ... 05/16/2020-00:51:32
3143 prod/v003/util/unix-engineering/SudoLambdaDepl... ... 05/16/2020-00:51:32
3144 prod/v003/util/unix-engineering/linux_bsc_reme... ... 05/16/2020-00:51:32
[3145 rows x 3 columns]
and I want (shows that below was not in 08102022.csv and has been added in 08112022.csv):
Stack Name ... Stack Change Type
1700 prod/landing-zone/splunk/SplunkDataIngestion.yaml ... New Stack
[1 rows x 4 columns]
Similarly, I would like to show what was in 08102022.csv and is no longer in 08112022.csv.
CodePudding user response:
Try this:
df2[~df2.index.isin(df1.index)]
Where df2 is the last one
Other one:
x = pd.concat([df1, df2])
y = x.drop_duplicates(keep=False, inplace=False)
CodePudding user response:
It seems like you need to find a difference between two dfs. Possible answer: Python Pandas - Find difference between two data frames