First of all, I know similar questions have been asked in the past, but none of them apply to my use case, not I know Pandas enough to get there on my own. So, I have 2 spreadsheets (xlsx), let's call them file 1 and file 2
File1:
ID | IP | Name | Comments | OS | Title | Status |
---|---|---|---|---|---|---|
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Active | |
2 | 127.0.0.1 |
Something else | Windows 3.11 | Title 1 | Fixed | |
3 | 127.0.0.1 |
New one | Windows 3.11 | Title 1 | Fixed | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Active |
File2:
ID | IP | Name | Comments | OS | Title | Status |
---|---|---|---|---|---|---|
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Fixed | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Fixed | |
5 | 127.0.0.1 |
Uh-oh | Windows 3.11 | Title 1 | Active |
I'm merging these 2 files in Python, with Pandas:
df1 = pd.read_excel('file1.xlsx')
df2 = pd.read_excel('file2.xlsx')
#I just need a column with the file name, hence these variables
file1 = "file1.xlsx"
file2 = "file2.xlsx"
df1 = df1.sort_values('ID')
#Adds a new column with the file name
df1["File"] = file1
df2 = df2.sort_values('ID')
#Adds a new column with the file name
df2["File"] = file2
dfs = [df1, df2]
#Black magic to merge the 2 files into 1
df_diff = pd.concat(dfs).drop_duplicates(keep=False, subset=['IP', 'Name', 'Comments', 'ID', 'Title', 'Status'])
df_diff = df_diff.drop(df_diff[])
df_diff.to_excel(r'file3.xlsx')
And this is the resulting file:
ID | IP | Name | Comments | OS | Title | Status | File |
---|---|---|---|---|---|---|---|
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Active | File 1 | |
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Fixed | File 2 | |
2 | 127.0.0.1 |
Something else | Windows 3.11 | Title 1 | Fixed | File 1 | |
3 | 127.0.0.1 |
New one | Windows 3.11 | Title 1 | Fixed | File 1 | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Active | File 1 | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Fixed | File 2 | |
5 | 127.0.0.1 |
Uh-oh | Windows 3.11 | Title 1 | Active | File 2 |
As you can see, in the new file there is a new column indicating what file that row comes from (row File). Now, I would like Pandas to check for the following condition:
any row present in file 1 only, with the Status set to fixed.
I would like to drop from file 3 any rows matching the above condition, so that I end up with the following table:
ID | IP | Name | Comments | OS | Title | Status | File |
---|---|---|---|---|---|---|---|
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Active | File 1 | |
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Fixed | File 2 | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Active | File 1 | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Fixed | File 2 | |
5 | 127.0.0.1 |
Uh-oh | Windows 3.11 | Title 1 | Active | File 2 |
How could I achieve this? As an addendum, I would also like the following condition to be met:
any row present in both file, with the only difference being in file 1 Status is Active and in file 2 Status is Fixed
CodePudding user response:
The first condition is actually pretty simple but a bit unintuitive. Selecting based on multiple criteria works pretty much the same as with a single criterium, but you just add a binary & and put every criterium between parentheses:
df[ (df['File'] == 'File 1') & (df['Status'] == 'Fixed') ]
It has to be '&' and not 'and'. The second condition can be met in a similar way:
IDs_file1_active = df[ (df['File'] == 'File 1') & (df['Status'] == 'Active') ]['ID']
selects the ID of every Active entry from file 1. Now you can selects the rows with their IDs from the original table:
file1_active = df[ df['ID'].isin(IDs_file1_active) ]
and apply the second condition:
file1_active [ (file1_active ['File'] == 'File 2') & (file1_active ['Status'] == 'Fixed') ]
This should do what you're asking.
CodePudding user response:
Use conditionals for selecting. To exclude use tilde
cond=(df['File']=='File 1') & (df['Status'] =='Fixed')
df[~cond]
CodePudding user response:
Let's suppose your last dataset is call de df
So, you can try the following code
mask = (df["File"]=="File 1") & (df["Status"]=="Fixed")
df = df[~mask]
df
ID | IP | Name | Comments | OS | Title | Status | File |
---|---|---|---|---|---|---|---|
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Active | File 1 | |
1 | 127.0.0.1 |
Something | Windows 3.11 | Title 1 | Fixed | File 2 | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Active | File 1 | |
4 | 127.0.0.1 |
Yet another one | Windows 3.11 | Title 1 | Fixed | File 2 | |
5 | 127.0.0.1 |
Uh-oh | Windows 3.11 | Title 1 | Active | File 2 |