I'm developing a script in Python/Pandas to compare the contents of two dataframes.
Both dataframes contain any combination of columns from a fixed list, for instance:
"Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"
A combination of columns is used as a unique key, but some of those key columns may sometimes be missing. Also, both dataframes always contain (and miss) the same columns, to avoid extra complexity.
So, I want to retrieve a row from one dataframe given a key I obtain from the other dataframe (I'm certain the key matches a single row in each dataframe, so that's not an issue here either).
For example, in this case the pair "Case Name", "MAC" is my key:
- Dataframe 1:

| "Case Name" | "MAC"             | "Machine Name" | "OS"  | "Exec Time" | "RSS" |
|-------------|-------------------|----------------|-------|-------------|-------|
| Case1       | FB:E8:99:88:AC:DE | Linux1         | Linux | 60          | 1000  |
- DataFrame 2:

| "Case Name" | "MAC"             | "Machine Name" | "OS"    | "Exec Time" | "RSS" |
|-------------|-------------------|----------------|---------|-------------|-------|
| Case1       | FB:E8:99:88:AC:DE | Windows1       | Windows | 80          | 500   |
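For reproducibility, here is a minimal way to build the two input frames above (the values are copied verbatim from the tables; in the script further down they are called dataFrame1 and dataFrame2):

import pandas as pd

# Literal values copied from the tables above
df1 = pd.DataFrame({
    "Case Name": ["Case1"],
    "MAC": ["FB:E8:99:88:AC:DE"],
    "Machine Name": ["Linux1"],
    "OS": ["Linux"],
    "Exec Time": [60],
    "RSS": [1000],
})

df2 = pd.DataFrame({
    "Case Name": ["Case1"],
    "MAC": ["FB:E8:99:88:AC:DE"],
    "Machine Name": ["Windows1"],
    "OS": ["Windows"],
    "Exec Time": [80],
    "RSS": [500],
})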
Based on these dataframes, I want to generate another one like this:
"Case Name" | "MAC" | "Machine Name 1" | "Machine Name 2" | "OS 1" | "OS 2" | "Exec Time 1" | "Exec Time 2" | "RSS 1" | "RSS 2"
------------ ------------------- ------------------ ------------------ ----------- ----------- --------------- --------------- --------- --------
Case1 | FB:E8:99:88:AC:DE | Linux1 | Windows1 | Linux | Windows | 60 | 80 | 1000 | 500
However, in certain cases, some of those "key" columns might be missing, in which case the dataframes will look like:
- Dataframe 1:

| "Case Name" | "Machine Name" | "OS"  | "Exec Time" | "RSS" |
|-------------|----------------|-------|-------------|-------|
| Case1       | Linux1         | Linux | 60          | 1000  |

- DataFrame 2:

| "Case Name" | "Machine Name" | "OS"    | "Exec Time" | "RSS" |
|-------------|----------------|---------|-------------|-------|
| Case1       | Windows1       | Windows | 80          | 500   |
As you can see, the "MAC" column is missing, in which case I'm certain (this is not an issue here either) that "Case Name" is a good enough unique key.
So, to build the combined data frame I tried something like this:
for index1, data1 in dataFrame1.iterrows():
    caseName = data1['Case Name']
    try:
        macAddr = data1['MAC']
    except:
        macAddr = None
    # Let's see if pd.isnull() works fine when no MAC column exists
    if pd.isnull(macAddr):
        print("No MAC column data detected")
    else:
        print("MAC column data detected")
    # The rest of the data from dataFrame1
    machineName1 = data1['Machine Name']
    os1 = data1['OS']
    # etc., etc.
    # Then try to locate the equivalent data in the other dataframe:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
    machineName2 = data2['Machine Name']
    os2 = data2['OS']
    # etc., etc.
As a C-based guy (and a very beginner at Python), I'd expect the expression to stop being evaluated once a True condition is reached, in this case pd.isnull(macAddr), thereby skipping the part that will certainly trigger an error, (dataFrame2['MAC'] == macAddr), since that column is missing (see the short standalone snippet after the traceback below for the kind of short-circuiting I mean). According to this, I'd expect exactly that; however, it doesn't happen in my case, and when I run it, my script returns:
caseName = testCase
No MAC column data detected -> So pd.isnull() works fine!!!
Traceback (most recent call last):
File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'MAC'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "~/compare_dataframes.py", line 135, in <module>
main()
File "~/compare_dataframes.py", line 79, in main
data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
File "~/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3458, in __getitem__
indexer = self.columns.get_loc(key)
File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'MAC'
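For reference, this is the kind of short-circuit behaviour I had in mind; a standalone snippet with plain Python booleans (would_raise is just a made-up helper, not part of my script):

def would_raise():
    raise KeyError("never reached")

# Plain Python `or` short-circuits: once the left-hand side is True,
# the right-hand side is never evaluated, so no exception is raised.
print(True or would_raise())   # prints: True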
Now, I can change this to a series of nested if conditions:
if pd.isnull(macAddr):
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
else:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (dataFrame2['MAC'] == macAddr), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
But this is impractical, since it turns into 2^n if branches: what happens if I add a new column in the future?
So, my question is: what is wrong with that condition? I added as many parentheses as possible, but to no effect.
I'm using Python 3.8 and Pandas 1.3.4.
Thanks a lot for your help.
CodePudding user response:
I think you might want to try a simple merge and, depending on whether MAC is present, add it to the merge fields, no?
merge_cols = ["Case Name"]
if "MAC" in df1.columns:
    merge_cols.append("MAC")
print(merge_cols)

result = df1.merge(df2, on=merge_cols, how='left')
print(result)
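If you also want the non-key columns renamed the way they appear in your expected output, merge accepts a suffixes argument; something like this (the " 1"/" 2" suffixes are simply taken from your example):

result = df1.merge(df2, on=merge_cols, how='left', suffixes=(' 1', ' 2'))
print(result)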
Does that answer your question?
CodePudding user response:
Try it this way: take the 1st dataframe and merge it with the 2nd.
Example:

Merged_df1 = df1.merge(df2, how="outer", on=["Case Name"])
Merged_df2 = df1.merge(df2, how="outer", on=["MAC"])
Append these 2 dataframes:

appended_df = Merged_df1.append(Merged_df2)
Then drop duplicates:

appended_df.drop_duplicates(subset=["Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"])

Note: in drop_duplicates, list all the column names that are present in appended_df.
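As an aside (not required on Pandas 1.3.4, where append still works): DataFrame.append is deprecated in newer pandas releases, and pd.concat does the same job, so the append step could also be written as:

import pandas as pd

# Equivalent to Merged_df1.append(Merged_df2), but future-proof
appended_df = pd.concat([Merged_df1, Merged_df2])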