Querying data frames in Python/Pandas when columns are optional or missing


I'm developing a script in Python/Pandas to compare the contents of two dataframes.

Both dataframes contain any combination of columns from a fixed list, for instance:

"Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"

A combination of columns is used as a unique key, but some of those columns may be missing at times. Also, both dataframes always contain (and lack) the same columns (to avoid extra complexity).

So, I want to retrieve a row from one dataframe given a key I obtain from the other dataframe (I'm certain the key matches a single row in each dataframe, that's not an issue in this case either).

For example, in this case, the pair ("Case Name", "MAC") is my key:

  • Dataframe 1:
"Case Name" | "MAC"             |"Machine Name" | "OS"    | "Exec Time" | "RSS"
------------ ------------------- --------------- --------- ------------- ------
      Case1 | FB:E8:99:88:AC:DE |        Linux1 |   Linux |          60 |  1000 
  • DataFrame 2
"Case Name" | "MAC"             |"Machine Name" | "OS"    | "Exec Time" | "RSS"
------------ ------------------- --------------- --------- ------------- ------
      Case1 | FB:E8:99:88:AC:DE |      Windows1 | Windows |          80 |   500 

Based on these dataframes, I want to generate another one like this:

"Case Name" | "MAC"             | "Machine Name 1" | "Machine Name 2" | "OS 1"    | "OS 2"    | "Exec Time 1" | "Exec Time 2" | "RSS 1" | "RSS 2" 
------------ ------------------- ------------------ ------------------ ----------- ----------- --------------- --------------- --------- --------
      Case1 | FB:E8:99:88:AC:DE |           Linux1 |         Windows1 |     Linux |   Windows |            60 |            80 |    1000 |     500 
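For reference, the two input dataframes above can be built like this (values copied from the tables):

```python
import pandas as pd

df1 = pd.DataFrame([{
    "Case Name": "Case1", "MAC": "FB:E8:99:88:AC:DE",
    "Machine Name": "Linux1", "OS": "Linux",
    "Exec Time": 60, "RSS": 1000,
}])
df2 = pd.DataFrame([{
    "Case Name": "Case1", "MAC": "FB:E8:99:88:AC:DE",
    "Machine Name": "Windows1", "OS": "Windows",
    "Exec Time": 80, "RSS": 500,
}])
```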

However, in certain cases, some of those "key" columns might be missing, in which case the dataframes will look like:

Dataframe 1:

"Case Name" | "Machine Name" | "OS"    | "Exec Time" | "RSS"
------------ ---------------- --------- ------------- ------
      Case1 |         Linux1 |   Linux |          60 |  1000 

DataFrame 2:

"Case Name" | "Machine Name" | "OS"    | "Exec Time" | "RSS"
------------ ---------------- --------- ------------- ------
      Case1 |       Windows1 | Windows |          80 |   500 

As you can see, the "MAC" column is missing; in that case I'm certain (this is not an issue here either) that "Case Name" alone is a good enough unique key.

So, to build the combined data frame I tried something like this:

for index1, data1 in dataFrame1.iterrows():
    caseName       = data1['Case Name']
    try:
        macAddr        = data1['MAC']
    except KeyError:
        macAddr        = None
        
    # Let's see if pd.isnull() works fine when no MAC column exists
    if pd.isnull(macAddr):
        print("No MAC column data detected")
    else:
        print("MAC column data detected")
        
    # The rest of the data from the dataFrame1
    machineName1  = data1['Machine Name']
    os1           = data1['OS']
    # etc., etc.

    #then try to locate the equivalent data in the other data frame:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
    
    machineName2  = data2['Machine Name']
    os2           = data2['OS']
    # etc., etc.

As a C-based guy (and a very beginner at Python), I'd expect evaluation to stop as soon as a True condition is reached, in this case pd.isnull(macAddr), skipping the part that will certainly trigger an error, (dataFrame2['MAC'] == macAddr), since the column is missing. According to this, I'd expect so; however, it doesn't seem to happen in my case, and when I run it, my script returns:

caseName  = testCase
No MAC column data detected -> So pd.isnull() works fine!!!
Traceback (most recent call last):
  File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'MAC'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "~/compare_dataframes.py", line 135, in <module>
    main()
  File "~/compare_dataframes.py", line 79, in main
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
  File "~/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'MAC'
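For reference, here is a stripped-down reproduction of the same failure, independent of my script. Python's `or` short-circuits, but `|` (the operator pandas overloads for element-wise OR) evaluates both operands before it is applied, so the missing column blows up first:

```python
import pandas as pd

df = pd.DataFrame({"Case Name": ["Case1"]})  # no "MAC" column

# `or` short-circuits: the right-hand side is never evaluated.
result = True or df["MAC"]  # no KeyError, result is True

# `|` does not short-circuit: df["MAC"] is evaluated first and raises.
try:
    mask = True | (df["MAC"] == "x")
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'MAC'
```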

Now, I can change this to a series of nested if conditions:

if pd.isnull(macAddr):
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName), ['Machine Name', 'OS', 'Exec Time','RSS']]
else:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (dataFrame2['MAC'] == macAddr), ['Machine Name', 'OS', 'Exec Time','RSS']]

But this is impractical, since it becomes 2^n branches: what happens when I add a new column in the future?
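The kind of scalable lookup I'm after would build the mask only from key columns that actually exist, something like this untested sketch (the `lookup` helper and its `key` dict are my own invention, not from my script):

```python
import pandas as pd

def lookup(df, key):
    """Return the rows of df matching every key column df actually has.

    key is a dict like {"Case Name": "Case1", "MAC": None}; columns
    that are missing from df, or whose value is null, are skipped
    instead of being AND-ed into the mask.
    """
    mask = pd.Series(True, index=df.index)
    for col, val in key.items():
        if col in df.columns and not pd.isnull(val):
            mask &= df[col] == val
    return df.loc[mask]

df2 = pd.DataFrame([{"Case Name": "Case1", "Machine Name": "Windows1",
                     "OS": "Windows", "Exec Time": 80, "RSS": 500}])

# "MAC" is None here, so only "Case Name" is used as the key.
print(lookup(df2, {"Case Name": "Case1", "MAC": None}))
```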

So, my question is: What is wrong with that condition? I added parentheses everywhere I could, but to no effect.

I'm using Python 3.8 and Pandas 1.3.4.

Thanks a lot for your help.

CodePudding user response:

I think you might want to try a simple merge and, depending on whether MAC is present, add it to the merge fields, no?

merge_cols = ["Case Name"]
if "MAC" in df1.columns:
    merge_cols.append("MAC")

print(merge_cols)
result = df1.merge(df2, on=merge_cols, how='left')
print(result)
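To generalize this to any set of optional key columns, and to get the " 1"/" 2" suffixed columns from the question, one possible sketch (the `KEY_CANDIDATES` list is an assumption; fill it with your own fixed list of possible key columns):

```python
import pandas as pd

KEY_CANDIDATES = ["Case Name", "MAC"]  # all columns that could be part of the key

df1 = pd.DataFrame([{"Case Name": "Case1", "Machine Name": "Linux1",
                     "OS": "Linux", "Exec Time": 60, "RSS": 1000}])
df2 = pd.DataFrame([{"Case Name": "Case1", "Machine Name": "Windows1",
                     "OS": "Windows", "Exec Time": 80, "RSS": 500}])

# Keep only the key columns that are actually present.
merge_cols = [c for c in KEY_CANDIDATES if c in df1.columns]

# Suffix the overlapping non-key columns so both sides survive the merge.
result = df1.merge(df2, on=merge_cols, how="left", suffixes=(" 1", " 2"))
print(result.columns.tolist())
```

Since "MAC" is absent here, `merge_cols` collapses to `["Case Name"]`, and the result carries "Machine Name 1"/"Machine Name 2", "OS 1"/"OS 2", and so on.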

Does that answer your question?

CodePudding user response:

Try it this way: take the 1st dataframe and merge it with the 2nd.

Example:

Merged_df1 = df1.merge(df2, how="outer", on=["Case Name"])
Merged_df2 = df1.merge(df2, how="outer", on=["MAC"])

Then append these two dataframes:

appended_df = Merged_df1.append(Merged_df2)

Then drop duplicates

appended_df.drop_duplicates(subset=["Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"])

Note: in subset, list all the column names that are present in appended_df.
