Pandas: best way to check integrity of dataframe columns across numerous copies


Context: I receive a spreadsheet with numerous sheets. Each sheet references the same core columns but contains a different subset of the master sheet. How do I ensure that each subset pairs the same ID with the same secondary and tertiary columns? This question assumes we already know how to load the dataframes from Excel; a quick sketch of that step follows anyway.
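For completeness, a minimal sketch of the loading step (the file name workbook.xlsx and the sheet name 'master' are placeholders):

import pandas as pd

# sheet_name=None loads every sheet at once as a {sheet_name: DataFrame} dict
sheets = pd.read_excel('workbook.xlsx', sheet_name=None)
df_master = sheets['master']
subset_frames = {name: df for name, df in sheets.items() if name != 'master'}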

Ex.

df_subset = pd.DataFrame([[1000, 'Jerry', 'BR1','BR1'], 
                         [1001, 'Sal', 'BR2', 'BR1'], 
                         [1005, 'Buck', 'BR3', 'BR2'],
                         [1003, 'Perry','BR4','BR1']],
                         columns=['ID', 'Name', 'Branch', 'Member of'])

df_master = pd.DataFrame([[1000, 'Jerry', 'BR1','Black'], 
                         [1001, 'Sal', 'BR2', 'Mocha'], 
                         [1002, 'Buck', 'BR3', 'Frap'],
                         [1003, 'Perry','BR4','Black'],
                         [1004, 'Mike', 'BR5', 'Mach']],
                         columns=['ID', 'Name', 'Branch', 'Coffee'])

Each of the above test frames represents a sheet; df_master is the authoritative source of truth. How do I catch the error in df_subset (see "Buck")? Note the mismatched ID: 1005 in the subset versus 1002 in the master.

Expected output (open to suggestions here, but it needs to flag errors):

Master: 'Buck': 1002
Subset: 'Buck': 1005

If no mistakes, then no output.

UPDATE:

master_list = df_master[['ID','Name']].copy()
subset_list = df_subset[['ID','Name']].copy()

df_check = pd.concat([master_list,subset_list])

# drop_duplicates returns a new frame rather than working in place, so the
# result has to be captured; keep=False drops every (ID, Name) row that
# appears in both frames, leaving only the mismatches
df_check = df_check.drop_duplicates(keep=False, ignore_index=False, subset=None)

The above code almost works, but it also shows records that are present in the master and not used in any subset (here, "Mike"). It does show the errors, though. Should I maybe do a merge on ID and then drop_duplicates? Something like the sketch below?
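One way to flesh out that idea (a minimal sketch, merging on Name rather than ID, since the IDs are exactly the values that may disagree; the inner merge restricts the check to names present in both frames):

# Put each name's master ID and subset ID side by side; the suffixes
# distinguish the two sources for the overlapping columns.
merged = df_master.merge(df_subset, on='Name', suffixes=('_master', '_subset'))

# Keep only the rows where the two IDs disagree.
errors = merged.loc[merged['ID_master'] != merged['ID_subset'],
                    ['Name', 'ID_master', 'ID_subset']]

if not errors.empty:  # if no mistakes, then no output
    print(errors)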

CodePudding user response:

Here's an approach using compare. Since DataFrame.compare requires both frames to have identical labels, each frame is first filtered with isin down to the keys the two have in common.

Not sure if it's possible to have duplicate Name or Branch values with different IDs, so just in case, I'm using groupby.agg (which collects each key's IDs into a list) instead of set_index.

cols = ['Name', 'Branch']
for idx, col in enumerate(cols):
    usecols = cols[:idx+1]
    # Hash the key columns row-wise as tuples (lists are unhashable, so
    # isin cannot build its lookup table from them) and restrict each
    # frame to the key combinations present in the other.
    mask1 = df_master[usecols].apply(tuple, axis=1).isin(df_subset[usecols].apply(tuple, axis=1))
    mask2 = df_subset[usecols].apply(tuple, axis=1).isin(df_master[usecols].apply(tuple, axis=1))
    # agg(list) collects each key's IDs so duplicate keys can't break compare
    df1 = df_master.loc[mask1, usecols + ['ID']].groupby(usecols).agg(list)
    df2 = df_subset.loc[mask2, usecols + ['ID']].groupby(usecols).agg(list)
    print(df1.compare(df2).droplevel(0, axis=1).rename(columns={'self': 'Master', 'other': 'Subset'}))

Output:
      Master  Subset
Name                
Buck  [1002]  [1005]

             Master  Subset
Name Branch                
Buck BR3     [1002]  [1005]

You can change usecols = cols[:idx+1] to usecols = [col] if you'd rather take each column individually instead of cumulatively for the comparison index/key:

      Master  Subset
Name                
Buck  [1002]  [1005]

        Master  Subset
Branch                
BR3     [1002]  [1005]

CodePudding user response:

I found a better solution using indexes and isin:

# Master rows whose (ID, Name) pair is absent from the subset, limited to
# names that do appear in the subset (so records merely missing from the
# subset are not flagged).
inconsistent_mask = (
    ~df_master.set_index(['ID', 'Name']).index
              .isin(df_subset.set_index(['ID', 'Name']).index)
    & df_master.Name.isin(df_subset.Name)
)

names = df_master.loc[inconsistent_mask, 'Name'].tolist()
master_ids = df_master.loc[inconsistent_mask, 'ID'].tolist()
subset_ids = df_subset.set_index('Name').loc[names, 'ID'].tolist()

report_df = pd.DataFrame({'Name': names, 'Master': master_ids, 'Subset': subset_ids})

Output:

>>> report_df
   Name  Master  Subset
0  Buck    1002    1005
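
Since the real workbook has numerous subset sheets, the same check can be wrapped in a small helper and run once per sheet, printing nothing when a sheet is clean. A sketch (the function name check_subset and the per-sheet loop are my own additions):

import pandas as pd

def check_subset(df_master, df_subset, sheet_name=''):
    # Master rows whose (ID, Name) pair is missing from this subset while
    # the Name itself does appear there.
    mask = (~df_master.set_index(['ID', 'Name']).index
                      .isin(df_subset.set_index(['ID', 'Name']).index)
            & df_master.Name.isin(df_subset.Name))
    if mask.any():
        names = df_master.loc[mask, 'Name'].tolist()
        print(f'Errors in sheet {sheet_name!r}:')
        print(pd.DataFrame({
            'Name': names,
            'Master': df_master.loc[mask, 'ID'].tolist(),
            'Subset': df_subset.set_index('Name').loc[names, 'ID'].tolist(),
        }))

# e.g. with every sheet loaded via pd.read_excel(..., sheet_name=None):
# for name, df in sheets.items():
#     if name != 'master':
#         check_subset(df_master, df, name)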

CodePudding user response:

Here's a lengthy solution that works:

import numpy as np

# Align the two ID columns by Name (filling each side's gap from the
# other), then keep the names where the master and subset IDs differ.
inconsistent = (pd.concat([df_subset.set_index('Name')['ID'],
                           df_master.set_index('Name')['ID']], axis=1)
                  .bfill(axis=1).ffill(axis=1).astype(int)
                  .diff(axis=1).iloc[:, -1]
                  .replace(0, np.nan).dropna()
                  .index.tolist())

Output:

>>> inconsistent
['Buck']

I'll try to come up with a better one though.
