Pandas: best way to check integrity of dataframe columns across numerous copies


Context: I receive a spreadsheet with numerous sheets. Each sheet references the same core columns but contains a different subset of the master sheet. How do I ensure that each subset pairs the same ID with the same secondary and tertiary columns? This question assumes we already know how to load the dataframes from Excel; a quick sketch of that step follows anyway.
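For completeness, a minimal sketch of the loading step (the file name workbook.xlsx and the sheet name 'master' are placeholders):

import pandas as pd

# sheet_name=None loads every sheet at once as a {sheet_name: DataFrame} dict
sheets = pd.read_excel('workbook.xlsx', sheet_name=None)
df_master = sheets['master']
subset_frames = {name: df for name, df in sheets.items() if name != 'master'}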

Ex.

df_subset = pd.DataFrame([[1000, 'Jerry', 'BR1','BR1'], 
                         [1001, 'Sal', 'BR2', 'BR1'], 
                         [1005, 'Buck', 'BR3', 'BR2'],
                         [1003, 'Perry','BR4','BR1']],
                         columns=['ID', 'Name', 'Branch', 'Member of'])

df_master = pd.DataFrame([[1000, 'Jerry', 'BR1','Black'], 
                         [1001, 'Sal', 'BR2', 'Mocha'], 
                         [1002, 'Buck', 'BR3', 'Frap'],
                         [1003, 'Perry','BR4','Black'],
                         [1004, 'Mike', 'BR5', 'Mach']],
                         columns=['ID', 'Name', 'Branch', 'Coffee'])

Each of the above test frames represents a sheet; df_master is the authoritative source of truth. How do I catch the error in df_subset (see "Buck")? Note the mismatched ID: 1005 in the subset versus 1002 in the master.

Expected output (open to suggestions here, but it needs to flag errors):

Master: 'Buck': 1002
Subset: 'Buck': 1005

If no mistakes, then no output.

UPDATE:

master_list = df_master[['ID','Name']].copy()
subset_list = df_subset[['ID','Name']].copy()

df_check = pd.concat([master_list,subset_list])

# drop_duplicates returns a new frame rather than working in place, so the
# result has to be captured; keep=False drops every (ID, Name) row that
# appears in both frames, leaving only the mismatches
df_check = df_check.drop_duplicates(keep=False, ignore_index=False, subset=None)

The above code almost works, but it also shows records that are present in the master and not used in any subset (here, "Mike"). It does show the errors, though. Should I maybe do a merge on ID and then drop_duplicates? Something like the sketch below?
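One way to flesh out that idea (a minimal sketch, merging on Name rather than ID, since the IDs are exactly the values that may disagree; the inner merge restricts the check to names present in both frames):

# Put each name's master ID and subset ID side by side; the suffixes
# distinguish the two sources for the overlapping columns.
merged = df_master.merge(df_subset, on='Name', suffixes=('_master', '_subset'))

# Keep only the rows where the two IDs disagree.
errors = merged.loc[merged['ID_master'] != merged['ID_subset'],
                    ['Name', 'ID_master', 'ID_subset']]

if not errors.empty:  # if no mistakes, then no output
    print(errors)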

CodePudding user response:

Here's an approach using compare. Since DataFrame.compare requires both frames to have identical labels, each frame is first filtered with isin down to the keys the two have in common.

Not sure if it's possible to have duplicate Name or Branch values with different IDs, so just in case, I'm using groupby.agg (which collects each key's IDs into a list) instead of set_index.

cols = ['Name', 'Branch']
for idx, col in enumerate(cols):
    usecols = cols[:idx+1]
    # Hash the key columns row-wise as tuples (lists are unhashable, so
    # isin cannot build its lookup table from them) and restrict each
    # frame to the key combinations present in the other.
    mask1 = df_master[usecols].apply(tuple, axis=1).isin(df_subset[usecols].apply(tuple, axis=1))
    mask2 = df_subset[usecols].apply(tuple, axis=1).isin(df_master[usecols].apply(tuple, axis=1))
    # agg(list) collects each key's IDs so duplicate keys can't break compare
    df1 = df_master.loc[mask1, usecols + ['ID']].groupby(usecols).agg(list)
    df2 = df_subset.loc[mask2, usecols + ['ID']].groupby(usecols).agg(list)
    print(df1.compare(df2).droplevel(0, axis=1).rename(columns={'self': 'Master', 'other': 'Subset'}))

Output:
      Master  Subset
Name                
Buck  [1002]  [1005]

             Master  Subset
Name Branch                
Buck BR3     [1002]  [1005]

You can change usecols = cols[:idx+1] to usecols = [col] if you'd rather take each column individually instead of cumulatively for the comparison index/key:

      Master  Subset
Name                
Buck  [1002]  [1005]

        Master  Subset
Branch                
BR3     [1002]  [1005]

CodePudding user response:

I found a better solution using indexes and isin:

# Master rows whose (ID, Name) pair is absent from the subset, limited to
# names that do appear in the subset (so records merely missing from the
# subset are not flagged).
inconsistent_mask = (
    ~df_master.set_index(['ID', 'Name']).index
              .isin(df_subset.set_index(['ID', 'Name']).index)
    & df_master.Name.isin(df_subset.Name)
)

names = df_master.loc[inconsistent_mask, 'Name'].tolist()
master_ids = df_master.loc[inconsistent_mask, 'ID'].tolist()
subset_ids = df_subset.set_index('Name').loc[names, 'ID'].tolist()

report_df = pd.DataFrame({'Name': names, 'Master': master_ids, 'Subset': subset_ids})

Output:

>>> report_df
   Name  Master  Subset
0  Buck    1002    1005
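
Since the real workbook has numerous subset sheets, the same check can be wrapped in a small helper and run once per sheet, printing nothing when a sheet is clean. A sketch (the function name check_subset and the per-sheet loop are my own additions):

import pandas as pd

def check_subset(df_master, df_subset, sheet_name=''):
    # Master rows whose (ID, Name) pair is missing from this subset while
    # the Name itself does appear there.
    mask = (~df_master.set_index(['ID', 'Name']).index
                      .isin(df_subset.set_index(['ID', 'Name']).index)
            & df_master.Name.isin(df_subset.Name))
    if mask.any():
        names = df_master.loc[mask, 'Name'].tolist()
        print(f'Errors in sheet {sheet_name!r}:')
        print(pd.DataFrame({
            'Name': names,
            'Master': df_master.loc[mask, 'ID'].tolist(),
            'Subset': df_subset.set_index('Name').loc[names, 'ID'].tolist(),
        }))

# e.g. with every sheet loaded via pd.read_excel(..., sheet_name=None):
# for name, df in sheets.items():
#     if name != 'master':
#         check_subset(df_master, df, name)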

CodePudding user response:

Here's a lengthy solution that works:

import numpy as np

# Align the two ID columns by Name (filling each side's gap from the
# other), then keep the names where the master and subset IDs differ.
inconsistent = (pd.concat([df_subset.set_index('Name')['ID'],
                           df_master.set_index('Name')['ID']], axis=1)
                  .bfill(axis=1).ffill(axis=1).astype(int)
                  .diff(axis=1).iloc[:, -1]
                  .replace(0, np.nan).dropna()
                  .index.tolist())

Output:

>>> inconsistent
['Buck']

I'll try to come up with a better one though.
