Context: I receive a spreadsheet with numerous sheets. Each sheet references the same core columns but contains a different subset of the master sheet. How do I ensure each subset pairs the same ID with the same secondary and tertiary columns? This question assumes we already know how to load dataframes from Excel:
Ex.
import pandas as pd

df_subset = pd.DataFrame([[1000, 'Jerry', 'BR1', 'BR1'],
                          [1001, 'Sal', 'BR2', 'BR1'],
                          [1005, 'Buck', 'BR3', 'BR2'],
                          [1003, 'Perry', 'BR4', 'BR1']],
                         columns=['ID', 'Name', 'Branch', 'Member of'])
df_master = pd.DataFrame([[1000, 'Jerry', 'BR1', 'Black'],
                          [1001, 'Sal', 'BR2', 'Mocha'],
                          [1002, 'Buck', 'BR3', 'Frap'],
                          [1003, 'Perry', 'BR4', 'Black'],
                          [1004, 'Mike', 'BR5', 'Mach']],
                         columns=['ID', 'Name', 'Branch', 'Coffee'])
Each of the above test frames represents a sheet. df_master is the authoritative source of truth. How do I catch the error in df_subset (see "Buck")? Note the mismatched ID.
Expected output (open to suggestions here, but it needs to flag errors):
Master: 'Buck': 1002
Subset: 'Buck': 1005
If no mistakes, then no output.
UPDATE:
master_list = df_master[['ID', 'Name']].copy()
subset_list = df_subset[['ID', 'Name']].copy()
df_check = pd.concat([master_list, subset_list])
# drop_duplicates returns a new frame, so assign the result
df_check = df_check.drop_duplicates(keep=False)
The above code almost works, but it also shows records present in the master that are not used in any subset. It does show the errors -- should I maybe do a merge on ID, then drop_duplicates?
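For reference, here's what that merge idea might look like: merging on Name instead of ID means the default inner join silently drops master records that aren't in the subset, and a simple inequality flags the bad IDs. A sketch with the frames trimmed to the two relevant columns (it assumes Name is unique per sheet):

```python
import pandas as pd

df_subset = pd.DataFrame([[1000, 'Jerry'], [1001, 'Sal'],
                          [1005, 'Buck'], [1003, 'Perry']],
                         columns=['ID', 'Name'])
df_master = pd.DataFrame([[1000, 'Jerry'], [1001, 'Sal'],
                          [1002, 'Buck'], [1003, 'Perry'],
                          [1004, 'Mike']],
                         columns=['ID', 'Name'])

# inner merge keeps only Names present in both frames, so unused
# master records ('Mike') never show up as false positives
merged = df_master.merge(df_subset, on='Name',
                         suffixes=('_master', '_subset'))
errors = merged.loc[merged['ID_master'] != merged['ID_subset'],
                    ['Name', 'ID_master', 'ID_subset']]
if not errors.empty:
    print(errors)
```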
CodePudding user response:
Here's an approach using compare. Not sure if it's possible to have duplicate Name or Branch values with different IDs, so just in case, I'm using groupby.agg instead of set_index.
cols = ['Name', 'Branch']
for idx, col in enumerate(cols):
    usecols = cols[:idx + 1]
    df1 = (df_master.loc[df_master[usecols].apply(list, axis=1)
                         .isin(df_subset[usecols].apply(list, axis=1)),
                         usecols + ['ID']]
                    .groupby(usecols).agg(list))
    df2 = (df_subset.loc[df_subset[usecols].apply(list, axis=1)
                         .isin(df_master[usecols].apply(list, axis=1)),
                         usecols + ['ID']]
                    .groupby(usecols).agg(list))
    print(df1.compare(df2).droplevel(0, axis=1)
             .rename(columns={'self': 'Master', 'other': 'Subset'}))
Master Subset
Name
Buck [1002] [1005]
Master Subset
Name Branch
Buck BR3 [1002] [1005]
You can change usecols = cols[:idx + 1] to usecols = [col] if you'd rather take each column individually instead of cumulatively for the comparison index/key.
Master Subset
Name
Buck [1002] [1005]
Master Subset
Branch
BR3 [1002] [1005]
CodePudding user response:
I found a better solution using indexes and isin:
inconsistent_mask = (
    ~df_master.set_index(['ID', 'Name']).index
              .isin(df_subset.set_index(['ID', 'Name']).index)
    & df_master.Name.isin(df_subset.Name)
)
names = df_master.loc[inconsistent_mask, 'Name'].tolist()
master_ids = df_master.loc[inconsistent_mask, 'ID'].tolist()
subset_ids = df_subset.set_index('Name').loc[names, 'ID'].tolist()
report_df = pd.DataFrame({'Name': names, 'Master': master_ids, 'Subset': subset_ids})
Output:
>>> report_df
Name Master Subset
0 Buck 1002 1005
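To get the "if no mistakes, then no output" behavior from the question, the same idea can be wrapped in a small helper (check_ids is a hypothetical name; like the code above, it assumes Name is unique within each sheet):

```python
import pandas as pd

df_subset = pd.DataFrame([[1000, 'Jerry'], [1001, 'Sal'],
                          [1005, 'Buck'], [1003, 'Perry']],
                         columns=['ID', 'Name'])
df_master = pd.DataFrame([[1000, 'Jerry'], [1001, 'Sal'],
                          [1002, 'Buck'], [1003, 'Perry'],
                          [1004, 'Mike']],
                         columns=['ID', 'Name'])

def check_ids(df_master, df_subset, key='Name', id_col='ID'):
    """Report IDs that disagree for keys present in both frames."""
    m = df_master.set_index(key)[id_col]
    s = df_subset.set_index(key)[id_col]
    both = m.index.intersection(s.index)       # keys on both sides
    mismatch = m.loc[both].values != s.loc[both].values
    bad = both[mismatch]
    return pd.DataFrame({key: bad,
                         'Master': m.loc[bad].values,
                         'Subset': s.loc[bad].values})

report = check_ids(df_master, df_subset)
if not report.empty:          # silent when everything matches
    print(report)
```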
CodePudding user response:
Here's a lengthy solution that works (numpy is needed for np.nan):
import numpy as np

inconsistent = (pd.concat([df_subset.set_index('Name')['ID'],
                           df_master.set_index('Name')['ID']], axis=1)
                  .bfill(axis=1).ffill(axis=1).astype(int)
                  .diff(axis=1).iloc[:, -1]
                  .replace(0, np.nan).dropna()
                  .index.tolist())
Output:
>>> inconsistent
['Buck']
I'll try to come up with a better one though.
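In the meantime, the same logic reads more clearly split into steps (a sketch; like the one-liner, it assumes Name is unique per sheet):

```python
import pandas as pd

df_subset = pd.DataFrame([[1000, 'Jerry'], [1001, 'Sal'],
                          [1005, 'Buck'], [1003, 'Perry']],
                         columns=['ID', 'Name'])
df_master = pd.DataFrame([[1000, 'Jerry'], [1001, 'Sal'],
                          [1002, 'Buck'], [1003, 'Perry'],
                          [1004, 'Mike']],
                         columns=['ID', 'Name'])

mas = df_master.set_index('Name')['ID']
sub = df_subset.set_index('Name')['ID']
# align the two ID series on Name; the outer join leaves NaN
# where a name exists on only one side
pair = pd.concat([mas, sub], axis=1, keys=['Master', 'Subset'])
# a name only counts as inconsistent if it exists on both sides
# with different IDs, so drop one-sided rows first
pair = pair.dropna()
inconsistent = pair.index[pair['Master'] != pair['Subset']].tolist()
print(inconsistent)
```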