I have a dataframe 'df' where I would like to compare the list of column names to the values found under the dataframe titled 'set_cols' I have code that i previously used to compare a dictionary key to the df column name but i cant figure out how to make it work to compare a dataframe column heading values to another dataframe values under the heading
import pandas as pd
filename='template'
df= pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
'country', 'company', 'email', 'industry', 'System_Type__c',
'AccountType', 'customerSegment', 'Existing_Customer__c',
'GDPR_Email_Permission__c','persons name'])
data= ['firstName', 'lastName', 'state', 'Communication_Language__c',
'country', 'company', 'email', 'industry', 'System_Type__c',
'AccountType', 'customerSegment', 'Existing_Customer__c',
'GDPR_Email_Permission__c']
set_cols=pd.DataFrame(data, columns=['Numbers'])
errors= {}
errors[filename]={}
df_cols = df[list(df.columns)]
mask = df_cols.apply(lambda d: d.isin(set_cols[d.name]))
df_cols.mask(mask|df_cols.eq(' ')).stack()
for err_i, (r, v) in enumerate(df_cols.mask(mask|df_cols.eq(' ')).stack().iteritems()):
errors[filename][err_i] = {"column": r[1],
"message": r " is invalid column heading'}
in the errors dictionary I would expect an output something along the line of this:
{'column': 'person name', 'message': 'person name is an invalid column heading'}
How do i compare the heading column values of one data frame to a data frame with the values under the column?
CodePudding user response:
If I get you right the goal is to find the columns of 'df' that are absent in 'set_cols'.
In this case 2 sets substraction can be utilized.
import pandas as pd
filename='template'
df= pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
'country', 'company', 'email', 'industry', 'System_Type__c',
'AccountType', 'customerSegment', 'Existing_Customer__c',
'GDPR_Email_Permission__c','persons name'])
data= ['firstName', 'lastName', 'state', 'Communication_Language__c',
'country', 'company', 'email', 'industry', 'System_Type__c',
'AccountType', 'customerSegment', 'Existing_Customer__c',
'GDPR_Email_Permission__c']
set_cols=pd.DataFrame(data, columns=['Numbers'])
error_cols = set(df.columns) - set(set_cols.Numbers)
errors= {}
errors[filename]={}
for _col in error_cols:
errors[filename]["column"] = _col
errors[filename]["message"] = f"{_col} is an invalid column heading"
CodePudding user response:
It looks like a very complicated way to do:
diff = df.columns.difference(set_cols['Numbers'])
errors= {}
errors[filename] = dict(enumerate({'column': c,
'message': f'{c} is an invalid column heading'}
for c in diff))
print(errors)
output:
{'template': {0: {'column': 'persons name', 'message': 'persons name is an invalid column heading'}}}