Home > database >  comparing a dataframe column to the values found in another dataframe
comparing a dataframe column to the values found in another dataframe

Time:10-20

I have a dataframe 'df' where I would like to compare the list of column names to the values found under the dataframe titled 'set_cols' I have code that i previously used to compare a dictionary key to the df column name but i cant figure out how to make it work to compare a dataframe column heading values to another dataframe values under the heading

import pandas as pd

filename='template'
 
df= pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
       'country', 'company', 'email', 'industry', 'System_Type__c',
       'AccountType', 'customerSegment', 'Existing_Customer__c',
       'GDPR_Email_Permission__c','persons name'])

data= ['firstName', 'lastName', 'state', 'Communication_Language__c',
       'country', 'company', 'email', 'industry', 'System_Type__c',
       'AccountType', 'customerSegment', 'Existing_Customer__c',
       'GDPR_Email_Permission__c']

set_cols=pd.DataFrame(data, columns=['Numbers'])

errors= {}
errors[filename]={}

df_cols = df[list(df.columns)]
mask = df_cols.apply(lambda d: d.isin(set_cols[d.name]))
df_cols.mask(mask|df_cols.eq(' ')).stack()
for err_i, (r, v) in enumerate(df_cols.mask(mask|df_cols.eq(' ')).stack().iteritems()):
    errors[filename][err_i] = {"column": r[1],
                               "message": r   " is invalid column heading'}

in the errors dictionary I would expect an output something along the line of this:

{'column': 'person name', 'message': 'person name is an invalid column heading'}

How do i compare the heading column values of one data frame to a data frame with the values under the column?

CodePudding user response:

If I get you right the goal is to find the columns of 'df' that are absent in 'set_cols'.

In this case 2 sets substraction can be utilized.

import pandas as pd

filename='template'
 
df= pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
       'country', 'company', 'email', 'industry', 'System_Type__c',
       'AccountType', 'customerSegment', 'Existing_Customer__c',
       'GDPR_Email_Permission__c','persons name'])

data= ['firstName', 'lastName', 'state', 'Communication_Language__c',
       'country', 'company', 'email', 'industry', 'System_Type__c',
       'AccountType', 'customerSegment', 'Existing_Customer__c',
       'GDPR_Email_Permission__c']

set_cols=pd.DataFrame(data, columns=['Numbers'])

error_cols = set(df.columns) - set(set_cols.Numbers)

errors= {}
errors[filename]={}

for _col in error_cols:
    errors[filename]["column"] = _col
    errors[filename]["message"] = f"{_col} is an invalid column heading"

CodePudding user response:

It looks like a very complicated way to do:

diff = df.columns.difference(set_cols['Numbers'])

errors= {}
errors[filename] = dict(enumerate({'column': c,
                                   'message': f'{c} is an invalid column heading'}
                                  for c in diff))

print(errors)

output:

{'template': {0: {'column': 'persons name', 'message': 'persons name is an invalid column heading'}}}
  • Related