I have four lists of standard values that I would to compare to a corresponding column.
Here are the four lists:
industry_US=['Automotive', 'Retail', 'Gas', 'Other']
industry_EU=['Real Estate', 'Transport', 'Mining']
systype= ['Access', 'Video System']
accttype=['Commercial','Reseller','Small']
Here is my Dataframe:
data= [['Automotive','Education','Enterance','Commercial'],['Gas','Hotels','Access',' '],['Healthcare'," " ,'Video System','Reseller']]
df_test=pd.DataFrame(data,columns=['Industry_US','Industry_EU','System Type','Account Type'])
Here is my desired output
{
'Filename' : {error found in 'System Type', 'The value entered in System Type is not a standard value'}
}
This is what I have so far for this process but I am not sure how to write everything to a double nested dictionary.
import pandas as pd
dict= {'file name': {'error':'description of the error'}}
data= [['Automotive','Education','Enterance','Commercial'],['Gas','Hotels','Access',' '],['Healthcare'," " ,'Video System','Reseller']]
df=pd.DataFrame(data,columns=['Industry_US','Industry_EU','System Type','Account Type'])
industry_US=['Automotive', 'Retail', 'Gas', 'Other']
industry_EU=['Real Estate', 'Transport', 'Mining']
systype= ['Access', 'Video System']
accttype=['Commercial','Reseller','Small']
col_list=['industry_US','industry_EU','systype', 'accttype']
if col_list.isin df:
return None
else:
dict.update({col_list : 'the value entered does not match the standard values'})
I would like to compare these four columns to the four list using a function to keep the code as efficient as possible. If the values in the columns match the value in the list nothing needs to happen but if the values don't match put the non-matching values in a nested dictionary. I am not concerned about the NaN values so those can stay out of the dictionary.
With the code I have I get errors like AttributeError: 'str' object has no attribute 'isin' Any suggestion on how to efficiently compare multiple lists to multiple df columns?
CodePudding user response:
I don't fully understand which output you want, but you can easily get the list of invalid values with:
valid = {'Industry_US': ['Automotive', 'Retail', 'Gas', 'Other'],
'Industry_EU': ['Real Estate', 'Transport', 'Mining'],
'System Type': ['Access', 'Video System'],
'Account Type': ['Commercial', 'Reseller', 'Small']}
mask = df_test.apply(lambda c: c.isin(valid[c.name]))
df_test.mask(mask|df_test.eq(' ')).stack()
output:
0 Industry_EU Education
System Type Enterance
1 Industry_EU Hotels
2 Industry_US Healthcare
dtype: object
displaying an error message:
for i, v in df_test.mask(mask|df_test.eq(' ')).stack().iteritems():
print(f'error found in row "{i[0]}", column "{i[1]}": "{v}" is invalid')
output:
error found in row "0", column "Industry_EU": "Education" is invalid
error found in row "0", column "System Type": "Enterance" is invalid
error found in row "1", column "Industry_EU": "Hotels" is invalid
error found in row "2", column "Industry_US": "Healthcare" is invalid