Check the null for columns first_name and last_name, if found remove those rows from df and put into-CodePudding

If in the input data first_name and last_name is null then remove those records from df and put into new dataframe error df with extra columns "rejeted_reason": ['first_name,'last_name] is empty.

Input Data:

customer_number|first_name|middle_name|last_name|gender
90617174||Aliari||Male
92154246|Roberto||Intriago Nunez|Male
07605348|E|A|Christodoulou|Male
80284242|Ritchie|O||Male

Error File :

customer_number|first_name|middle_name|last_name|gender|rejection_reason
90617174||Aliari||Male|["first_name","last_name] is empty
80284242|Ritchie|O||Male|["last_name"] is empty

Output File:

customer_number|first_name|middle_name|last_name|gender
92154246|Roberto||Intriago Nunez|Male
07605348|E|A|Christodoulou|Male

Code Tried:

newList = ['first_name','last_name']
for index,row in df.iterrows():
    error_col = []
    temp_dic = []
    for col in newList:
        if (row[col] == '' or pd.isna(row[col]) or pd.isnull(row[col])):
            error_col.append(col)
            row["rejection_reason"] = col   ' is empty'
            df.drop(index, inplace=True)
            temp_dic.append(row)
    print("temp dic:", temp_dic)

Error

raise KeyError(f"{labels[mask]} not found in axis")
KeyError: '[0] not found in axis'

CodePudding user response：

Since you drop the row each time you see empty column in that row, you are removing a row more than once. So after removing the row for the first time, you see KeyError. This should work:

newList = ['first_name','last_name']

temp_dic = []
for index,row in df.iterrows():
    error_col = []
    for col in newList:
        if (row[col] == '' or pd.isna(row[col]) or pd.isnull(row[col])):
            error_col.append(col)
    
    if len(error_col) > 0:
        df.drop(index, inplace=True)
        temp_dic.append(row)
        row["rejection_reason"] = str(error_col)   ' is empty'

print("temp dic:", temp_dic)

But I recommend doing this instead of processing each row:

newList = ['first_name','last_name']

def check_columns(row):
    error_col = []
    for col in newList:
        if (row[col] == '' or pd.isna(row[col]) or pd.isnull(row[col])):
            error_col.append(col)
    
    if len(error_col) > 0:
        return str(error_col)   ' is empty'
    else:
        return ''

df['rejection_reason'] = df.apply(check_columns, axis=1)
df_error = df[df['rejection_reason'] != '']
df_output = df[df['rejection_reason'] == '']

CodePudding user response：

given the input datframe df, in order to filter out the rows where 'last_name' or 'first_name' are null the following will work:

filter_df = df[~(df['first_name'].isnull() | df['last_name'].isnull())]

In order to create the error dataframe containing rows that have 'last_name' or 'first_name' null and the corresponding error the following code will work:

error_df = df[(df['first_name'].isnull() | df['last_name'].isnull())].copy()
error_df.loc[error_df['first_name'].isnull(), "rejeted_reason"] = "['first_name'] is empty."
error_df.loc[error_df['last_name'].isnull(), "rejeted_reason"] = "['last_name'] is empty."
error_df.loc[(error_df['first_name'].isnull() & error_df['last_name'].isnull()), "rejeted_reason"] = "['first_name', 'last_name'] is empty."

Output of filter_df given the above input:

Output of error_df given the above input: