how to apply a check on below scenarios using if else conditions?-CodePudding

I have a dataframe like this -Please refer the dataframe as in the image shown

There are four columns('status','preferred_time','history','id'), need to check if all the columns have some values in it or not, in the history column ,its a nested list in some cases, so need to specially check nested list has all mandatory keys 'branch','rank','discharge_status','service_start',job_code','post_intention' have values in it, and add a column named "output" in the dataframe if all the columns have values then name that as "completed" else "pending" if blank or NaN or [{}] in any column or history column has any missing key value pair.

From the image , only the first row should be in completed state rest should fall in pending.

Please help me out in building better if else situation here in this scenario. Thanks in advance.

Dict of above df image -

{'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'},
 'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']",
  1: [],
  2: "['Morning'] ",
  3: nan,
  4: "['Morning', 'Midday'] "},
 'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]",
  1: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]",
  2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank: 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]",
  3: nan,
  4: '[{}]'},
 'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}

I tried below lines of code - But I don't know how to check all the four columns in a single if statement -

for i in df.index:
    status = df['status'][i]
    preferred_time = df['preferred_time'][i]
    id = df['id'][i]
    history = df['history'][i]
    if status and preferred_time and id and status!='' and preferred_time!= '' and id!='':
        enroll_status = "completed"
    else:
        enroll_status = "pending"
    if history!= '' or str(history)!= '[{}]':
        for item in history:
            if 'branch' in item.keys() and'rank' in item.keys() and'discharge_status' in item.keys() and'service_start' in item.keys() and 'job_code' in item.keys() and 'post_intention' in item.keys():
                enroll_status = "completed"
            else:
                enroll_status = "pending"

CodePudding user response：

Consider the following:

import numpy as np
import pandas as pd
from numpy import nan

def check_list(L):
    if not isinstance(L,list):
        return False
    return all(k in d for k in keys_req for d in L)

labels = np.array(["pending","completed"])
keys_req = ['branch','rank','discharge_status','service_start','job_code','post_intention']
d = {'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'}, 'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']", 1: nan, 2: "['Morning'] ", 3: nan, 4: "['Morning', 'Midday'] "}, 'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]", 1: nan, 2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank': 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]", 3: nan, 4: '[{}]'}, 'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}

df = pd.DataFrame(d)
df['history_list'] = df['history'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df['mandatory_keys'] = df['history_list'].apply(check_list)
df['no_nans'] = ~pd.isna(df).any(axis = 1)
df['output_tf'] = df['mandatory_keys'] & df['no_nans']
df['output'] = labels[df['output_tf'].to_numpy(dtype=int)]

Note that I corrected some typos from your dataframe in my copied version of the dictionary d (for example, 'rank:'E7' was replaced with 'rank':'E7'). The incremental columns added (history_list, mandatory_keys, no_nans, output_tf) are there to make understanding the process that I applied here easier; it is not actually necessary to add these to the dataframe if, for example, you want to use as little space as possible. The script above results in the following dataframe df:

  status                      preferred_time  \
0     No  ['Morning', 'Midday', 'Afternoon']   
1     No                                 NaN   
2    NaN                        ['Morning']    
3     No                                 NaN   
4     No              ['Morning', 'Midday']    

                                             history  id  \
0  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...   1   
1                                                NaN   5   
2  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...   2   
3                                                NaN   3   
4                                               [{}]   4   

                                        history_list  mandatory_keys  no_nans  \
0  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...            True     True   
1                                                NaN           False    False   
2  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...           False    False   
3                                                NaN           False    False   
4                                               [{}]           False     True   

   output_tf     output  
0       True  completed  
1      False    pending  
2      False    pending  
3      False    pending  
4      False    pending

Here's a more parsimonious version (which doesn't add the unnecessary columns or store the extra "labels" variable).

import numpy as np
import pandas as pd
from numpy import nan

def check_list(L):
    if not isinstance(L,list):
        return False
    return all(k in d for k in keys_req for d in L)

keys_req = ['branch','rank','discharge_status','service_start','job_code','post_intention']
d = {'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'}, 'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']", 1: nan, 2: "['Morning'] ", 3: nan, 4: "['Morning', 'Midday'] "}, 'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]", 1: nan, 2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank': 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]", 3: nan, 4: '[{}]'}, 'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}    
df = pd.DataFrame(d)

df['output'] = np.array(["pending","completed"])[
    (df['history'].apply(lambda x: eval(x) if isinstance(x,str) else x)
                  .apply(check_list)
     & ~pd.isna(df).any(axis = 1)
     ).to_numpy(dtype=int)]