I have a dataframe like this -Please refer the dataframe as in the image shown
There are four columns('status','preferred_time','history','id'), need to check if all the columns have some values in it or not, in the history column ,its a nested list in some cases, so need to specially check nested list has all mandatory keys 'branch','rank','discharge_status','service_start',job_code','post_intention' have values in it, and add a column named "output" in the dataframe if all the columns have values then name that as "completed" else "pending" if blank or NaN or [{}] in any column or history column has any missing key value pair.
From the image , only the first row should be in completed state rest should fall in pending.
Please help me out in building better if else situation here in this scenario. Thanks in advance.
Dict of above df image -
{'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'},
'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']",
1: [],
2: "['Morning'] ",
3: nan,
4: "['Morning', 'Midday'] "},
'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]",
1: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]",
2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank: 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]",
3: nan,
4: '[{}]'},
'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}
I tried below lines of code - But I don't know how to check all the four columns in a single if statement -
for i in df.index:
status = df['status'][i]
preferred_time = df['preferred_time'][i]
id = df['id'][i]
history = df['history'][i]
if status and preferred_time and id and status!='' and preferred_time!= '' and id!='':
enroll_status = "completed"
else:
enroll_status = "pending"
if history!= '' or str(history)!= '[{}]':
for item in history:
if 'branch' in item.keys() and'rank' in item.keys() and'discharge_status' in item.keys() and'service_start' in item.keys() and 'job_code' in item.keys() and 'post_intention' in item.keys():
enroll_status = "completed"
else:
enroll_status = "pending"
CodePudding user response:
Consider the following:
import numpy as np
import pandas as pd
from numpy import nan
def check_list(L):
if not isinstance(L,list):
return False
return all(k in d for k in keys_req for d in L)
labels = np.array(["pending","completed"])
keys_req = ['branch','rank','discharge_status','service_start','job_code','post_intention']
d = {'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'}, 'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']", 1: nan, 2: "['Morning'] ", 3: nan, 4: "['Morning', 'Midday'] "}, 'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]", 1: nan, 2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank': 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]", 3: nan, 4: '[{}]'}, 'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}
df = pd.DataFrame(d)
df['history_list'] = df['history'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df['mandatory_keys'] = df['history_list'].apply(check_list)
df['no_nans'] = ~pd.isna(df).any(axis = 1)
df['output_tf'] = df['mandatory_keys'] & df['no_nans']
df['output'] = labels[df['output_tf'].to_numpy(dtype=int)]
Note that I corrected some typos from your dataframe in my copied version of the dictionary d (for example, 'rank:'E7'
was replaced with 'rank':'E7'
). The incremental columns added (history_list, mandatory_keys, no_nans, output_tf) are there to make understanding the process that I applied here easier; it is not actually necessary to add these to the dataframe if, for example, you want to use as little space as possible. The script above results in the following dataframe df
:
status preferred_time \
0 No ['Morning', 'Midday', 'Afternoon']
1 No NaN
2 NaN ['Morning']
3 No NaN
4 No ['Morning', 'Midday']
history id \
0 [{'branch': 'A', 'rank': 'E7', 'discharge_stat... 1
1 NaN 5
2 [{'branch': 'A', 'rank': 'E7', 'discharge_stat... 2
3 NaN 3
4 [{}] 4
history_list mandatory_keys no_nans \
0 [{'branch': 'A', 'rank': 'E7', 'discharge_stat... True True
1 NaN False False
2 [{'branch': 'A', 'rank': 'E7', 'discharge_stat... False False
3 NaN False False
4 [{}] False True
output_tf output
0 True completed
1 False pending
2 False pending
3 False pending
4 False pending
Here's a more parsimonious version (which doesn't add the unnecessary columns or store the extra "labels" variable).
import numpy as np
import pandas as pd
from numpy import nan
def check_list(L):
if not isinstance(L,list):
return False
return all(k in d for k in keys_req for d in L)
keys_req = ['branch','rank','discharge_status','service_start','job_code','post_intention']
d = {'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'}, 'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']", 1: nan, 2: "['Morning'] ", 3: nan, 4: "['Morning', 'Midday'] "}, 'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]", 1: nan, 2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank': 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]", 3: nan, 4: '[{}]'}, 'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}
df = pd.DataFrame(d)
df['output'] = np.array(["pending","completed"])[
(df['history'].apply(lambda x: eval(x) if isinstance(x,str) else x)
.apply(check_list)
& ~pd.isna(df).any(axis = 1)
).to_numpy(dtype=int)]