I have the DataFrame columns below in a list, which I use to make sure the columns are in the correct order:
import pandas as pd
column_header = ['blast_id', 'labels', 'name', 'subject', 'list', 'mode', 'copy_template', 'stats',
'start_time', 'modify_time', 'schedule_time', 'email_count']
df = df[column_header]
but the df is missing some columns, like labels and name. How do I make sure that any column from column_header that is missing gets added and filled with null values?
CodePudding user response:
Assuming this input:
   copy_template  list  modify_time  blast_id  name  stats  mode
0              1     1            1         1     1     1     1
1              2     2            2         2     2     2     2
You need to reindex:
column_header = ['blast_id', 'labels', 'name', 'subject', 'list', 'mode', 'copy_template', 'stats',
'start_time', 'modify_time', 'schedule_time', 'email_count']
df.reindex(columns=column_header)
output:
   blast_id  labels  name  subject  list  mode  copy_template  stats  start_time  modify_time  schedule_time  email_count
0         1     NaN     1      NaN     1     1              1      1         NaN            1            NaN          NaN
1         2     NaN     2      NaN     2     2              2      2         NaN            2            NaN          NaN
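A minimal runnable sketch of the above, assuming a small made-up frame with the same seven columns as the input. Note that reindex returns a new DataFrame rather than modifying df in place, so the result has to be assigned back:

```python
import pandas as pd

column_header = ['blast_id', 'labels', 'name', 'subject', 'list', 'mode', 'copy_template', 'stats',
                 'start_time', 'modify_time', 'schedule_time', 'email_count']

# Hypothetical sample data matching the input shown above
df = pd.DataFrame({
    'copy_template': [1, 2], 'list': [1, 2], 'modify_time': [1, 2],
    'blast_id': [1, 2], 'name': [1, 2], 'stats': [1, 2], 'mode': [1, 2],
})

# reindex returns a new frame: columns are put in the requested order,
# and any column that did not exist is created and filled with NaN
df = df.reindex(columns=column_header)

print(list(df.columns) == column_header)  # column order check
print(df['labels'].isna().all())          # a newly added column is all NaN
```

If a different placeholder is wanted for the new columns, reindex also accepts a fill_value argument.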
CodePudding user response:
You can check for any missing columns and then set them to NaN for all rows:
import numpy as np
# identify any cols that we don't have
missing_cols = [
    col for col in column_header
    if col not in df.columns
]
# fill in each one with NaN
for col in missing_cols:
    df[col] = np.nan
# put the columns in the desired order
df = df[column_header]
or in a really short piece of code:
for col in column_header:
    if col not in df.columns:
        df[col] = np.nan
CodePudding user response:
One-liner (assuming numpy is imported as np):
df[pd.Index(column_header).difference(df.columns)] = np.nan
To make sure the columns are all in the proper order, do this instead:
missing = pd.Index(column_header).difference(df.columns)
df[missing] = np.nan
df = df[column_header]
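As a quick sanity check, here is a small sketch comparing this approach with reindex. The two-row frame and the shortened column_header are made up for illustration, and assigning a scalar to several new columns at once assumes a reasonably recent pandas version:

```python
import numpy as np
import pandas as pd

# Shortened header list, for illustration only
column_header = ['blast_id', 'labels', 'name', 'subject']

# Hypothetical frame that is missing 'labels' and 'subject'
df = pd.DataFrame({'name': ['a', 'b'], 'blast_id': [1, 2]})

# Approach from this answer: add the missing columns, then reorder
filled = df.copy()
missing = pd.Index(column_header).difference(filled.columns)
filled[missing] = np.nan
filled = filled[column_header]

# reindex does both steps in a single call
reindexed = df.reindex(columns=column_header)

# Compare the column order and the pattern of missing values
print(filled.columns.equals(reindexed.columns))
print(filled.isna().equals(reindexed.isna()))
```

Both checks should print True, so the choice between the two is mostly a matter of taste.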