Pandas - How to make sure that columns missing from a DataFrame are created and set to null

I have the DataFrame column names below in a list, which I use to make sure the columns are in the correct order:

import pandas as pd

column_header = ['blast_id', 'labels', 'name', 'subject', 'list', 'mode', 'copy_template', 'stats',
                          'start_time', 'modify_time', 'schedule_time', 'email_count']


df = df[column_header]

But df is missing some of these columns, such as labels and name. How do I make sure that any column from column_header that is missing gets added and filled with null values?

CodePudding user response：

Assuming this input:

   copy_template  list  modify_time  blast_id  name  stats  mode
0              1     1            1         1     1      1     1
1              2     2            2         2     2      2     2

You need to reindex:

column_header = ['blast_id', 'labels', 'name', 'subject', 'list', 'mode', 'copy_template', 'stats',
                          'start_time', 'modify_time', 'schedule_time', 'email_count']

df.reindex(columns=column_header)

output:

   blast_id  labels  name  subject  list  mode  copy_template  stats  start_time  modify_time  schedule_time  email_count
0         1     NaN     1      NaN     1     1              1      1         NaN            1            NaN          NaN
1         2     NaN     2      NaN     2     2              2      2         NaN            2            NaN          NaN
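As a minimal, self-contained sketch of the reindex approach (the sample values are illustrative and simply mirror the input above): reindex returns a new DataFrame rather than modifying df in place, so the result has to be assigned. Note also that any column in df that is not listed in column_header is dropped from the result.

```python
import pandas as pd

# Sample frame mirroring the input shown above (illustrative values).
df = pd.DataFrame({
    "copy_template": [1, 2],
    "list": [1, 2],
    "modify_time": [1, 2],
    "blast_id": [1, 2],
    "name": [1, 2],
    "stats": [1, 2],
    "mode": [1, 2],
})

column_header = ['blast_id', 'labels', 'name', 'subject', 'list', 'mode',
                 'copy_template', 'stats', 'start_time', 'modify_time',
                 'schedule_time', 'email_count']

# reindex does not modify df; it returns a new frame with exactly these
# columns, in this order. Columns absent from df are filled with NaN.
result = df.reindex(columns=column_header)

print(result)
```

If a default other than NaN is wanted, reindex also accepts a fill_value argument, for example df.reindex(columns=column_header, fill_value=0).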

CodePudding user response：

You can check for any missing columns and then set them to NaN for all rows:

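# use NumPy directly; the pd.np alias was removed in pandas 2.0
import numpy as np

# identify the expected columns that df does not have
missing_cols = [
    col for col in column_header
    if col not in df.columns
]

# fill in each one with NaN
for col in missing_cols:
    df[col] = np.nan

# select the columns in the expected order
df = df[column_header]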

or in a really short piece of code:

import numpy as np

for col in column_header:
    if col not in df.columns:
        df[col] = np.nan

CodePudding user response：

One-liner:

import numpy as np

df[pd.Index(column_header).difference(df.columns)] = np.nan

The new columns are appended at the end of the frame. To put all the columns in the proper order, do this instead:

missing = pd.Index(column_header).difference(df.columns)
df[missing] = np.nan
df = df[column_header]
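The same idea can be wrapped in a small function that leaves the original frame untouched. This is only a sketch: with_all_columns is a hypothetical helper name, not a pandas function, and the two-column sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def with_all_columns(df, columns):
    """Return a copy of df with exactly `columns`, in that order.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()  # work on a copy so the caller's frame is not modified
    # sort=False keeps the missing names in the order given by `columns`
    missing = pd.Index(columns).difference(out.columns, sort=False)
    for col in missing:
        out[col] = np.nan  # add each missing column, filled with NaN
    # select by the full list to enforce the requested order
    return out[list(columns)]


df = pd.DataFrame({"name": [1, 2], "blast_id": [1, 2]})
wanted = ["blast_id", "labels", "name", "subject"]

fixed = with_all_columns(df, wanted)
print(fixed)
```

For this use case the result should match df.reindex(columns=wanted); the explicit version is mainly useful when the missing columns need different defaults or dtypes.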